ENERGY, ENTROPY AND INFORMATION POTENTIAL FOR NEURAL COMPUTATION
By
DONGXIN XU
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF
THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
1999
To My Parents
ACKNOWLEDGEMENTS

[A Chinese poem appears here in the original.]

This Chinese poem exactly expresses my feelings and experience in four years' Ph.D. study. During this period, there have been difficulties encountered both in the course of my research and in my daily life. Just as the poem says, there are always hopes in spite of difficulties. Looking back on the past, I would like to express my gratitude to the individuals who brought me hope and light and guided me through the darkness.

First, I would like to thank my advisor, Dr. José Principe, for providing me with the wonderful opportunity to be a Ph.D. student in CNEL. Its excellent environment helped me a lot when I first came here. I was impressed by Dr. Principe's active thought and appreciated very much his style of supervision, which gives students a lot of space to explore on their own. I am grateful for his introducing me to the area of information-theoretic learning and for the guidance throughout the development of this dissertation.

I would also like to thank my committee members Dr. John Harris, Dr. Donald Childers, Dr. Jacob Hammer, Dr. Mark Yang and Dr. Tan Wong for the guidance and discussion they provided. Their comments were critical and constructive.

Special thanks go to John Fisher for introducing his work to me, which actually inspired this work. Special thanks also go to Chuan Wang for introducing me to CNEL and for the friendship he provided. The discussions with Hsiao-Chun Wu were fruitful, and special thanks are due to him as well. I would also like to thank the other CNEL fellows. The list includes, but is not limited to, Likang Yen, Craig Fancourt, Frank Candocia and Qun Zhao, for their help and friendship.
I would like to thank my brother, sister and my friend Yuan Yao for their constant love,
support and encouragement.
Finally, I would like to thank my wife, Shu, for her love, support, patience and sacri-
fice, which made this dissertation possible.
TABLE OF CONTENTS

ACKNOWLEDGEMENTS

ABSTRACT

CHAPTERS

1 INTRODUCTION
  1.1 Information and Energy: A Brief Review
  1.2 Motivation
  1.3 Outline

2 ENERGY, ENTROPY AND INFORMATION POTENTIAL
  2.1 Energy, Entropy and Information of Signals
    2.1.1 Energy of Signals
    2.1.2 Information Entropy
    2.1.3 Geometrical Interpretation of Entropy
    2.1.4 Mutual Information
    2.1.5 Quadratic Mutual Information
    2.1.6 Geometrical Interpretation of Mutual Information
    2.1.7 Energy and Entropy for Gaussian Signal
    2.1.8 Cross-Correlation and Mutual Information for Gaussian Signal
  2.2 Empirical Energy, Entropy and MI: Problem and Literature Review
    2.2.1 Empirical Energy
    2.2.2 Empirical Entropy and Mutual Information: The Problem
    2.2.3 Nonparametric Density Estimation
    2.2.4 Empirical Entropy and Mutual Information: The Literature Review
  2.3 Quadratic Entropy and Information Potential
    2.3.1 The Development of Information Potential
    2.3.2 Information Force (IF)
    2.3.3 The Calculation of Information Potential and Force
  2.4 Quadratic Mutual Information and Cross Information Potential
    2.4.1 QMI and Cross Information Potential (CIP)
    2.4.2 Cross Information Forces (CIF)
    2.4.3 An Explanation to QMI

3 LEARNING FROM EXAMPLES
  3.1 Learning System
    3.1.1 Static Models
    3.1.2 Dynamic Models
  3.2 Learning Mechanisms
    3.2.1 Learning Criteria
    3.2.2 Optimization Techniques
  3.3 General Point of View
    3.3.1 InfoMax Principle
    3.3.2 Other Similar Information-Theoretic Schemes
    3.3.3 A General Scheme
    3.3.4 Learning as Information Transmission Layer-by-Layer
    3.3.5 Information Filtering: Filtering beyond Spectrum
  3.4 Learning by Information Force
  3.5 Discussion of Generalization by Learning

4 LEARNING WITH ON-LINE LOCAL RULE: A CASE STUDY ON GENERALIZED EIGENDECOMPOSITION
  4.1 Energy, Correlation and Decorrelation for Linear Model
    4.1.1 Signal Power, Quadratic Form, Correlation, Hebbian and Anti-Hebbian Learning
    4.1.2 Lateral Inhibition Connections, Anti-Hebbian Learning and Decorrelation
  4.2 Eigendecomposition and Generalized Eigendecomposition
    4.2.1 The Information-Theoretic Formulation for Eigendecomposition and Generalized Eigendecomposition
    4.2.2 The Formulation of Eigendecomposition and Generalized Eigendecomposition Based on the Energy Measures
  4.3 The On-line Local Rule for Eigendecomposition
    4.3.1 Oja's Rule and the First Projection
    4.3.2 Geometrical Explanation to Oja's Rule
    4.3.3 Sanger's Rule and the Other Projections
    4.3.4 APEX Model: The Local Implementation of Sanger's Rule
  4.4 An Iterative Method for Generalized Eigendecomposition
  4.5 An On-line Local Rule for Generalized Eigendecomposition
    4.5.1 The Proposed Learning Rule for the First Projection
    4.5.2 The Proposed Learning Rules for the Other Connections
  4.6 Simulations
  4.7 Conclusion and Discussion

5 APPLICATIONS
  5.1 Aspect Angle Estimation for SAR Imagery
    5.1.1 Problem Description
    5.1.2 Problem Formulation
    5.1.3 Experiments of Aspect Angle Estimation
    5.1.4 Occlusion Test on Aspect Angle Estimation
  5.2 Automatic Target Recognition (ATR)
    5.2.1 Problem Description and Formulation
    5.2.2 Experiment and Result
  5.3 Training MLP Layer-by-Layer with CIP
  5.4 Blind Source Separation and Independent Component Analysis
    5.4.1 Problem Description and Formulation
    5.4.2 Blind Source Separation with CS-QMI (CS-CIP)
    5.4.3 Blind Source Separation by Maximizing Quadratic Entropy
    5.4.4 Blind Source Separation with ED-QMI (ED-CIP) and MiniMax Method

6 CONCLUSIONS AND FUTURE WORK

APPENDICES
A THE INTEGRATION OF THE PRODUCT OF GAUSSIAN KERNELS
B SHANNON ENTROPY OF MULTI-DIMENSIONAL GAUSSIAN VARIABLE
C RENYI ENTROPY OF MULTI-DIMENSIONAL GAUSSIAN VARIABLE
D H-C ENTROPY OF MULTI-DIMENSIONAL GAUSSIAN VARIABLE

REFERENCES

BIOGRAPHICAL SKETCH
ABSTRACT

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

ENERGY, ENTROPY AND INFORMATION POTENTIAL FOR NEURAL COMPUTATION

By

Dongxin Xu

May 1999

Chairman: Dr. José C. Principe
Major Department: Electrical and Computer Engineering

The major goal of this research is to develop general nonparametric methods for the estimation of entropy and mutual information, giving a unifying point of view for their use in signal processing and neural computation. In many real world problems, the information is carried solely by data samples without any other a priori knowledge. The central issue of "learning from examples" is to estimate energy, entropy or mutual information of a variable only from its samples and adapt the system parameters by optimizing a criterion based on the estimation.

By using alternative entropy measures such as Renyi's quadratic entropy, coupled with the Parzen window estimation of the probability density function for data samples, we developed an "information potential" method for entropy estimation. In this method, data samples are treated as physical particles, and the entropy turns out to be related to the potential energy of these "information particles." The entropy maximization or minimization is then equivalent to the minimization or the maximization of the "information potential." Based on the Cauchy-Schwartz inequality and the Euclidean distance metric, we further proposed the quadratic mutual information as an alternative to Shannon's mutual information. There is also a "cross information potential" implementation for the quadratic mutual information that measures the correlation between the "marginal information potentials" at several levels. "Learning from examples" at the output of a mapper by the "information potential" or the "cross information potential" is implemented by propagating the "information force" or the "cross information force" back to the system parameters. Since the criteria are decoupled from the structure of learning machines, they are general learning schemes. The "information potential" and the "cross information potential" provide a microscopic expression for the macroscopic measure of the entropy and mutual information at the data sample level. The algorithms examine the relative position of each data pair and thus have a computational complexity of $O(N^2)$.

An on-line local algorithm for learning is also discussed, where the energy field is related to the famous biological Hebbian and anti-Hebbian learning rules. Based on this understanding, an on-line local algorithm for the generalized eigendecomposition is proposed.

The information potential methods have been successfully applied to various problems such as aspect angle estimation in synthetic aperture radar (SAR) imagery, target recognition in SAR imagery, layer-by-layer training of multilayer neural networks and blind source separation. The good performance of the methods on various problems confirms the validity and efficiency of the information potential methods.
CHAPTER 1
INTRODUCTION

1.1 Information and Energy: A Brief Review

Information plays an important role both in the life of a person and of a society, especially in today's information age. The basic purpose of all kinds of scientific research is to obtain information in a particular area. One of the most important tasks of space programs is to get information about cosmic space and celestial bodies, such as evidence whether there is life on Mars. A central problem of the Internet is how to transmit, process and store information in computer networks. "Like it or not, we are information dependent. It is a commodity as vital as the air we breathe, as any of our metabolic energy requirements. For better or worse, we're all inescapably embedded in a universe of flows, not only of matter and energy but also of whatever it is we call information" [You87: page 1].

The notion of information is so fundamental and universal that only the notion of energy can be compared with it. The parallel and analogy of these two fundamental notions are well known. Most of the greatest inventions and discoveries in scientific and human history can be related to either the conversion, transfer, and storage of energy or the transmission and storage of information. For instance, the use of fire and water, the invention of simple machines such as the lever and the wheel, the invention of the steam-engine, and the discoveries of electricity and atomic energy are all connected to energy, while the appearance of speech in prehistoric times and the invention of writing at the
dawn of human history, followed by the invention of paper, printing, telegraph, photogra-
phy, telephone, radio, television and finally the computer and the computer network are
examples of information. Many inventions and discoveries can be used for both purposes.
Fire, as an example, can be used for cooking, heating and transmitting signals. Electricity,
as another example, can be used for transmitting both energy and information [Ren60].
There are a variety of energies and information. If we disregard the actual form of
energy (mechanical, thermal, chemical, electrical and atomic, etc.) and the real content of
information, what will be left is the pure quantity [Ren60]. The principle of energy conser-
vation was formulated and developed in the middle of the last century, while the essence
of information was studied later in the 1940s. With the quantity of energy, we can come
to the conclusion that a small amount of U235 contains a large amount of atomic
energy and our world came into the atomic age. With the pure quantity of information, we
can tell that the optical cable can transmit much more information than the ordinary elec-
trical telephone line, and in general, the capacity of a communication channel can be spec-
ified in terms of the rate of information quantity. Although the quantitative measure of
information originated from the study of communication, it is such a fundamental concept and method that it has been widely applied to many areas such as statistics, physics, chemistry, biology, life science, psychology, psychobiology, cognitive science, neuroscience, cybernetics, computer science, economics, operations research, linguistics, and philosophy [You87, Kub75, Kap92, Jum86].
The study of the quantitative measure of information in communication systems started in the 1920s. In 1924, Nyquist showed that the speed of transmission of intelligence $W$ over a telegraph circuit with a fixed line speed is proportional to the logarithm of the number $m$ of current values used to encode the message: $W = k \log m$, where $k$ is a constant [Nyq24, Chr81]. In 1928, Hartley generalized this to all forms of communication, letting $m$ represent the number of symbols available at each selection of a symbol to be transmitted. Hartley explicitly addressed the issue of the quantitative measure for information and pointed out that it should be independent of psychological factors (or objective) [Har28, Chr81]. Later in 1948, Shannon published his celebrated paper "A Mathematical Theory of Communication," which explored the statistical structure of a message and extended Nyquist and Hartley's logarithmic measure for information to a probabilistic logarithm:

$$I = -\sum_{k=1}^{N} p_k \log p_k$$

for the probability structure $p_k \ge 0 \; (k = 1, \ldots, N)$, $\sum_{k=1}^{N} p_k = 1$. When $p_k = 1/m$ in the equiprobable case, Shannon's measure degenerates to Hartley's measure [Sha48, Sha62]. Shannon's measure can also be regarded as a measure for uncertainty. It laid the foundation for information theory.

There is a striking formal similarity between Shannon's measure and the entropy in statistical mechanics. This was one of the reasons that led von Neumann to suggest to Shannon to call his uncertainty measure the entropy [Tri71]. "Entropie" was a German word coined in 1865 by Clausius to represent the capacity for change of matter [Chr81]. The second law of thermodynamics, formulated by Clausius, is also known as the entropy law. Its best-known statement has been in the form, "Heat cannot by itself pass from a colder to a hotter system." Or more formally, the entropy of a closed system will never decrease, but can only increase until it reaches its maximum [You87]. The entropy maximum principle of a closed system has a corollary that is an energy minimum principle [Cha87]; i.e., the energy of the closed system will reach its minimum when the entropy of the system reaches its maximum.
Clausius' entropy was initially an abstract and macroscopic idea. It was Boltzmann who first gave the entropy a microscopic and probabilistic interpretation. Boltzmann's work showed that entropy could be understood as a statistical law measuring the probable states of the particles in a closed system. In statistical mechanics, each particle in a system occupies a point in a "phase space," and so the entropy of a system came to constitute a measure for the probability of the microscopic state (distribution of particles) of any such system. According to this interpretation, a closed system will approach a state of thermodynamic equilibrium because equilibrium is overwhelmingly the most probable state of the system. The probabilistic interpretation of entropy resulted in an interpretation of entropy that is one of the cornerstones of the modern relationship between measures of entropy and the amount of information in a message. That is, both the information entropy and the statistical mechanical entropy are measures of the uncertainty or disorder of a system [You87].

One interesting problem about entropy which puzzled physicists for almost 80 years is Maxwell's Demon, a hypothetical entity which could theoretically sort the molecules of a gas into either of two compartments, say, the faster molecules going into A, the slower to B, resulting in the lowering of the temperature in B while raising it in A without expenditure of work. But according to the second law of thermodynamics, i.e. the entropy law, the temperature of a closed system will eventually be even and thus the entropy be maximized. In 1929, Szilard pointed out that the sorting of the molecules depends on the information about the speed of the molecules, which is obtained by measurement or observation on the molecules, and any such measurement or observation will invariably involve dissipation of energy and increase entropy. While Szilard did not produce a working model, he
showed mathematically that entropy and information were fundamentally interconnected, and his formula was analogous to the measures of information developed by Nyquist and Hartley and eventually by Shannon [You87].

Contrary to closed systems, open systems with energy flux in and out tend to self-organize and develop and maintain a structural identity, resisting the entropy drift of closed systems and their irreversible thermodynamic fate [You87, Hak88]. In this area, Prigogine and his colleagues' work on nonlinear, nonequilibrium processes made a peculiar contribution, which provides a powerful explanation of how order in the form of stable structures can be built up and maintained in a universe whose ingredients seem otherwise subject to a law of increasing entropy [You87].

Boltzmann and others' work gave the relationship between entropy maximization and state probabilities; that is, the most probable microscopic state of an ensemble is a state of uniformity described by maximizing its entropy subject to constraints specifying its observed macroscopic condition [Chr81]. The maximization of Shannon's entropy, as a comparison, can be used as the basis for equiprobability assumptions (an equiprobable distribution should be used upon total ignorance of the probability distribution). Information-theoretic entropy maximization subject to known constraints was explored by Jaynes in 1957 as a basis for statistical mechanics, which in turn makes it a basis for thermostatics and thermodynamics [Chr81]. Jaynes also pointed out: "in making inferences on the basis of partial information we must use that probability distribution which has maximum entropy subject to whatever is known. This is the only unbiased assignment we can make; to use any other would amount to arbitrary assumption of information which by hypothesis we do not have" [Jay57: I, page 623]. More general than Jaynes' maximum entropy principle
is Kullback's minimum cross-entropy principle, which introduces the concept of cross-entropy or "directed divergence" of a probability distribution P from another probability distribution Q. The maximum entropy principle can be viewed as a special case of the minimum cross-entropy principle when Q is a uniform distribution [Kap92]. In addition, Shannon's mutual information is nothing but the directed divergence between the joint probability distribution and the factorized marginal distributions.

1.2 Motivation

The above gives a brief review of various aspects of energy, entropy and information, from which we can see how fundamental and general the concepts of energy and entropy are, and how these two fundamental concepts are related to each other. In this dissertation, the major interests and the issues addressed are about the energy and entropy of signals, especially the empirical energy and entropy measures of signals, which are crucial in signal processing practice. First, let's take a look at the empirical energy measures for signals.

There are many kinds of signals in the world. No matter what kind, a signal can be abstracted as $X(n) \in R^m$, where $n$ is the time index (only discrete time signals are considered in this dissertation), and $R^m$ represents an m-dimensional real space (only real signals are considered in this dissertation; complex signals can be thought of as a two dimensional real signal). The empirical energy and power of a finite signal $x(n) \in R$, $n = 1, \ldots, N$, are

$$E(x) = \sum_{n=1}^{N} x(n)^2, \qquad P(x) = \frac{1}{N} \sum_{n=1}^{N} x(n)^2 \eqno(1.1)$$
The difference between two signals $x_1(n)$ and $x_2(n)$, $n = 1, \ldots, N$, can be measured by the empirical energy or power of the difference signal $d(n) = x_1(n) - x_2(n)$:

$$E_d(x_1, x_2) = \sum_{n=1}^{N} d(n)^2, \qquad P_d(x_1, x_2) = \frac{1}{N} \sum_{n=1}^{N} d(n)^2 \eqno(1.2)$$

The difference between $x_1$ and $x_2$ can also be measured by the cross-correlation (inner-product)

$$C(x_1, x_2) = \sum_{n=1}^{N} x_1(n) x_2(n) \eqno(1.3)$$

or its normalized version

$$\bar{C}(x_1, x_2) = \sum_{n=1}^{N} x_1(n) x_2(n) \bigg/ \sqrt{\sum_{n=1}^{N} x_1(n)^2 \sum_{n=1}^{N} x_2(n)^2} \eqno(1.4)$$

The geometrical illustration of these quantities is shown in Figure 1-1.

Figure 1-1. Geometrical Illustration of Energy Quantities. [Figure omitted: the signals $x_1$ and $x_2$ shown as vectors from the origin $O$, their difference vector $d$, and the angle $\theta$ between them, with $\cos(\theta) = \bar{C}(x_1, x_2)$.]

Since $E(x) = C(x, x)$, cross-correlation can be regarded as an energy-related quantity.
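To make the empirical quantities concrete, here is a minimal numerical sketch of (1.1)-(1.4) (illustrative code, not from the dissertation; numpy and the helper names are assumptions):

```python
import numpy as np

def energy(x):
    # Empirical energy of a finite signal, Eq. (1.1): E(x) = sum_n x(n)^2
    return np.sum(x ** 2)

def power(x):
    # Empirical power, Eq. (1.1): P(x) = (1/N) sum_n x(n)^2
    return np.mean(x ** 2)

def cross_correlation(x1, x2):
    # Cross-correlation (inner product), Eq. (1.3)
    return np.sum(x1 * x2)

def normalized_cross_correlation(x1, x2):
    # Normalized version, Eq. (1.4); equals cos(theta) in Figure 1-1
    return cross_correlation(x1, x2) / np.sqrt(energy(x1) * energy(x2))

# Example: two noisy sinusoids
n = np.arange(100)
x1 = np.sin(0.1 * n)
x2 = np.sin(0.1 * n + 0.5) + 0.1 * np.random.randn(100)

Ed = energy(x1 - x2)                      # difference energy, Eq. (1.2)
print(power(x1), Ed, normalized_cross_correlation(x1, x2))

# E(x) = C(x, x), so cross-correlation is an energy-related quantity:
assert np.isclose(energy(x1), cross_correlation(x1, x1))
```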
We know that for a random signal $x(n)$ with the pdf (probability density function) $f_x(x)$, the Shannon information entropy is

$$H(x) = -\int f_x(x) \log f_x(x) \, dx \eqno(1.5)$$

Based on the information entropy concept, the difference or similarity between two random signals $x_1$ and $x_2$ with joint pdf $f_{x_1 x_2}(x_1, x_2)$ and marginal pdfs $f_{x_1}(x_1)$, $f_{x_2}(x_2)$ can be measured by the mutual information between the two signals:

$$I(x_1, x_2) = \iint f_{x_1 x_2}(x_1, x_2) \log \frac{f_{x_1 x_2}(x_1, x_2)}{f_{x_1}(x_1) f_{x_2}(x_2)} \, dx_1 \, dx_2 \eqno(1.6)$$

Since $H(x) = I(x, x)$, mutual information is an entropy-type quantity.

Comparatively, energy is a simple, straightforward idea and easy to implement, while information entropy uses all the statistics of the signal and is much more profound and difficult to measure or implement. A very fundamental and important question arises naturally: if a discrete data set $x(n) \in R^m$, $n = 1, \ldots, N$, is given, what is the information entropy related to this data set, or how can we estimate the entropy for this data set? This empirical entropy problem was addressed before in the literature [Chr80, Chr81, Bat94, Vio95, Fis97], etc. Parametric methods can be used for pdf estimation and then entropy estimation, which is straightforward but less general. Nonparametric methods for pdf estimation can be used as the basis for general entropy estimation (no assumption about the data distribution is required). One example is the histogram method [Bat94], which is easy to implement in one dimensional space but difficult to apply to high dimensional space, and also difficult to analyze mathematically. Another popular nonparametric pdf estimation method is the Parzen window method, the so-called kernel or potential function method [Par62, Dud73, Chr81]. Once the Parzen window method is used, the perplexing problem left is the calculation of the integral in the entropy or mutual information formula. Numerical methods are extremely complex in this case and thus only suitable for a one-
dimensional variable [Pha96]. Approximation can also be made by using sample mean
[Vio95] which requires a large amount of data and may not be a good approximation for a
small data set. The indirect method of Fisher [Fis97] cannot be used for entropy estima-
tion but only for entropy maximization purposes. For the blind source separation (BSS) or
independent component analysis (ICA) problem [Com94, Cao96, Car98b, Bel95, Dec96,
Car97, Yan97], one popular contrast function is the empirical mutual information between
the outputs of a demixing system, which can be implemented by the difference between
the sum of the marginal entropies and the joint entropy, where joint entropy is usually
related to the input entropy and the determinant of the linear demixing matrix, and the
marginal entropies are estimated based on the moment expansions for pdf such as the
Edgeworth expansion and the Gram-Charlier expansion [Yan97, Dec96]. The moment
expansions have to be truncated in practice and are only appropriate for a one-dimensional
(1-D) signal because, in multi-dimensional space, the expansions will become extremely
complicated. So, from the above brief review, we can see that an effective and general entropy estimation method is still lacking.
One major point of this dissertation is to formulate and develop such an effective and
general method for the empirical entropy problem and give a unifying point of view about
signal energy and entropy, especially the empirical signal energy and entropy.
Surprisingly, if we regard each data sample mentioned above as a physical particle,
then the whole discrete data set is just like a set of particles in a statistical mechanical sys-
tem. It might be interesting to ask what the information entropy of this data set is and how it can be related to physics.
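As a preview of the "information potential" method developed in Chapter 2, the following sketch estimates Renyi's quadratic entropy by combining a Gaussian Parzen window pdf estimate with the closed form for the integral of a product of Gaussian kernels (the subject of Appendix A). This is an illustration under those assumptions, with hypothetical function names; the double sum over sample pairs is what gives the $O(N^2)$ complexity and the particle-interaction picture:

```python
import numpy as np

def gaussian_kernel(d, var):
    # 1-D Gaussian kernel with variance `var`, evaluated at differences d
    return np.exp(-d ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def information_potential(x, sigma=0.5):
    # V2 = integral of the squared Parzen pdf estimate. For Gaussian kernels
    # the integral has a closed form: the mean of pairwise interactions
    # G(x_i - x_j, 2*sigma^2), hence O(N^2) pair evaluations.
    d = x[:, None] - x[None, :]
    return np.mean(gaussian_kernel(d, 2 * sigma ** 2))

def renyi_quadratic_entropy(x, sigma=0.5):
    # H_R2 = -log V2: maximizing entropy = minimizing the potential
    return -np.log(information_potential(x, sigma))

samples = np.random.randn(200)   # 200 "information particles"
print(renyi_quadratic_entropy(samples))
```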
According to modern science, the universe is a mass-energy system. In such a mass-energy spirit, we would ask whether the information entropy, especially the empirical information entropy, would somehow have mass-energy properties. In this dissertation, the empirical information entropy is related to the "potential energy" of "data particles" (data samples). Thus, a data sample is called an "information particle" (IPT). In fact, data samples are basic units conveying information; they indeed are "particles" which transmit information. Accordingly, the empirical entropy can be related to the potential energy called the "information potential" (IP) of "information particles" (IPTs).

With the information potential, we can further study how it can be used in a learning system or an adaptive system of signal processing, and how a learning system can self-organize with the information flux in and out (often in the form of a flux of data samples), just like an open physical system which will exhibit some order with the energy flux in and out.

Information theory originated from communication study and has been widely used for design and practice in this area and many other areas. However, its application to learning systems or adaptive systems such as perceptual systems, either artificial or natural, is just in its infancy. Some early researchers tried to use information theory for the explanation of a perceptual process, e.g. Attneave, who pointed out in 1954 that "a major function of the perceptual machinery is to strip away some of the redundancy of stimulation, to describe or encode information in a form more economical than that in which it impinges on the receptors" [Hay94: page 444]. However, only in the late 1980s did Linsker propose the principle of maximum information preservation (InfoMax) [Lin88, Lin89] as the basic principle for the self-organization of neural networks, which requires
the maximization of the mutual information between the output and the input of the network so that the information about the input is best preserved in the output. Linsker further applied the principle to linear networks with Gaussian assumptions on the input data distribution and noise distribution, and derived the way to maximize the mutual information in this particular case [Lin88, Lin89]. In 1988, Plumbley and Fallside proposed the similar minimum information loss principle [Plu88]. In the same period, there were other researchers who used information-theoretic principles but still with the limitation of a linear model or a Gaussian assumption, for instance, Becker and Hinton's spatially coherent features [Bec89, Bec92], Ukrainec and Haykin's spatially incoherent features [Ukr92], etc. In recent years, the information-theoretic approaches for BSS and ICA have drawn a lot of attention. Although they certainly broke the limitation of the model linearity and the Gaussian assumption, the methods are still not general enough. There are two typical information-theoretic methods in this area: maximum entropy (ME) and minimum mutual information (MMI) [Bel95, Yan97, Yan98, Pha96]. Both methods use the entropy relation of a full rank linear mapping: $H(Y) = H(X) + \log|\det(W)|$, where $Y = WX$ and $W$ is a full rank square matrix. Thus the estimation of information quantities is coupled with the network structure. Moreover, ME requires that the nonlinearity in the outputs matches the cdf (cumulative distribution function) of the source signals [Bel95], and MMI uses the above mentioned expansion methods or numerical methods to estimate the marginal entropies [Yan97, Yan98, Pha96]. On the whole, a general method and a unifying point of view about the estimation of information quantities are still lacking.
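The entropy relation $H(Y) = H(X) + \log|\det(W)|$ can be checked numerically in the Gaussian case, where the Shannon differential entropy has the closed form $H = \frac{1}{2}\log((2\pi e)^n |\Sigma|)$ (derived in Appendix B). This is a sketch for illustration only:

```python
import numpy as np

def gaussian_entropy(cov):
    # Shannon differential entropy of an n-D Gaussian: 0.5*log((2*pi*e)^n |cov|)
    n = cov.shape[0]
    return 0.5 * np.log((2 * np.pi * np.e) ** n * np.linalg.det(cov))

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 3))   # full-rank square mapping matrix
cov_x = np.eye(3)                 # X ~ N(0, I)
cov_y = W @ cov_x @ W.T           # covariance of Y = W X

lhs = gaussian_entropy(cov_y)
rhs = gaussian_entropy(cov_x) + np.log(abs(np.linalg.det(W)))
assert np.isclose(lhs, rhs)       # H(Y) = H(X) + log|det(W)|
```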
Human beings and animals in general are examples of systems that can learn from interactions with their environments. Such interactions are usually in the form of "examples" (or called "data samples"). For instance, children learn to speak by listening, learn to recognize objects by being presented with exemplars, learn to walk by trying, etc. In general, children learn by the stimulation from their environment. Adaptive systems for signal processing [Wid85, Hay94, Hay96] are also learning systems that evolve with the interaction with input, output and desired (or teacher) signals.

To study the general principle of a learning system, we first need to set up an abstract model for the system and its environment. As illustrated in Figure 1-2, an abstract learning system is a mapping $R^m \rightarrow R^k: Y = q(X, W)$, where $X \in R^m$ is the input signal, $Y \in R^k$ is the output signal, and $W$ is a set of parameters of the mapping. The environment is modeled by the doublet $(X, D)$, where $D \in R^k$ is a desired signal (teacher signal). The learning mechanism is a set of rules or procedures that will adjust the parameters $W$ so that the mapping achieves a desired goal.

Figure 1-2. Illustration of a Learning System. [Figure omitted: the learning system $Y = q(X, W)$ maps the input signal $X$ to the output signal $Y$; the learning mechanism adjusts $W$ using the output and the desired signal $D$.]

There are a variety of learning systems: linear or nonlinear, feedforward or recurrent, full rank or dimension reduced, perceptron and multilayer perceptron (MLP) with global basis or radial-basis function with local basis, etc. Different system structures may have different properties and usage [Hay98].
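The abstract model of Figure 1-2 can be phrased as a minimal interface. The sketch below is purely illustrative: the class, the linear mapping and the squared-error learning rule are hypothetical choices for concreteness, not the dissertation's method:

```python
import numpy as np

class LinearLearningSystem:
    """Toy instance of the mapping Y = q(X, W): here q(X, W) = W @ X."""

    def __init__(self, m, k):
        self.W = np.zeros((k, m))   # parameters of the mapping R^m -> R^k

    def forward(self, X):
        return self.W @ X           # output signal Y

    def learn(self, X, D, lr=0.01):
        # Learning mechanism: a rule that adjusts W so that the mapping
        # achieves a desired goal (here, gradient descent on |D - Y|^2).
        Y = self.forward(X)
        self.W += lr * np.outer(D - Y, X)

system = LinearLearningSystem(m=4, k=2)
X = np.random.randn(4)              # input from the environment
D = np.array([1.0, -1.0])           # desired (teacher) signal
for _ in range(100):
    system.learn(X, D)              # environment doublet (X, D) drives learning
```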
The environment doublet $(X, D)$ also has a variety of forms. A learning process can have a desired signal or not (very often the input signal is the implicit desired signal). Some statistical property of $X$ or $Y$ or $D$ can be given or assumed. Most often, only a discrete data set $\Omega = \{(X_i, D_i), \; i = 1, \ldots, N\}$ is provided. Such a scheme is called "learning from examples" and is a general case [Hay94, Hay98]. This dissertation is more interested in "learning from examples" than in any scheme with some assumptions about the data. Of course, if a priori knowledge about the data is known, a learning method should incorporate this knowledge.

There are also a lot of learning mechanisms. Some of them make assumptions about data, and others do not. Some are coupled with the structure and topology of the learning system, while others are independent of the system. A general learning mechanism should not depend on data and should be de-coupled from the learning system.

There is no doubt that the area is rich in diversity but lacks unification. There are no known concepts more abstract and fundamental than energy and information. To look for the essence of learning, one should start from these two basic ideas. Obviously, learning is about obtaining knowledge and information. Based on the above learning system model, we can say that learning is nothing but to transfer onto the machine parameters the information contained in the environment or, to be more specific, in a given data set. This dissertation will try to give a unifying point of view for learning systems and to implement it by using the proposed information potential.

The basic purpose of learning is to generalize. The ability of animals to learn something general from their past experiences is the key to their survival in the future. Regarding the generalization ability of a learning machine, one very fundamental question is:
what is the best we can do to generalize for a given learning system and a given set of environmental data? One thing is very clear: the information contained in the given data set is a quantity that cannot be changed by any learning method, and no learning method can go beyond that. Thus, it is the best that one learning system can possibly obtain. Generalization, from this point of view, is not to create something new but to utilize fully the information contained in the observed data, neither less nor more. By "less," we mean that the information obtained by a learning system is less than the information contained in the given data. By "more," we mean that implicitly or explicitly, a learning method assumes something that is not given. This is also the spirit of Jaynes [Jay57] mentioned above, and a similar point of view can be found in Christensen [Chr80, Chr81].

The environmental data for a learning system are usually not collected all at one time but are accumulated during a learning process. Whenever one datum appears or after a small set of data is obtained, learning should take place and the parameters of the learning system should be updated. This is the problem of the on-line learning method, which is also an issue that this dissertation is going to deal with.

Another problem that this dissertation is interested in is "local" learning algorithms. In a biological nervous system, what can be changed is the strength of synaptic connections. The change of a synaptic connection can only depend on its local information, i.e. its input and output. For an engineering system, it will be much easier to implement by either hardware or software if the learning rule is "local;" i.e., the update of a connection in a learning network system only relies on its input and output. Hebb's rule is a famous neuropsychological postulation of how a synaptic connection will evolve
with its input and output [Heb49, Hay98]. It will be shown in this dissertation how Heb-
bian type algorithms can be related to the energy and entropy of signals.
1.3 Outline
In Chapter 2, the basic ideas of energy, information entropy and their relationship will
be reviewed. Since the information entropy directly relies on the pdf of the variable, the
Parzen window nonparametric method will be reviewed for the development of the idea of
information potential and cross information potential. Finally, the derivation will be given,
the idea of the information force in an information potential field will be introduced for its
use in learning systems, and the calculation procedure for information potential and cross
information potential and all the forces in corresponding information potential fields will
be described.
In Chapter 3, a variety of learning systems and learning mechanisms will be reviewed.
A unifying point of view about learning by information theory will be given. The informa-
tion potential implementation for the unifying idea will be described, and the generalization of learning will be discussed.
In Chapter 4, the on-line local algorithms for a linear system with energy criteria will
be reviewed. The relationship between Hebbian, anti-Hebbian rules and the energy criteria
will be discussed. An on-line local algorithm for generalized eigen-decomposition will be
proposed, with the discussion of convergence properties such as the convergence speed
and stability.
Chapter 5 will give several application examples. First, the information potential
method will be applied to aspect angle estimation for SAR images. Second, the same
method will be applied to SAR automatic target recognition. Third, the example of the training of a layered neural network by the information potential method will be described.
Fourth, the method will be applied to independent component analysis and blind source
separation.
Chapter 6 will conclude the dissertation and provide a survey on the future work in
this area.
CHAPTER 2
ENERGY, ENTROPY AND INFORMATION POTENTIAL
2.1 Energy, Entropy and Information of Signals
2.1.1 Energy of Signals
From the statistical point of view, the energy of a 1-D stationary signal is related to its variance. For a 1-D stationary signal $x(n)$ with variance $\sigma^2$ and mean $m$, its energy (precisely, short time energy or power) is

$$E_x = E[x^2] = \sigma^2 + m^2 \eqno(2.1)$$

where $E[\cdot]$ is the expectation operator. If $m = 0$, then the energy is equal to the variance: $E_x = \sigma^2$. So, basically, energy is a quantity related to second order statistics.

For two 1-D signals $x_1(n)$ and $x_2(n)$ with means $m_1$ and $m_2$ respectively, the covariance is $r = E[(x_1 - m_1)(x_2 - m_2)] = E[x_1 x_2] - m_1 m_2$, and we have the cross-correlation between the two signals:

$$c_{12} = C_{x_1 x_2} = E[x_1 x_2] = r + m_1 m_2 \eqno(2.2)$$

If at least one signal is zero-mean, $c_{12} = r$.

For a 2-D signal $X = (x_1, x_2)^T$, all the second order statistics are given in a covariance matrix $\Sigma$, and we have

$$E[XX^T] = \Sigma + \begin{bmatrix} m_1^2 & m_1 m_2 \\ m_1 m_2 & m_2^2 \end{bmatrix}, \qquad \Sigma = \begin{bmatrix} \sigma_1^2 & r \\ r & \sigma_2^2 \end{bmatrix} \eqno(2.3)$$
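A quick finite-sample check of (2.1) and (2.2) on synthetic data (an illustrative sketch; the printed pairs agree only up to sampling error):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100_000
m1, m2 = 1.0, -0.5
x1 = m1 + rng.standard_normal(N)      # variance 1, mean m1
x2 = m2 + rng.standard_normal(N)      # independent, mean m2

Ex = np.mean(x1 ** 2)                 # Eq. (2.1): E_x = sigma^2 + m^2
print(Ex, 1.0 + m1 ** 2)              # close to sigma^2 + m1^2

r = np.mean((x1 - m1) * (x2 - m2))    # covariance
c12 = np.mean(x1 * x2)                # cross-correlation, Eq. (2.2)
print(c12, r + m1 * m2)               # c12 = r + m1*m2
```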
Usually, the first order statistics have nothing to do with the information; we will just consider the zero-mean case; thus we have $E[XX^T] = \Sigma$.

For a 2-D signal, there are three energy quantities in the covariance matrix: $\sigma_1^2$, $\sigma_2^2$ and $r$. One may ask what the overall energy quantity for a 2-D signal is. From linear algebra [Nob88], there are 3 choices: the first is the determinant of $\Sigma$, which is a volume measure in the 2-D signal space and is equal to the product of all the eigenvalues of $\Sigma$; the second is the trace of $\Sigma$, which is equal to the sum of all the eigenvalues of $\Sigma$; the third is the product of all the diagonal elements. Thus, we have

$$J_1 = \log|\Sigma|, \qquad J_2 = \mathrm{tr}(\Sigma) = \sigma_1^2 + \sigma_2^2, \qquad J_3 = \log(\sigma_1^2 \sigma_2^2) \eqno(2.4)$$

where $\mathrm{tr}(\cdot)$ is the trace operator. The use of the log function in $J_1$ and $J_3$ is to reduce the dynamic range of the original quantities, and this is also related to the information of the signal, which will become clear later in this chapter.

The component signals $x_1$ and $x_2$ will be called marginal signals in this dissertation. If the two marginal signals $x_1$ and $x_2$ are uncorrelated, then $J_1 = J_3$. In general, we have

$$J_3 \ge J_1 \eqno(2.5)$$

where the equality holds if and only if the two marginal signals are uncorrelated. This is the so-called Hadamard's inequality [Nob88, Dec96]. In general, for a positive semi-definite matrix $\Sigma$, we have the same inequality, where $J_1$ is the determinant of the matrix (or its logarithm; note that the logarithm is a monotonic increasing function) and $J_3$ is the multiplication of the diagonal components (or its logarithm).
When the two marginal signals are uncorrelated and their variances are equal, then $J_1$ and $J_2$ are equivalent in the sense that

$$J_1 = 2 \log J_2 - 2 \log 2 = 2 \log \sigma^2 \eqno(2.6)$$

For an n-D signal $X = (x_1, \ldots, x_n)^T$ with zero-mean, we have the covariance matrix

$$\Sigma = E[XX^T] = \begin{bmatrix} \sigma_1^2 & \cdots & r_{1n} \\ \vdots & \ddots & \vdots \\ r_{n1} & \cdots & \sigma_n^2 \end{bmatrix} \eqno(2.7)$$

where $\sigma_i^2 \; (i = 1, \ldots, n)$ are the variances of the marginal signals $x_i$, and $r_{ij} \; (i = 1, \ldots, n, \; j = 1, \ldots, n, \; i \ne j)$ are the cross-correlations between the marginal signals $x_i$ and $x_j$. The three possible overall energy measures are

$$J_1 = \log|\Sigma|, \qquad J_2 = \mathrm{tr}(\Sigma) = \sum_{i=1}^{n} \sigma_i^2, \qquad J_3 = \log\left(\prod_{i=1}^{n} \sigma_i^2\right) \eqno(2.8)$$

Hadamard's inequality is $J_3 \ge J_1$; the equality holds if and only if $\Sigma$ is diagonal, i.e., the marginal signals are uncorrelated with each other.

$J_2$ is equal to the sum of all the eigenvalues of $\Sigma$ and is invariant under any orthonormal transformation (rotation transform). When the marginal signals are uncorrelated with each other and their variances are equal, $J_2$ and $J_1$ are equivalent in the sense that they are related by a monotonic increasing function:

$$J_1 = n \log J_2 - n \log n = n \log \sigma^2 \eqno(2.9)$$
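The three overall energy measures in (2.8), Hadamard's inequality $J_3 \ge J_1$, and the special case (2.9) can be verified directly (a sketch with an arbitrarily chosen covariance matrix):

```python
import numpy as np

def energy_measures(cov):
    # Eq. (2.8): J1 = log|Sigma|, J2 = tr(Sigma), J3 = log(prod of variances)
    J1 = np.log(np.linalg.det(cov))
    J2 = np.trace(cov)
    J3 = np.sum(np.log(np.diag(cov)))
    return J1, J2, J3

# A correlated covariance matrix: Hadamard's inequality gives J3 >= J1,
# with equality only when Sigma is diagonal (uncorrelated marginals).
cov = np.array([[2.0, 0.8, 0.1],
                [0.8, 1.0, 0.3],
                [0.1, 0.3, 1.5]])
J1, J2, J3 = energy_measures(cov)
assert J3 >= J1

# Uncorrelated marginals with equal variances: J1 = n*log(J2) - n*log(n), Eq. (2.9)
sigma2, n = 1.7, 3
J1, J2, _ = energy_measures(sigma2 * np.eye(n))
assert np.isclose(J1, n * np.log(J2) - n * np.log(n))
```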
2.1.2 Information Entropy

Compared with energy, the information entropy of a signal involves all the statistics of a signal, and thus is more profound and difficult to implement.

As mentioned in Chapter 1, the study of abstract quantitative measures for information started in the 1920s when Nyquist and Hartley proposed a logarithmic measure [Nyq24, Har28]. Later in 1948, Shannon pointed out that the measure is valid only if all events are equiprobable [Sha48]. Further, he coined the term "information entropy," which is the mathematical expectation of Nyquist and Hartley's measures. In 1960, Renyi generalized Shannon's idea by using an exponential function rather than a linear function to calculate the mean [Ren60, Ren61]. Later on, other forms of information entropy appeared (e.g. Havrda and Charvat's measure, Kapur's measure) [Kap94]. Although Shannon's entropy is the only one which possesses all the postulated properties (which will be given later) for an information measure, the other forms such as Renyi's and Havrda-Charvat's are equivalent with regard to entropy maximization [Kap94]. In a real problem, which form to use depends upon other requirements such as ease of implementation.

For an event with probability $p$, according to Hartley's idea, the information given when this event happens is $I(p) = \log(1/p) = -\log p$ [Har28]. Shannon further developed Hartley's idea, resulting in Shannon's information entropy for a variable with the probability distribution $p_k$, $k = 1, \ldots, n$:

$$H_s = \sum_{k=1}^{n} p_k I(p_k), \qquad \sum_{k=1}^{n} p_k = 1, \quad p_k \ge 0 \eqno(2.10)$$

In the general theory of means, a mean of the real numbers $x_1, \ldots, x_n$ with weights $p_1, \ldots, p_n$ has the form
$$\varphi^{-1}\left(\sum_{k=1}^{n} p_k \varphi(x_k)\right) \eqno(2.11)$$

where $\varphi(x)$ is a Kolmogorov-Nagumo function, which is an arbitrary continuous and strictly monotonic function defined on the real numbers. So, in general, the entropy measure should be [Ren60, Ren61]

$$\varphi^{-1}\left(\sum_{k=1}^{n} p_k \varphi(I(p_k))\right) \eqno(2.12)$$

As an information measure, $\varphi(\cdot)$ cannot be arbitrary since information is "additive." To meet the additivity condition, $\varphi$ can be either $\varphi(x) = x$ or $\varphi(x) = 2^{(1-\alpha)x}$. If the former is used, (2.12) will become Shannon's entropy (2.10). If the latter is used, Renyi's entropy with order $\alpha$ is obtained [Ren60, Ren61]:

$$H_{R\alpha} = \frac{1}{1-\alpha} \log\left(\sum_{k=1}^{n} p_k^{\alpha}\right), \qquad \alpha > 0, \; \alpha \ne 1 \eqno(2.13)$$

In 1967, Havrda and Charvat proposed another entropy measure which is similar to Renyi's measure but has a different scaling [Hav67, Kap94] (it will be called Havrda-Charvat's entropy or H-C entropy for short):

$$H_{h\alpha} = \frac{1}{1-\alpha} \left(\sum_{k=1}^{n} p_k^{\alpha} - 1\right), \qquad \alpha > 0, \; \alpha \ne 1 \eqno(2.14)$$

There are also some other entropy measures, for instance, $H_{\infty} = -\log(\max_k p_k)$ [Kap94].
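The three entropy measures (2.10), (2.13) and (2.14) are easy to compare numerically; the sketch below (illustrative code) also previews the $\alpha \to 1$ limits stated later in (2.17) and (2.18):

```python
import numpy as np

def shannon(p):
    return -np.sum(p * np.log(p))                     # Eq. (2.10)

def renyi(p, alpha):
    return np.log(np.sum(p ** alpha)) / (1 - alpha)   # Eq. (2.13)

def havrda_charvat(p, alpha):
    return (np.sum(p ** alpha) - 1) / (1 - alpha)     # Eq. (2.14)

p = np.array([0.5, 0.25, 0.15, 0.1])
print(shannon(p), renyi(p, 2), havrda_charvat(p, 2))

# As alpha -> 1, both generalized entropies approach Shannon's entropy:
for alpha in (1.1, 1.01, 1.001):
    print(alpha, renyi(p, alpha), havrda_charvat(p, alpha))
```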
-
op-
other properties can be derived from these properties [Sha48, Sha62, Kap92, Kap94,
Acz75].
(1) The entropy measure is a continuous function of all the probabilities
, which means that a small change in probability distribution will only result in a small
change in the entropy.
(2) is permutationally symmetric; i.e., the position change of any two or
more in will not change the entropy value. Actually, the permutation of
any in the distribution will not change the uncertainty or disorder of the distribution
and thus should not affect the entropy.
(3) is a monotonic increasing function of . For an equiprobable
distribution, when the number of choices increases, the uncertainty or disorder
increases, and so does the entropy measure.
(4) Recursivity: If an entropy measure satisfies (2.15) or (2.16), then it has the recur-
sivity property. It means that the entropy of outcomes can be expressed in terms of the
entropy of outcomes plus the weighted entropy of the combined 2 outcomes.
(2.15)
(2.16)
where is the parameter in Renyi’s entropy or H-C entropy.
(5) Additivity: If and are two independent proba
bility distribution, and the joint probability distribution is denoted by , then the pr
erty is called additivity.
H p1 … pn, ,( )
pk
H p1 … pn, ,( )
pk H p1 … pn, ,( )
pk
H 1 n⁄ … 1 n⁄, ,( ) n
n
n
n 1–
Hn p1 p2 … pn, , ,( ) Hn 1– p1 p2+ p3 … pn, , ,( ) p1 p2+( )H2
p1
p1 p2+-----------------
p2
p1 p2+-----------------,
+=
Hn p1 p2 … pn, , ,( ) Hn 1– p1 p2+ p3 … pn, , ,( ) p1 p2+( )αH2
p1
p1 p2+-----------------
p2
p1 p2+-----------------,
+=
α
p p1 …pn,( )= q q1 … qm, ,( )=
p q•
H p q•( ) H p( ) H q( )+=
23
The following table gives a comparison of the three types of entropy with respect to the above five properties:

Table 2-1. The Comparison of Properties of Three Entropies

Properties   (1)   (2)   (3)   (4)   (5)
Shannon's    yes   yes   yes   yes   yes
Renyi's      yes   yes   yes   no    yes
H-C's        yes   yes   yes   yes   no

From the table, we can see that the three types of entropy differ in recursivity and additivity. However, Kapur pointed out: "The maximum entropy probability distributions given by Havrda-Charvat and Renyi's measures are identical. This shows that neither additivity nor recursivity is essential for a measure to be used in maximum entropy principle" [Kap94: page 42]. So, the three entropies are equivalent for entropy maximization, and any of them can be used.

As we can see from the above, Shannon's entropy has no parameter, but both Renyi's entropy and Havrda-Charvat's entropy have a parameter $\alpha$. So, both Renyi's and Havrda-Charvat's measures constitute a family of entropy measures.

There is a relation between Shannon's entropy and Renyi's entropy [Ren60, Kap94]:

$$H_{R\alpha} \ge H_s \ge H_{R\beta} \quad \text{if } 1 > \alpha > 0 \text{ and } \beta > 1, \qquad \lim_{\alpha \to 1} H_{R\alpha} = H_s \eqno(2.17)$$

i.e., Renyi's entropy is a monotonic decreasing function of the parameter $\alpha$, and it approaches Shannon's entropy when $\alpha$ approaches 1. Thus, Shannon's entropy can be regarded as one member of the Renyi entropy family.

Similar results hold for Havrda-Charvat's entropy measure [Kap94]:
$$H_{h\alpha} \ge H_s \ge H_{h\beta} \quad \text{if } 1 > \alpha > 0 \text{ and } \beta > 1, \qquad \lim_{\alpha \to 1} H_{h\alpha} = H_s \eqno(2.18)$$

Thus, Shannon's entropy can also be regarded as one member of the Havrda-Charvat entropy family. So, both Renyi and Havrda-Charvat generalize Shannon's idea of information entropy.

When $\alpha = 2$, $H_{h2} = 1 - \sum_{k=1}^{n} p_k^2$ is called quadratic entropy [Jum90]. In this dissertation, $H_{R2} = -\log \sum_{k=1}^{n} p_k^2$ is also called quadratic entropy, for convenience and because of the dependence of the entropy quantity on the quadratic form of the probability distribution. The quadratic form will give us more convenience, as we will see later.

For a continuous random variable $Y$ with pdf $f_Y(y)$, similarly to the Boltzmann-Shannon differential entropy $H_s(Y) = -\int_{-\infty}^{+\infty} f_Y(y) \log f_Y(y) \, dy$, we can obtain the differential version for these two types of entropy:

$$H_{R\alpha}(Y) = \frac{1}{1-\alpha} \log \int_{-\infty}^{+\infty} f_Y(y)^{\alpha} \, dy, \qquad H_{R2}(Y) = -\log \int_{-\infty}^{+\infty} f_Y(y)^2 \, dy$$
$$H_{h\alpha}(Y) = \frac{1}{1-\alpha} \left(\int_{-\infty}^{+\infty} f_Y(y)^{\alpha} \, dy - 1\right), \qquad H_{h2}(Y) = 1 - \int_{-\infty}^{+\infty} f_Y(y)^2 \, dy \eqno(2.19)$$

The relationships among Shannon's, Renyi's and Havrda-Charvat's entropies in (2.17) and (2.18) will hold for their corresponding differential entropies.
2.1.3 Geometrical Interpretation of Entropy

From the above, we see that both Renyi's entropy and Havrda-Charvat's entropy contain the term $\sum_{k=1}^{n} p_k^{\alpha}$ for a discrete variable, and both of them approach Shannon's entropy when $\alpha$ approaches 1. This suggests that all these entropies are related to some kind of distance between the point $p = (p_1, \ldots, p_n)$ of the probability distribution and the origin in the space $R^n$. As illustrated in Figure 2-1, the probability distribution point $p = (p_1, \ldots, p_n)$ is restricted to a segment of the hyperplane defined by $\sum_{k=1}^{n} p_k = 1$ and $p_k \ge 0$ (in the 2-D case, the region is the line segment connecting the two points (1,0) and (0,1); in the 3-D case, the region is the triangular area confined by the three lines connecting each pair of the three points (1,0,0), (0,1,0) and (0,0,1)). The entropy of the probability distribution $p = (p_1, \ldots, p_n)$ is a function of $V_{\alpha} = \sum_{k=1}^{n} p_k^{\alpha}$, which is the $\alpha$-norm of the point $p$ raised to the power $\alpha$ [Nov88, Gol93] and will be called the "entropy $\alpha$-norm." Renyi's entropy rescales the "entropy $\alpha$-norm" $V_{\alpha}$ by a logarithm: $H_{R\alpha} = \frac{1}{1-\alpha} \log V_{\alpha}$; while Havrda-Charvat's entropy linearly rescales the "entropy $\alpha$-norm" $V_{\alpha}$: $H_{h\alpha} = \frac{1}{1-\alpha}(V_{\alpha} - 1)$.

Figure 2-1. Geometrical Interpretation of Entropy. [Figure omitted: in 2-D, the point $p = (p_1, p_2)$ lies on the segment joining (1,0) and (0,1); in 3-D, $p = (p_1, p_2, p_3)$ lies in the triangle with vertices (1,0,0), (0,1,0) and (0,0,1). The "entropy $\alpha$-norm" is $V_{\alpha} = \sum_{k=1}^{n} p_k^{\alpha} = \|p\|_{\alpha}^{\alpha}$, the $\alpha$-norm of $p$ raised to the power $\alpha$.]

So, both Renyi's entropy with order $\alpha$ ($H_{R\alpha}$) and Havrda-Charvat's entropy with order $\alpha$ ($H_{h\alpha}$) are related to the $\alpha$-norm of the probability distribution $p$. For the above-mentioned infinity entropy $H_{\infty}$, there is a relation $\lim_{\alpha \to \infty} H_{R\alpha} = H_{\infty}$ and $H_{\infty} = -\log(\max_k p_k)$ [Kap94]. Therefore, $H_{\infty}$ is related to the infinity-norm of the
probability distribution $p$. For Shannon's entropy, we have $\lim_{\alpha \to 1} H_{R\alpha} = H_s$ and $\lim_{\alpha \to 1} H_{h\alpha} = H_s$. It might be interesting to consider Shannon's entropy as the result of the 1-norm of the probability distribution $p$. Actually, the 1-norm of any probability distribution is always 1 ($V_1 = 1$). If we plug $\alpha = 1$ and $V_1 = 1$ into $H_{R\alpha} = \frac{1}{1-\alpha} \log V_{\alpha}$ and $H_{h\alpha} = \frac{1}{1-\alpha}(V_{\alpha} - 1)$, we will get $0/0$. Its limit, however, is Shannon's entropy. So, in the limit sense, Shannon's entropy can be regarded as the function value of the 1-norm of the probability distribution. Thus, we can generally say that the entropy with order $\alpha$ (either Renyi's or H-C's) is a monotonic function of the $\alpha$-norm of the probability distribution $p$, and the entropy (all entropies, at least all the above-mentioned entropies) is essentially a monotonic function of the distance from the probability distribution point $p$ to the origin.

When $\alpha > 1$, both Renyi's entropy $H_{R\alpha}$ and Havrda-Charvat's entropy $H_{h\alpha}$ are monotonic decreasing functions of the "entropy $\alpha$-norm" $V_{\alpha}$. So, in this case, entropy maximization is equivalent to the minimization of the "entropy $\alpha$-norm" $V_{\alpha}$, and entropy minimization is equivalent to the maximization of the "entropy $\alpha$-norm" $V_{\alpha}$.

When $\alpha < 1$, both Renyi's entropy $H_{R\alpha}$ and Havrda-Charvat's entropy $H_{h\alpha}$ are monotonic increasing functions of the "entropy $\alpha$-norm" $V_{\alpha}$. So, in this case, entropy maximization is equivalent to the maximization of the "entropy $\alpha$-norm" $V_{\alpha}$, and entropy minimization is equivalent to the minimization of the "entropy $\alpha$-norm" $V_{\alpha}$.
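A small numerical illustration of this equivalence for the quadratic case $\alpha = 2$ (a sketch; the two example distributions are arbitrary):

```python
import numpy as np

def v_norm(p, alpha=2):
    return np.sum(p ** alpha)        # "entropy alpha-norm" V_alpha

uniform = np.array([0.25, 0.25, 0.25, 0.25])   # maximum-entropy distribution
skewed  = np.array([0.70, 0.10, 0.10, 0.10])   # lower-entropy distribution

# For alpha = 2 > 1, larger V2 corresponds to smaller quadratic entropy:
assert v_norm(uniform) < v_norm(skewed)
assert -np.log(v_norm(uniform)) > -np.log(v_norm(skewed))   # H_R2 ordering
assert 1 - v_norm(uniform) > 1 - v_norm(skewed)             # H_h2 ordering
```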
Of particular interest in this dissertation are the quadratic entropies $H_{R2}$ and $H_{h2}$, which are both monotonic decreasing functions of the "entropy 2-norm" $V_2$ of the probability distribution $p$ and are related to the Euclidean distance from the point $p$ to the origin. Entropy maximization is equivalent to the minimization of $V_2$, and entropy minimization is equivalent to the maximization of $V_2$. Moreover, since both $H_{R2}$ and $H_{h2}$ are lower bounds of Shannon's entropy, they might be more efficient than Shannon's entropy for entropy maximization.

For a continuous variable $Y$, the probability density function $f_Y(y)$ is a point in a functional space. All the pdfs $f_Y(y)$ constitute a similar region in a "hyperplane" defined by $\int_{-\infty}^{+\infty} f_Y(y) \, dy = 1$ and $f_Y(y) \ge 0$. A similar geometrical interpretation can also be given to the differential entropies. In particular, we have the "entropy $\alpha$-norm" as

$$V_{\alpha} = \int_{-\infty}^{+\infty} f_Y(y)^{\alpha} \, dy, \qquad V_2 = \int_{-\infty}^{+\infty} f_Y(y)^2 \, dy \eqno(2.20)$$

2.1.4 Mutual Information

Mutual information (MI) measures the relationship between two variables and thus is more desirable in many cases. Following Shannon [Sha48, Sha62], the mutual information between two random variables $X_1$ and $X_2$ is defined as

$$I_s(X_1, X_2) = \iint f_{X_1 X_2}(x_1, x_2) \log \frac{f_{X_1 X_2}(x_1, x_2)}{f_{X_1}(x_1) f_{X_2}(x_2)} \, dx_1 \, dx_2 \eqno(2.21)$$

where $f_{X_1 X_2}(x_1, x_2)$ is the joint pdf of the joint variable $(x_1, x_2)^T$, and $f_{X_1}(x_1)$ and $f_{X_2}(x_2)$ are the marginal pdfs for $X_1$ and $X_2$ respectively. Obviously, mutual information is symmetric; i.e., $I_s(X_1, X_2) = I_s(X_2, X_1)$. It is not difficult to show the relation between mutual information and Shannon's entropy in (2.22) [Dec96, Hay98]:

$$I_s(X_1, X_2) = H_s(X_1) - H_s(X_1|X_2) = H_s(X_2) - H_s(X_2|X_1) = H_s(X_1) + H_s(X_2) - H_s(X_1, X_2) \eqno(2.22)$$
where $H_s(X_1)$ and $H_s(X_2)$ are the marginal entropies; $H_s(X_1, X_2)$ is the joint entropy; $H_s(X_1|X_2) = H_s(X_1, X_2) - H_s(X_2)$ is the conditional entropy of $X_1$ given $X_2$, which is the measure of the uncertainty of $X_1$ when $X_2$ is given, or the uncertainty left in $(X_1, X_2)$ when the uncertainty of $X_2$ is removed; similarly, $H_s(X_2|X_1)$ is the conditional entropy of $X_2$ given $X_1$ (all entropies involved are Shannon's entropy). From (2.22), it can be seen that the mutual information is the measure of the uncertainty removed from $X_2$ when $X_1$ is given, or in other words, the mutual information is the measure of the information that $X_1$ conveys about $X_2$ (or vice versa, since the mutual information is symmetric). It provides a measure of the statistical relationship between $X_1$ and $X_2$ which contains all the statistics of the related distributions, and thus is a more general measure than a simple cross-correlation between $X_1$ and $X_2$, which only involves the second order statistics of the variables.

It can be shown that the mutual information is non-negative, or equivalently that Shannon's entropy reduces on conditioning, or that the total of the marginal entropies is an upper bound of the joint entropy; i.e.,

$$I_s(X_1, X_2) \ge 0; \qquad H_s(X_1) \ge H_s(X_1|X_2), \quad H_s(X_2) \ge H_s(X_2|X_1); \qquad H_s(X_1, X_2) \le H_s(X_1) + H_s(X_2) \eqno(2.23)$$

The mutual information can also be regarded as the Kullback-Leibler divergence (K-L divergence, also called cross-entropy) [Kul68, Dec96, Hay98] between the joint pdf $f_{X_1 X_2}(x_1, x_2)$ and the factorized marginal pdf $f_{X_1}(x_1) f_{X_2}(x_2)$. The Kullback-Leibler divergence between two pdfs $f(x)$ and $g(x)$ is defined as

$$D_k(f, g) = \int f(x) \log \frac{f(x)}{g(x)} \, dx \eqno(2.24)$$
Jensen's inequality [Dec96, Ace92] says that for a random variable $X$ and a convex function $h(x)$, the expectation of this convex function of $X$ is no less than the convex function of the expectation of $X$; i.e.,

$$E[h(X)] \ge h(E[X]), \quad \text{or} \quad \int h(x)\, f_X(x)\, dx \ge h\!\left(\int x\, f_X(x)\, dx\right) \qquad (2.25)$$

where $E[\cdot]$ is the operator of mathematical expectation and $f_X(x)$ is the pdf of $X$. From Jensen's inequality [Dec96, Kul68], or by using the derivation in Acero [Ace92], it can be shown that the Kullback-Leibler divergence is non-negative and is zero if and only if the two distributions are the same; i.e.,

$$D_K(f, g) = \int f(x)\,\log\frac{f(x)}{g(x)}\, dx \ge 0 \qquad (2.26)$$

where the equality holds if and only if $f(x) = g(x)$. So, the Kullback-Leibler divergence can be regarded as a "distance" measure between the pdfs $f(x)$ and $g(x)$. However, it is not symmetric; i.e., in general $D_K(f, g) \ne D_K(g, f)$, and it is thus called a "directed divergence." Obviously, the mutual information mentioned above is the Kullback-Leibler "distance" $D_K(f_{X_1 X_2}(x_1, x_2),\, f_{X_1}(x_1) f_{X_2}(x_2))$ from the joint pdf to the factorized marginal pdf.
Based on Renyi's entropy, we can define Renyi's divergence measure with order $\alpha$ for two pdfs $f(x)$ and $g(x)$ [Ren60, Ren6, Kap94]:

$$D_{R\alpha}(f, g) = \frac{1}{\alpha - 1}\log\int\frac{f(x)^\alpha}{g(x)^{\alpha - 1}}\, dx \qquad (2.27)$$

The relation between Renyi's divergence and the Kullback-Leibler divergence is [Kap92, Kap94]
$$\lim_{\alpha \to 1} D_{R\alpha}(f, g) = D_K(f, g) \qquad (2.28)$$

Based on Havrda-Charvat's entropy, there is also Havrda-Charvat's divergence measure with order $\alpha$ for two pdfs $f(x)$ and $g(x)$ [Hav67, Kap92, Kap94]:

$$D_{h\alpha}(f, g) = \frac{1}{\alpha - 1}\left[\int\frac{f(x)^\alpha}{g(x)^{\alpha - 1}}\, dx - 1\right] \qquad (2.29)$$

There is also a similar relation between this divergence measure and the Kullback-Leibler divergence [Kap92, Kap94]:

$$\lim_{\alpha \to 1} D_{h\alpha}(f, g) = D_K(f, g) \qquad (2.30)$$

Unfortunately, as Renyi pointed out, $D_{R\alpha}(f_{X_1 X_2}(x_1, x_2),\, f_{X_1}(x_1) f_{X_2}(x_2))$ is not appropriate as a measure of the mutual information of the variables $X_1$ and $X_2$ [Ren60]. Furthermore, all these divergence measures (Kullback-Leibler, Renyi and Havrda-Charvat) are complicated because of the integrals involved in their formulas. Therefore, they are difficult to implement in "learning from examples" and general adaptive signal processing applications, where the maximization or minimization of the measures is desired. In practice, simplicity becomes a paramount consideration. Therefore, there is a need for alternative measures which have the same maximum or minimum pdf solutions as the Kullback-Leibler divergence but at the same time are easy to implement, just like the quadratic entropy, which meets these two requirements.
For discrete variables $X_1$ and $X_2$ with probability distributions $P_{X_1}(i)$, $i = 1, \ldots, n$, and $P_{X_2}(j)$, $j = 1, \ldots, m$, respectively, and the joint probability distribution $P_X(i, j)$, $i = 1, \ldots, n$, $j = 1, \ldots, m$, Shannon's mutual information is defined as
$$I_s(X_1, X_2) = \sum_{i=1}^{n}\sum_{j=1}^{m} P_X(i, j)\,\log\frac{P_X(i, j)}{P_{X_1}(i)\, P_{X_2}(j)} \qquad (2.31)$$

2.1.5 Quadratic Mutual Information
As pointed out by Kapur [Kap92], there is no reason to restrict ourselves to Shannon's measure for entropy and to confine ourselves to Kullback-Leibler's measure for cross-entropy (density discrepancy or density distance). Entropy or cross-entropy is too deep and too complex a concept to be captured by a single measure under all conditions. The alternative measures for entropy discussed in 2.1.2 break such restrictions on entropy; in particular, there are entropies with a simple quadratic form of the pdfs. In this section, the possibility of "mutual information" measures involving only simple quadratic forms of pdfs will be discussed (the reason for using quadratic forms of pdfs will become clear later in this chapter). These measures will be called quadratic mutual information, although they may lack some properties of Shannon's mutual information.
Independence is a fundamental statistical relationship between two random variables (the extension of the idea of independence to multiple variables is not difficult; for simplicity of exposition, only the case of two variables will be discussed at this stage). It is defined when the joint pdf is equal to the factorized marginal pdfs. For instance, two variables $X_1$ and $X_2$ are independent of each other when

$$f_{X_1 X_2}(x_1, x_2) = f_{X_1}(x_1)\, f_{X_2}(x_2) \qquad (2.32)$$

where $f_{X_1 X_2}(x_1, x_2)$ is the joint pdf and $f_{X_1}(x_1)$ and $f_{X_2}(x_2)$ are the marginal pdfs. As mentioned in the previous section, the mutual information can be regarded as a distance between the joint pdf and the factorized marginal pdf in the pdf functional space. When
the distance is zero, the two variables are independent. When the distance is maximized, the two variables will be far away from the independent state and, roughly speaking, the dependence between them will be maximized.
The Euclidean distance is a simple and straightforward distance measure for two pdfs. The squared distance between the joint pdf and the factorized marginal pdf will be called the Euclidean distance quadratic mutual information (ED-QMI). It is defined as

$$D_{ED}(f, g) = \int (f(x) - g(x))^2\, dx, \qquad I_{ED}(X_1, X_2) = D_{ED}(f_{X_1 X_2}(x_1, x_2),\, f_{X_1}(x_1) f_{X_2}(x_2)) \qquad (2.33)$$

Obviously, the ED-QMI between $X_1$ and $X_2$, $I_{ED}(X_1, X_2)$, is non-negative and is zero if and only if $f_{X_1 X_2}(x_1, x_2) = f_{X_1}(x_1) f_{X_2}(x_2)$; i.e., $X_1$ and $X_2$ are independent of each other. So, it is appropriate for measuring the independence between $X_1$ and $X_2$. Although there is no strict theoretical justification yet that the ED-QMI is an appropriate measure of the dependence between two variables, the experimental results described later in this dissertation and the comparison between ED-QMI and Shannon's mutual information in some special cases described later in this chapter all support that ED-QMI is appropriate for measuring the degree of dependence between two variables; in particular, the maximization of this quantity gives reasonable results. For multiple variables, the extension of ED-QMI is straightforward:

$$I_{ED}(X_1, \ldots, X_k) = D_{ED}\!\left(f_X(x_1, \ldots, x_k),\ \prod_{i=1}^{k} f_{X_i}(x_i)\right) \qquad (2.34)$$

where $f_X(x_1, \ldots, x_k)$ is the joint pdf and $f_{X_i}(x_i)$ $(i = 1, \ldots, k)$ are the marginal pdfs.
Another possible pdf distance measure is based on the Cauchy-Schwartz inequality [Har34]: $\left(\int f(x)^2\, dx\right)\left(\int g(x)^2\, dx\right) \ge \left(\int f(x) g(x)\, dx\right)^2$, where equality holds if and only if $f(x) = \zeta g(x)$ for a constant scalar $\zeta$. If $f(x)$ and $g(x)$ are pdfs, i.e., $\int f(x)\, dx = 1$ and $\int g(x)\, dx = 1$, then $f(x) = \zeta g(x)$ implies $\zeta = 1$. So, for two pdfs $f(x)$ and $g(x)$, equality holds if and only if $f(x) = g(x)$. Thus, we may define the Cauchy-Schwartz distance for two pdfs as

$$D_{CS}(f, g) = \log\frac{\left(\int f(x)^2\, dx\right)\left(\int g(x)^2\, dx\right)}{\left(\int f(x)\, g(x)\, dx\right)^2} \qquad (2.35)$$

Obviously, $D_{CS}(f, g) \ge 0$, with equality if and only if $f(x) = g(x)$ almost everywhere, and the integrals involved are all quadratic forms of the pdfs. Based on $D_{CS}(f, g)$, we have the Cauchy-Schwartz quadratic mutual information (CS-QMI) between two variables $X_1$ and $X_2$ as

$$I_{CS}(X_1, X_2) = D_{CS}(f_{X_1 X_2}(x_1, x_2),\, f_{X_1}(x_1) f_{X_2}(x_2)) \qquad (2.36)$$

where the notations are the same as above. Directly from the above, we have $I_{CS}(X_1, X_2) \ge 0$, with equality if and only if $X_1$ and $X_2$ are independent of each other. So, $I_{CS}$ is an appropriate measure of independence. However, the experimental results show that it might not be appropriate as a dependence measure. For multiple variables, the extension is also straightforward:

$$I_{CS}(X_1, \ldots, X_k) = D_{CS}\!\left(f_X(x_1, \ldots, x_k),\ \prod_{i=1}^{k} f_{X_i}(x_i)\right) \qquad (2.37)$$
For the discrete variables $X_1$ and $X_2$ with probability distributions $P_{X_1}(i)$, $i = 1, \ldots, n$, and $P_{X_2}(j)$, $j = 1, \ldots, m$, respectively, and the joint probability distribution $P_X(i, j)$, $i = 1, \ldots, n$, $j = 1, \ldots, m$, the ED-QMI and CS-QMI are

$$I_{ED}(X_1, X_2) = \sum_{i=1}^{n}\sum_{j=1}^{m}\left(P_X(i, j) - P_{X_1}(i)\, P_{X_2}(j)\right)^2$$
$$I_{CS}(X_1, X_2) = \log\frac{\left(\displaystyle\sum_{i=1}^{n}\sum_{j=1}^{m} P_X(i, j)^2\right)\left(\displaystyle\sum_{i=1}^{n}\sum_{j=1}^{m}\left(P_{X_1}(i)\, P_{X_2}(j)\right)^2\right)}{\left(\displaystyle\sum_{i=1}^{n}\sum_{j=1}^{m} P_X(i, j)\, P_{X_1}(i)\, P_{X_2}(j)\right)^2} \qquad (2.38)$$

Figure 2-2. A Simple Example (two binary variables $X_1$ and $X_2$ with joint probabilities $P_X(1,1)$, $P_X(1,2)$, $P_X(2,1)$, $P_X(2,2)$ and marginals $P_{X_1}(1)$, $P_{X_1}(2)$, $P_{X_2}(1)$, $P_{X_2}(2)$)
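To make the discrete definitions (2.31) and (2.38) concrete, the following short sketch evaluates $I_s$, $I_{ED}$ and $I_{CS}$ for a joint probability table such as the one in Figure 2-2. It is only an illustration (plain Python/NumPy, not part of the original dissertation); the function name and the particular probability values are my own choices, made to match the marginals used in the example below.

```python
import numpy as np

def discrete_measures(P):
    """P[i, j] = P_X(i, j); returns (I_s, I_ED, I_CS) of eqs. (2.31) and (2.38)."""
    P = np.asarray(P, dtype=float)
    p1 = P.sum(axis=1)                     # marginal P_X1(i)
    p2 = P.sum(axis=0)                     # marginal P_X2(j)
    F = np.outer(p1, p2)                   # factorized marginal P_X1(i) * P_X2(j)
    nz = P > 0                             # zero cells contribute nothing to I_s
    I_s = np.sum(P[nz] * np.log(P[nz] / F[nz]))        # Shannon MI, eq. (2.31)
    I_ed = np.sum((P - F) ** 2)                        # ED-QMI, eq. (2.38)
    I_cs = np.log(np.sum(P ** 2) * np.sum(F ** 2)
                  / np.sum(P * F) ** 2)                # CS-QMI, eq. (2.38)
    return I_s, I_ed, I_cs

# Marginals as in the text: P_X1 = (0.6, 0.4), P_X2 = (0.3, 0.7).
independent = np.outer([0.6, 0.4], [0.3, 0.7])   # joint = product of marginals
dependent   = np.array([[0.0, 0.6],
                        [0.3, 0.1]])             # same marginals, strong dependence
print(discrete_measures(independent))  # all three measures are zero
print(discrete_measures(dependent))    # all three measures are positive
```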
Figure 2-3. The Surfaces and Contours of $I_s$, $I_{ED}$ and $I_{CS}$ vs. $P_X(1,1)$ and $P_X(2,1)$

To get an idea of how similar and how different the measures $I_s$, $I_{ED}$ and $I_{CS}$ are, let's look at a simple case with two discrete random variables $X_1$ and $X_2$. As shown in Figure 2-2, $X_1$ will be either 1 or 2 and its probability distribution is $P_{X_1} = (P_{X_1}(1), P_{X_1}(2))$; i.e., $P(X_1 = 1) = P_{X_1}(1)$ and $P(X_1 = 2) = P_{X_1}(2)$. Similarly, $X_2$ can also be either 1 or 2 with
the probability distribution $P_{X_2} = (P_{X_2}(1), P_{X_2}(2))$ ($P(X_2 = 1) = P_{X_2}(1)$ and $P(X_2 = 2) = P_{X_2}(2)$). The joint probability distribution is $P_X = (P_X(1,1), P_X(1,2), P_X(2,1), P_X(2,2))$; i.e., $P((X_1, X_2) = (1,1)) = P_X(1,1)$, $P((X_1, X_2) = (1,2)) = P_X(1,2)$, $P((X_1, X_2) = (2,1)) = P_X(2,1)$ and $P((X_1, X_2) = (2,2)) = P_X(2,2)$. Obviously, $P_{X_1}(1) = P_X(1,1) + P_X(1,2)$, $P_{X_1}(2) = P_X(2,1) + P_X(2,2)$, $P_{X_2}(1) = P_X(1,1) + P_X(2,1)$ and $P_{X_2}(2) = P_X(1,2) + P_X(2,2)$.
First, let's look at the case with the distribution of $X_1$ fixed at $P_{X_1} = (0.6, 0.4)$. Then the free parameters left are $P_X(1,1)$ from 0 to 0.6 and $P_X(2,1)$ from 0 to 0.4. When $P_X(1,1)$ and $P_X(2,1)$ change in these ranges, the values of $I_s$, $I_{ED}$ and $I_{CS}$ can be calculated. Figure 2-3 shows how these values change with $P_X(1,1)$ and $P_X(2,1)$, where the left graphs are surfaces for $I_s$, $I_{ED}$ and $I_{CS}$ versus $P_X(1,1)$ and $P_X(2,1)$; the right graphs are the contours of the corresponding left surfaces (contour means that each line has the same value). These graphs show that although the surfaces or contours of the three measures are different, they reach the minimum value 0 on the same line $P_X(1,1) = 1.5\, P_X(2,1)$, where the joint probabilities equal the corresponding factorized marginal probabilities. And the maximum values, although different, are also reached at the same points $(P_X(1,1), P_X(2,1)) = (0.6, 0)$ and $(0, 0.4)$, where the joint probabilities are $(P_X(1,1), P_X(1,2), P_X(2,1), P_X(2,2)) = (0.6, 0, 0, 0.4)$ and $(0, 0.6, 0.4, 0)$ respectively. These are just the cases where $X_1$ and $X_2$ have a 1-to-1 relation; i.e., $X_1$ can determine $X_2$ without any uncertainty, and vice versa.
If the marginal probability of $X_2$ is further fixed, e.g. $P_{X_2} = (0.3, 0.7)$, then the free parameter $P_X(1,1)$ can be from 0 to 0.3. In this case, both marginal probabilities of $X_1$ and $X_2$ are fixed and the factorized marginal probability distribution is thus fixed, and only the
joint probability distribution will change. This case can also be regarded as the previous case with the further constraint $P_X(1,1) + P_X(2,1) = 0.3$. Figure 2-4 shows how the three measures change with $P_X(1,1)$ in this case, from which we can see that the minima are reached at the same point $P_X(1,1) = 0.18$, and the maxima are also reached at the same point $P_X(1,1) = 0$; i.e., $(P_X(1,1), P_X(1,2), P_X(2,1), P_X(2,2)) = (0, 0.6, 0.3, 0.1)$.

Figure 2-4. $I_s$, $I_{ED}$ and $I_{CS}$ vs. $P_X(1,1)$

From this simple example, we can see that although the three measures are different, they have the same minimum points and also the same maximum points in this particular case. It is known that both Shannon's mutual information and ED-QMI are convex functions of pdfs [Kap92]. From the above graphs, we can confirm this fact and also come to the conclusion that CS-QMI is not a convex function of pdfs. On the
whole, we can say that the similarity between Shannon's mutual information and ED-QMI is confirmed by their convexity with the guaranteed same minimum points.

Figure 2-5. Illustration of the Geometrical Interpretation of Mutual Information ($I_s$ is the K-L divergence between the joint pdf $f_{X_1 X_2}(x_1, x_2)$ and the factorized marginal pdf $f_{X_1}(x_1) f_{X_2}(x_2)$; $I_{ED}$ is the Euclidean distance between them; $I_{CS} = -\log((\cos\theta)^2)$, where $\theta$ is the angle between them)

2.1.6 Geometrical Interpretation of Mutual Information
From the previous section, we can see that both ED-QMI and CS-QMI have the following three terms in their formulas:

$$V_J = \iint f_{X_1 X_2}(x_1, x_2)^2\, dx_1\, dx_2, \qquad V_M = \iint\left(f_{X_1}(x_1)\, f_{X_2}(x_2)\right)^2 dx_1\, dx_2$$
$$V_c = \iint f_{X_1 X_2}(x_1, x_2)\, f_{X_1}(x_1)\, f_{X_2}(x_2)\, dx_1\, dx_2 \qquad (2.39)$$

where $V_J$ is obviously the "entropy 2-norm" (the squared 2-norm) of the joint pdf, $V_M$ is the "entropy 2-norm" of the factorized marginal pdf, and $V_c$ is the cross-correlation or inner product between the joint pdf and the factorized marginal pdf. With these three terms, QMI can be expressed as
$$I_{ED} = V_J - 2 V_c + V_M, \qquad I_{CS} = \log V_J - 2\log V_c + \log V_M \qquad (2.40)$$

Figure 2-5 illustrates the geometrical interpretation of all these quantities. $I_s$, as previously mentioned, is the K-L divergence between the joint pdf and the factorized marginal pdf, $I_{ED}$ is the squared Euclidean distance between these two pdfs, and $I_{CS}$ is related to the angle $\theta$ between these two pdfs.
Note that $V_M$ can be factorized into two marginal information potentials $V_1$ and $V_2$:

$$V_M = \iint\left(f_{X_1}(x_1)\, f_{X_2}(x_2)\right)^2 dx_1\, dx_2 = V_1 V_2, \qquad V_1 = \int f_{X_1}(x_1)^2\, dx_1, \qquad V_2 = \int f_{X_2}(x_2)^2\, dx_2 \qquad (2.41)$$

2.1.7 Energy and Entropy for Gaussian Signal
It is well known that for a Gaussian random variable $X = (x_1, \ldots, x_k)^T \in R^k$ with pdf $f_X(x) = \frac{1}{(2\pi)^{k/2}|\Sigma|^{1/2}}\exp\!\left(-\frac{1}{2}(x - \mu)^T\Sigma^{-1}(x - \mu)\right)$, where $\mu$ is the mean and $\Sigma$ is the covariance matrix, Shannon's information entropy is

$$H_s(X) = \frac{1}{2}\log|\Sigma| + \frac{k}{2}\log 2\pi + \frac{k}{2} \qquad (2.42)$$

(see Appendix B for the derivation).
Similarly, we can get Renyi's information entropy for $X$:

$$H_{R\alpha}(X) = \frac{1}{2}\log|\Sigma| + \frac{k}{2}\log 2\pi + \frac{k}{2}\cdot\frac{\log\alpha}{\alpha - 1} \qquad (2.43)$$

(The derivation is given in Appendix C.)
For Havrda-Charvat's entropy, we have
$$H_{h\alpha}(X) = \frac{1}{1 - \alpha}\left[(2\pi)^{\frac{k}{2}(1-\alpha)}\,\alpha^{-\frac{k}{2}}\,|\Sigma|^{\frac{1}{2}(1-\alpha)} - 1\right] \qquad (2.44)$$

(The derivation is given in Appendix D.)
Obviously, $\lim_{\alpha\to 1} H_{R\alpha}(X) = H_s(X)$ and $\lim_{\alpha\to 1} H_{h\alpha}(X) = H_s(X)$ in this case, which are consistent with (2.17) and (2.18) respectively.
Since $k$ and $\alpha$ in (2.42), (2.43) and (2.44) have nothing to do with the data, the data-dependent quantity is $\log|\Sigma|$ or $\Sigma$. From the information-theoretic point of view, a measure of information using energy quantities (the elements in the covariance matrix $\Sigma$) is $J_I$ in (2.4) and (2.8), or just $J_I = \log|\Sigma|$.
If the diagonal elements of $\Sigma$ are $\sigma_i^2$ ($i = 1, \ldots, k$), i.e., the variance of the marginal signal $x_i$ is $\sigma_i^2$, then the Shannon and Renyi marginal entropies are $H_s(x_i) = \frac{1}{2}\log\sigma_i^2 + \frac{1}{2}\log 2\pi + \frac{1}{2}$ and $H_{R\alpha}(x_i) = \frac{1}{2}\log\sigma_i^2 + \frac{1}{2}\log 2\pi + \frac{1}{2}\cdot\frac{\log\alpha}{\alpha-1}$; thus we have

$$\sum_{i=1}^{k} H_s(x_i) = \frac{1}{2}\log\prod_{i=1}^{k}\sigma_i^2 + \frac{k}{2}\log 2\pi + \frac{k}{2}, \qquad \sum_{i=1}^{k} H_{R\alpha}(x_i) = \frac{1}{2}\log\prod_{i=1}^{k}\sigma_i^2 + \frac{k}{2}\log 2\pi + \frac{k}{2}\cdot\frac{\log\alpha}{\alpha - 1} \qquad (2.45)$$

So, $J_3 = \log\prod_{i=1}^{k}\sigma_i^2$ in (2.8) is related to the sum of the marginal Shannon or Renyi entropies. For Shannon's entropy, we generally have (2.23) and its generalization (2.46) [Dec96, Hay98]:

$$\sum_{i=1}^{k} H_s(x_i) \ge H_s(X) \qquad (2.46)$$
Applying (2.42) and (2.45) to (2.46), we get Hadamard's inequality (2.5). So, Hadamard's inequality can be regarded as a special case of (2.46) when the variable is Gaussian distributed.
The most popular energy quantity used in practice is $J_2$ in (2.8):

$$J_2 = \mathrm{tr}(\Sigma) = \frac{1}{N}\sum_{n=1}^{N}\sum_{i=1}^{k}\left(x_i(n) - \mu_i\right)^2 \qquad (2.47)$$

where $\mu = (\mu_1, \ldots, \mu_k)^T$ and $\mu_i$ is the mean of the marginal signal $x_i$. The geometrical meaning of $J_2$ is the average of the squared Euclidean distance from the data points to the "mean point." If the signal is an error signal, this is the so-called MSE (mean squared error) criterion, which is widely applied in learning and adaptive systems. This criterion is not directly related to the information measure of the signal. Only when the signal is white Gaussian with zero mean do $J_2$ and $J_I$ become equivalent, as (2.9) shows. So, from the information-theoretic point of view, when an MSE criterion is used, it implicitly assumes that the error signal is white Gaussian with zero mean.
As mentioned in 2.1.1, $J_1$ is basically the determinant of $\Sigma$, which is the product of all the eigenvalues of $\Sigma$ and can be regarded as a geometrical average of all the eigenvalues, while $J_2$ is the trace of $\Sigma$, which is the sum of all the eigenvalues and can be regarded as an arithmetic average of all the eigenvalues. Note that $|\Sigma| = 0$ cannot guarantee zero energy for all the marginal signals, but the maximization of $|\Sigma|$ can maximize the joint entropy of $X$; while the maximization of $\mathrm{tr}[\Sigma]$ cannot guarantee the maximum joint entropy of $X$, the minimization of $\mathrm{tr}[\Sigma]$ can make all the marginal signals zero. This is possibly the reason why the minimization of MSE is so popular in practice.
2.1.8 Cross-Correlation and Mutual Information for Gaussian Signal
Suppose $X = (x_1, x_2)^T$ is a zero-mean (without loss of generality, because both cross-correlation and mutual information have nothing to do with the mean) Gaussian random variable with covariance matrix $\Sigma = \begin{pmatrix}\sigma_1^2 & r \\ r & \sigma_2^2\end{pmatrix}$. The joint pdf is

$$f(x_1, x_2) = \frac{1}{2\pi|\Sigma|^{1/2}}\, e^{-\frac{1}{2}X^T\Sigma^{-1}X} \qquad (2.48)$$

and the two marginal pdfs are

$$f_1(x_1) = \frac{1}{\sqrt{2\pi}\,\sigma_1}\, e^{-\frac{x_1^2}{2\sigma_1^2}}, \qquad f_2(x_2) = \frac{1}{\sqrt{2\pi}\,\sigma_2}\, e^{-\frac{x_2^2}{2\sigma_2^2}} \qquad (2.49)$$

Shannon's mutual information is

$$I_s(x_1, x_2) = H_s(x_1) + H_s(x_2) - H_s(x_1, x_2) = \frac{1}{2}\log\frac{1}{1 - \rho^2} \qquad (2.50)$$

where $\rho$ is the correlation coefficient between $x_1$ and $x_2$, $\rho^2 = r^2 / (\sigma_1^2\sigma_2^2)$.
By using (A.1) in Appendix A and letting $\beta = \sigma_1\sigma_2$, we then have

$$V_J = \iint f(x_1, x_2)^2\, dx_1\, dx_2 = \frac{1}{4\pi\beta\sqrt{1 - \rho^2}}, \qquad V_M = \iint f_1(x_1)^2 f_2(x_2)^2\, dx_1\, dx_2 = \frac{1}{4\pi\beta}$$
$$V_c = \iint f(x_1, x_2)\, f_1(x_1)\, f_2(x_2)\, dx_1\, dx_2 = \frac{2}{4\pi\beta\sqrt{4 - \rho^2}} \qquad (2.51)$$

The ED-QMI and CS-QMI then will be
$$I_{ED}(x_1, x_2) = \frac{1}{4\pi\beta}\left[\frac{1}{\sqrt{1 - \rho^2}} - \frac{4}{\sqrt{4 - \rho^2}} + 1\right], \qquad I_{CS}(x_1, x_2) = \log\frac{4 - \rho^2}{4\sqrt{1 - \rho^2}} \qquad (2.52)$$

Figure 2-6. Mutual informations ($I_s$, $I_{ED}$ with $\beta = 0.5$, and $I_{CS}$) vs. the correlation coefficient $\rho$ for the Gaussian distribution

Similar to $I_s$, $I_{CS}$ is a function of only one parameter $\rho$, and both are monotonically increasing functions of $\rho$ with the same minimum value 0, the same minimum point $\rho = 0$ and the same maximum point $\rho = 1$, in spite of the difference in their maximum values. $I_{ED}$ is a function of two parameters $\rho$ and $\beta$. However, $\beta$ only serves as a scale factor of the function and cannot change the shape of the function. Once $\beta$ is fixed, $I_{ED}$ will be a monotonically increasing function of $\rho$ with the same minimum value 0, the same minimum point $\rho = 0$ and the same maximum point $\rho = 1$ as $I_s$ and $I_{CS}$, in spite of the difference in the maximum values. Figure 2-6 shows these curves, which tell us that the two
proposed ED-QMI and CS-QMI are consistent with Shannon's MI in the Gaussian case regarding the minimum and maximum points.
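As a numerical sanity check on (2.50) and (2.52), the sketch below (my own illustration in Python/NumPy, not part of the dissertation; the value $\beta = 0.5$ is chosen only to match the curve plotted in Figure 2-6) evaluates the three measures on a grid of correlation coefficients and confirms that all three vanish at $\rho = 0$ and increase monotonically toward $\rho = 1$.

```python
import numpy as np

def gaussian_mi_measures(rho, beta=0.5):
    """Closed-form I_s, I_ED, I_CS for a zero-mean 2-D Gaussian, eqs. (2.50) and (2.52)."""
    rho = np.asarray(rho, dtype=float)
    I_s  = 0.5 * np.log(1.0 / (1.0 - rho**2))                        # Shannon MI
    I_ed = (1.0 / (4.0 * np.pi * beta)) * (1.0 / np.sqrt(1.0 - rho**2)
            - 4.0 / np.sqrt(4.0 - rho**2) + 1.0)                     # ED-QMI
    I_cs = np.log((4.0 - rho**2) / (4.0 * np.sqrt(1.0 - rho**2)))    # CS-QMI
    return I_s, I_ed, I_cs

rho = np.linspace(0.0, 0.99, 100)
I_s, I_ed, I_cs = gaussian_mi_measures(rho)
print(I_s[0], I_ed[0], I_cs[0])        # all ~0 at rho = 0
print(np.all(np.diff(I_s) > 0), np.all(np.diff(I_ed) > 0), np.all(np.diff(I_cs) > 0))
```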
2.2 Empirical Energy, Entropy and MI: Problem and Literature Review
In the previous section 2.1, the concepts of various energy, entropy and mutual information quantities were introduced. In practice, we face the problem of estimating these quantities from given sample data. In this section, the empirical energy, entropy and MI problems will be discussed, and the related literature review will be given.

2.2.1 Empirical Energy
The problem of empirical energy is relatively simple and straightforward. For a given data set $\{a(i) = (a_1(i), \ldots, a_n(i))^T,\ i = 1, \ldots, N\}$ of an n-D signal $X = (x_1, \ldots, x_n)^T$, it is not difficult to estimate the means and variances of the marginal signals and the covariance between the marginal signals. We have the sample mean and sample covariance matrix as follows [Dud73, Dud98]:

$$m_i = \frac{1}{N}\sum_{j=1}^{N} a_i(j), \quad i = 1, \ldots, n; \qquad \Sigma = \frac{1}{N}\sum_{j=1}^{N}\left(a(j) - m\right)\left(a(j) - m\right)^T \qquad (2.53)$$

where $m = (m_1, \ldots, m_n)^T$. These are the results of maximum likelihood estimation [Dud73, Dud98].

2.2.2 Empirical Entropy and Mutual Information: The Problem
As shown in the previous section 2.1, the entropy and mutual information all rely on the probability density function (pdf) of the variables, thus they use all the statistics of the
variables, but are more complicated and difficult to implement than the energy. To esti-
mate the entropy or mutual information, the first thing we need to do is to estimate the pdf
of the variables, then the entropy and mutual information can be calculated according to
the formula described in the previous section 2.1. For continuous variables, there are inev-
itable integrals in all the entropy and mutual information definitions described in 2.1,
which is the major difficulty after pdf estimation. Thus, the pdf estimation and the mea-
sures for entropy and mutual information should be appropriately chosen so that the corre-
sponding integrals can be simplified. In the rest of this chapter, we will see the importance
of the choice in practice. Different empirical entropies or mutual informations are actually
the results of different choices.
If a priori knowledge about the data distribution is known or a model is assumed, then
parametric methods can be used to estimate the pdf model parameters, and then the entro-
pies and mutual informations can be estimated based on the model and the estimated
parameters. However, in many real world problems the only available information about
the domain is contained in the data collected and there is no a priori knowledge about the
data. It is therefore practically significant to estimate the entropy of a variable or the
mutual information between variables based merely on the given data samples, without
further assumption or any a priori model assumed. Thus, we are actually seeking nonpara-
metric ways for the estimation of entropies and mutual informations.
Formally, the problems can be described as follows:
• The Nonparametric Entropy Estimation: given a data set $\{a(i),\ i = 1, \ldots, N\}$ for a signal $X$ ($X$ can be a scalar or an n-D signal), how to estimate the entropy of $X$ without any other information or assumptions.
• The Nonparametric Mutual Information Estimation: given a data set $\{a(i) = (a_1(i), a_2(i))^T,\ i = 1, \ldots, N\}$ for a signal $X = (x_1, x_2)^T$ ($x_1$ and $x_2$ can be scalar or n-D signals, and their dimensions can be different), how to estimate the mutual information between $x_1$ and $x_2$ without any assumption. This scheme can be easily extended to the mutual information of multiple signals.
For nonparametric methods, there are still two major difficulties: the non-parametric
pdf estimation and the calculation of the integrals involved in the entropy and mutual
information measures. In the following, the literature review on these two aspects will be
given.
2.2.3 Nonparametric Density Estimation
The literature of nonparametric density estimation is fairly extensive. A complete dis-
cussion on this topic in such a small section is virtually impossible. Here, only a brief
review on the relevant methods such as histogram, Parzen window method, orthogonal
series estimates, mixture model, etc. will be given.
• Histogram [Sil86, Weg72]:
The histogram is the oldest and most widely used density estimator. For a 1-D variable $x$, given an origin $x_0$ and a bin width $h$, the bins for the histogram can be defined as the intervals $[x_0 + mh,\ x_0 + (m+1)h)$. The histogram is then defined by

$$f(x) = \frac{1}{Nh}\,(\text{number of samples in the same bin as } x) \qquad (2.54)$$

The histogram can be generalized by allowing the bin widths to vary. Formally, suppose the real line has been dissected into bins; then the histogram can be
$$f(x) = \frac{1}{N}\cdot\frac{(\text{number of samples in the same bin as } x)}{(\text{width of the bin containing } x)} \qquad (2.55)$$

For a multi-dimensional variable, the histogram presents several difficulties. First, contour diagrams to represent the data cannot be easily drawn. Second, the problem of choosing the origin and the bins (or cells) is exacerbated. Third, if rectangular bins are used for an n-D variable and the number of bins for each marginal variable is $m$, then the total number of bins is of the order $O(m^n)$. Fourth, since the histogram discretizes each marginal variable, it is difficult to carry out further mathematical analysis.

• Orthogonal Series Estimates [Hay98, Com94, Yan97, Weg72, Sil86, Wil62, Kol94]:
This category includes the Fourier expansion, the Edgeworth expansion and the Gram-Charlier expansion, etc. We will just discuss the Edgeworth and Gram-Charlier expansions for a 1-D variable.
Without loss of generality, we assume that the random variable $x$ is zero-mean. The pdf of $x$ can be expressed in terms of the Gaussian function $G(x) = \frac{1}{\sqrt{2\pi}}e^{-x^2/2}$ as

$$f(x) = G(x)\left[1 + \sum_{k=3}^{\infty} c_k H_k(x)\right] \qquad (2.56)$$

where $c_k$ are coefficients which depend on the cumulants of $x$, e.g. $c_1 = 0$, $c_2 = 0$, $c_3 = k_3/6$, $c_4 = k_4/24$, $c_5 = k_5/120$, $c_6 = (k_6 + 10 k_3^2)/720$, $c_7 = (k_7 + 35 k_4 k_3)/5040$, $c_8 = (k_8 + 56 k_5 k_3 + 35 k_4^2)/40320$, etc. ($k_i$ are the $i$th-order cumulants); $H_k(x)$ are the Hermite polynomials, which can be defined in terms of the $k$th derivative of the Gaussian function $G(x)$ as $G^{(k)}(x) = (-1)^k G(x) H_k(x)$, or explicitly, $H_0(x) = 1$, $H_1(x) = x$, $H_2(x) = x^2 - 1$, etc., and there is a recursive
relation $H_{k+1}(x) = x H_k(x) - k H_{k-1}(x)$. Furthermore, a biorthogonality property exists between the Hermite polynomials and the derivatives of the Gaussian function:

$$\int_{-\infty}^{\infty} H_k(x)\, G^{(m)}(x)\, dx = (-1)^m m!\,\delta_{km}, \qquad k, m = 0, 1, \ldots \qquad (2.57)$$

where $\delta_{km}$ is the Kronecker delta, which is equal to 1 if $k = m$ and 0 otherwise. (2.56) is the so-called Gram-Charlier expansion. It is important to note that the natural order of the terms is not the best for the Gram-Charlier series. Rather, the grouping $k = (0), (3), (4, 6), (5, 7, 9), \ldots$ is more appropriate. In practice, the expansion has to be truncated. For BSS or ICA applications, the truncation of the series at $k = (4, 6)$ is considered to be adequate. Thus, we have

$$f(x) \approx G(x)\left[1 + \frac{k_3}{3!} H_3(x) + \frac{k_4}{4!} H_4(x) + \frac{k_6 + 10 k_3^2}{6!} H_6(x)\right] \qquad (2.58)$$

where the cumulants are $k_3 = m_3$, $k_4 = m_4 - 3 m_2^2$, $k_6 = m_6 - 15 m_2 m_4 - 10 m_3^2 + 30 m_2^3$ (the moments are $m_i = E[x^i]$).
The Edgeworth expansion, on the other hand, can be defined as

$$f(x) = G(x)\left[1 + \frac{k_3}{3!} H_3(x) + \frac{k_4}{4!} H_4(x) + \frac{10 k_3^2}{6!} H_6(x) + \frac{k_5}{5!} H_5(x) + \frac{35 k_3 k_4}{7!} H_7(x) + \frac{280 k_3^3}{9!} H_9(x) + \frac{k_6}{6!} H_6(x) + \cdots\right] \qquad (2.59)$$

There is no essential difference between the Edgeworth expansion and the Gram-Charlier expansion. The key feature of the Edgeworth expansion is that its coefficients decrease uniformly, while the terms in the Gram-Charlier expansion do not tend uniformly to zero from the viewpoint of numerical errors. This is why the terms in the Gram-Charlier expansion should be grouped as mentioned above.
Both the Edgeworth and Gram-Charlier expansions will be truncated in real applications, which makes them a kind of approximation to the pdf. Furthermore, they usually can only be used for a 1-D variable. For a multi-dimensional variable, they become very complicated. (A small numerical sketch of the truncated expansion (2.58) is given below.)
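To illustrate how the truncated Gram-Charlier expansion (2.58) is evaluated from sample moments, here is a small sketch (my own Python/NumPy illustration, not the dissertation's code). It assumes the data have already been standardized to zero mean and unit variance, which is the condition that makes $c_1 = c_2 = 0$ in (2.56); the function name and the test data are hypothetical.

```python
import numpy as np

def gram_charlier_pdf(x, samples):
    """Truncated Gram-Charlier approximation (2.58) built from sample cumulants."""
    s = np.asarray(samples, dtype=float)
    m2, m3, m4, m6 = (np.mean(s**k) for k in (2, 3, 4, 6))
    k3 = m3                                          # cumulants from moments (zero-mean data)
    k4 = m4 - 3.0 * m2**2
    k6 = m6 - 15.0 * m2 * m4 - 10.0 * m3**2 + 30.0 * m2**3
    G  = np.exp(-x**2 / 2.0) / np.sqrt(2.0 * np.pi)  # Gaussian factor G(x)
    H3 = x**3 - 3.0 * x                              # Hermite polynomials H3, H4, H6
    H4 = x**4 - 6.0 * x**2 + 3.0
    H6 = x**6 - 15.0 * x**4 + 45.0 * x**2 - 15.0
    return G * (1.0 + k3 / 6.0 * H3 + k4 / 24.0 * H4
                + (k6 + 10.0 * k3**2) / 720.0 * H6)

rng = np.random.default_rng(0)
data = rng.standard_normal(5000)        # for Gaussian data the correction terms are ~0
x = np.linspace(-4, 4, 9)
print(gram_charlier_pdf(x, data))       # close to the standard normal pdf on this grid
```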
• Parzen Window Method [Par62, Dud73, Dud98, Chr81, Vap95, Dev85]:
The Parzen window method is also called a kernel estimation method, or a potential function method. Several nonparametric methods for density estimation appeared in the 60's. Among these methods, the Parzen window method is the most popular. According to the method, one first has to determine the so-called kernel function. For simplicity and for later use in this dissertation, we consider a simple symmetric Gaussian kernel function:

$$G(x, \sigma^2) = \frac{1}{(2\pi)^{k/2}\sigma^k}\exp\!\left(-\frac{x^T x}{2\sigma^2}\right) \qquad (2.60)$$

where $\sigma$ controls the kernel size and $x$ can be an n-D variable. For a data set $\{a(i)\}$ described in 2.2.2, the density function will be

$$f(x) = \frac{1}{N}\sum_{i=1}^{N} G(x - a(i), \sigma^2) \qquad (2.61)$$

which means that each data point is occupied by a kernel function and the whole density is the average of all the kernel functions. The asymptotic theory for Parzen-type nonparametric density estimation was developed in the 70s [Dev85]. It concludes that (i) Parzen's estimator is consistent (in various metrics) for estimating densities from a very wide class of densities; (ii) the asymptotic rate of convergence for
Parzen's estimator is optimal for "smooth" densities. We will see later in this chapter how this density estimation method can be combined with the quadratic entropy and the quadratic mutual information to develop the ideas of the information potential and the cross information potential. However, the Parzen window method is selected not just for simplicity but also for its good asymptotic properties. In addition, this kernel function is actually consistent with the mass-energy spirit mentioned in Chapter 1. In fact, one data point should not only represent itself but also represent its neighborhood. The kernel function is, in this sense, much like a mass-density function. And from this point of view, it naturally introduces the idea of a field and of potential energy. We will see this in a clearer way later in this chapter. (A small sketch of the estimator (2.60)-(2.61) is given below.)
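A minimal sketch of the Parzen estimate (2.60)-(2.61) with the symmetric Gaussian kernel follows (again my own Python/NumPy rendering, not the dissertation's code; the kernel size sigma and the sample data are free, illustrative choices).

```python
import numpy as np

def gaussian_kernel(x, sigma):
    """Symmetric Gaussian kernel G(x, sigma^2) of eq. (2.60); x has shape (..., k)."""
    x = np.atleast_2d(x)
    k = x.shape[-1]
    norm = (2.0 * np.pi) ** (k / 2.0) * sigma ** k
    return np.exp(-np.sum(x**2, axis=-1) / (2.0 * sigma**2)) / norm

def parzen_pdf(x, data, sigma):
    """Parzen window estimate (2.61): average of kernels centred at the samples a(i)."""
    x = np.atleast_2d(x)            # query points, shape (M, k)
    data = np.atleast_2d(data)      # samples a(i),  shape (N, k)
    diffs = x[:, None, :] - data[None, :, :]      # (M, N, k) pairwise differences
    return gaussian_kernel(diffs, sigma).mean(axis=1)

rng = np.random.default_rng(1)
samples = rng.normal(size=(500, 2))               # a 2-D data set
grid = np.array([[0.0, 0.0], [1.0, 1.0]])
print(parzen_pdf(grid, samples, sigma=0.3))       # density estimates at two points
```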
• Mixture Model [McL88, McL96, Dem77, Rab93, Hua90]:
The mixture model is a kind of "semi-parametric" method (or we may call it semi-nonparametric). The mixture model, especially the Gaussian mixture model, has been extensively applied in various engineering areas such as the hidden Markov model in speech recognition and many other areas. Although the Gaussian mixture model assumes that the data samples come from several Gaussian sources, it can approximate quite diverse densities. Generally, the density for an n-D variable $x$ is assumed to be

$$f(x) = \sum_{k=1}^{K} c_k\, G(x - \mu_k, \Sigma_k) \qquad (2.62)$$

where $K$ is the number of mixture sources, $c_k$ are the mixture coefficients, which are non-negative and sum to 1 ($\sum_{k=1}^{K} c_k = 1$), and $\mu_k$ and $\Sigma_k$ are the means and covariance matrices for each Gaussian source, where the Gaussian function is notated by
$G(x - \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)}$ with the mean $\mu$ and covariance matrix $\Sigma$ as the parameters. All the parameters $c_k$, $\mu_k$ and $\Sigma_k$ can be estimated from the data samples by the EM algorithm in the maximum likelihood sense. One may notice the similarity between the Gaussian mixture model and the Gaussian kernel estimation method. Actually, the Gaussian kernel estimation method is the extreme case of the Gaussian mixture model where all the means are the data points themselves and all the mixture coefficients and all the covariance matrices are equal. In other words, each data point in the Gaussian kernel estimation method is treated as a Gaussian source with equal mixture coefficient and equal covariance.
There are also other nonparametric methods such as the k-nearest neighbor method [Dud73, Dud98, Sil86], the naive estimator [Sil86], etc. These estimated density functions are not "natural density functions"; i.e., their integrals are not equal to 1. And their unsmoothness at the data points also makes them difficult to apply to entropy or mutual information estimation.

2.2.4 Empirical Entropy and Mutual Information: The Literature Review
With the probability density function, we can then calculate the entropy or the mutual information, where the difficulty lies in the integrals involved. Both Shannon's entropy and Shannon's mutual information are the dominating measures used in the literature, where the logarithm usually brings big difficulties in their estimation. Some researchers tried to avoid the use of Shannon's measures in order to get some tractability. The summary of various existing methods will be given and organized in the following manner, starting with the simple histogram method.
• Histogram Based Method
If the pdf of a variable is estimated by the histogram method, the variable has to be discretized by histogram bins. Thus the integration in Shannon's entropy or mutual information becomes a summation and there is no difficulty at all in its calculation. However, this is true only for a low-dimensional variable. As pointed out in the previous section, for a high-dimensional variable, the computational complexity becomes too large for the method to be implementable. Furthermore, in spite of the simplicity it brings to the calculation, the discretization makes it impossible to carry out further mathematical analysis and to apply this method to the problem of optimization of entropy or mutual information, where differentiable continuous functions are needed for analysis. Nevertheless, such a simple method is still very useful in cases such as feature selection [Bat94], where only a static comparison of the entropy or mutual information is needed.

• The Case of Full Rank Linear Transform
From probability theory, we know that for a full rank linear transform $Y = WX$, where $X = (x_1, \ldots, x_n)^T$ and $Y = (y_1, \ldots, y_n)^T$ are vectors in an n-dimensional real space and $W$ is an n-by-n full rank matrix, there is a relation between the density function of $X$ and the density function of $Y$: $f_Y(y) = \frac{f_X(x)}{|\det(W)|}$ [Pap91], where $f_Y$ and $f_X$ are the densities of $Y$ and $X$ respectively, and $\det(\cdot)$ is the determinant operator. Accordingly, we have the relation between the entropy of $Y$ and the entropy of $X$: $H(Y) = E[-\log f_Y(y)] = E[-\log f_X(x) + \log|\det(W)|] = H(X) + \log|\det(W)|$. So, the output entropy $H(Y)$ can be expressed in terms of the input entropy $H(X)$. Although $H(X)$ may not be known, it may be fixed, and the relation can be used for
the purpose of manipulating the output entropy $H(Y)$. This is the basis for a series of methods in the BSS and ICA areas. For instance, the mutual information among the output marginal variables is $I(y_1, \ldots, y_n) = \sum_{i=1}^{n} H(y_i) - H(Y) = \sum_{i=1}^{n} H(y_i) - \log|\det(W)| - H(X)$, so that the minimization of the mutual information can be implemented by manipulating the marginal entropies and the determinant of the linear transform. In spite of its simplicity, this method, however, is obviously coupled with the structure of the transform (full rank is required, etc.), and thus is less general. (A short numerical check of the entropy relation is given below.)
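The relation $H(Y) = H(X) + \log|\det(W)|$ can be checked numerically in the Gaussian case, where (2.42) gives the entropy in closed form. The sketch below is my own check (Python/NumPy, not from the dissertation); the dimension and the random transform are arbitrary illustrative choices.

```python
import numpy as np

def gaussian_entropy(cov):
    """Shannon entropy of a k-D Gaussian, eq. (2.42): 0.5*log|Sigma| + (k/2)*log(2*pi) + k/2."""
    k = cov.shape[0]
    return 0.5 * np.log(np.linalg.det(cov)) + 0.5 * k * np.log(2.0 * np.pi) + 0.5 * k

rng = np.random.default_rng(2)
k = 3
Sigma_x = np.eye(k)                     # X ~ N(0, I)
W = rng.normal(size=(k, k))             # a (generically) full-rank transform
Sigma_y = W @ Sigma_x @ W.T             # covariance of Y = W X

lhs = gaussian_entropy(Sigma_y)
rhs = gaussian_entropy(Sigma_x) + np.log(abs(np.linalg.det(W)))
print(lhs, rhs)                         # the two numbers agree: H(Y) = H(X) + log|det W|
```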
• InfoMax Method
Let's look at a transformation $Z = (z_1, \ldots, z_n)^T$, $z_i = f(y_i)$, $(y_1, \ldots, y_n)^T = Y = WX$, where $f(\cdot)$ is a monotonic increasing (or decreasing, for cases other than BSS and ICA) function, and the linear transform is the same as the previous one. Again, from probability theory [Pap91], we have $f_Z(z) = \frac{f_Y(y)}{J(z)}$, where $f_Z$ and $f_Y$ are the densities of $Z$ and $Y$ respectively, and $J(z)$ is the Jacobian of the nonlinear transforms expressed as a function of $z$. Thus, there is the relation $H(Z) = H(Y) + E[\log J(z)] = H(X) + \log|\det(W)| + E[\log J(z)]$, where $E[\log J(z)]$ is approximated by the sample mean method [Bel95]. The maximization of the output entropy can then be manipulated through the two terms $\log|\det(W)|$ and $E[\log J(z)]$. In addition to the sample mean approximation, this method requires a match between the nonlinear function and the cdf of the source signals when applied to BSS and ICA problems.

• Nonlinear Function By the Mixture Model
The above method can be generalized by using the mixture method to model the pdf of the sources [XuL97] and then the corresponding cdf; i.e., the nonlinear functions.
Although this method avoids the arbitrary assumption on the cdf of the sources, it still suffers from problems such as the coupling with the structure of a learning machine.

• Numerical Method
The integration involved in the calculation of the entropy or mutual information is usually complicated. A numerical method can be used to calculate the integration. However, this method can only be used for low-dimensional variables. [Pha96] used the Parzen window method to estimate the marginal density and applied this method to the calculation of the marginal entropies needed in the calculation of the mutual information of the outputs of the linear transform described above. As pointed out by [Vio95], the integration in Shannon's entropy or mutual information becomes extremely complicated when the Parzen window is used for the density estimation. Applying the numerical method makes the calculation possible but restricts itself to simple cases, and the method is also coupled with the structure of the learning machine.

• Edgeworth and Gram-Charlier Expansion Based Method
As described above, both expansions can be expressed in the form $f(x) = G(x)(1 + A(x))$, where $A(x)$ is a polynomial. By using the Taylor expansion, we have $\log(1 + A(x)) \approx A(x) - \frac{A(x)^2}{2} = B(x)$ for relatively small $A(x)$. Then $H(x) = -\int f(x)\log f(x)\, dx = -\int G(x)(1 + A(x))(\log G(x) + B(x))\, dx$. Noticing that $G(x)$ is the Gaussian function and $A(x)$ and $B(x)$ are polynomials, this integration has an analytical result. Thus a relation between the entropy and the coefficients of the polynomials $A(x)$ and $B(x)$ (i.e. the sample cumulants of the variable) can be established. Unfortunately, this method can only be used for a 1-D variable, and
thus it is usually used in the calculation of the mutual information described above for BSS and ICA problems [Yan97, Yan98, Hay98].

• Parzen Window and Sample Mean
Similar to [Pha96], [Vio95] also uses the Parzen window method for the pdf estimation. To avoid the complicated integration, [Vio95] used the sample mean to approximate the integration rather than the numerical method in Pham [Pha96]. This is clear when we express the entropy as $H(x) = E[-\log f(x)]$. This method can be used not only for 1-D variables but also for n-D variables. Although this method is flexible, its sample mean approximation restricts its precision.

• An Indirect Method Based on Parzen Window Estimation
Fisher [Fis97] uses an indirect way for entropy optimization. If $Y$ is the output of a mapping and is bounded in a rectangular-type region $D = \{y \mid a_i \le y_i \le b_i,\ i = 1, \ldots, k\}$, then the uniform distribution will have the maximum entropy. So, for the purpose of entropy maximization, one can set up an MSE criterion as

$$J = \frac{1}{2}\int\left(u(y) - f_Y(y)\right)^2 dy, \qquad u(y) = \begin{cases}\displaystyle\prod_{i=1}^{k}\frac{1}{b_i - a_i}, & y \in D \\ 0, & \text{otherwise}\end{cases} \qquad (2.63)$$

where $u(y)$ is the uniform pdf in the region $D$ and $f_Y(y)$ is the estimated pdf of the output $y$ obtained by the Parzen window method described in the previous section. The gradient method can be used for the minimization of $J$. As an example, the partial derivatives of $J$ with respect to $w_{ij}$ are
$$\frac{\partial J}{\partial w_{ij}} = \sum_{p=1}^{k}\sum_{n=1}^{N}\frac{\partial J}{\partial y_p(n)}\,\frac{\partial y_p(n)}{\partial w_{ij}} \qquad (2.64)$$

where $y(n)$ are samples of the output. The partial derivative of the mean squared difference with respect to the output samples can be broken down as

$$\frac{\partial J}{\partial y(n)} = \frac{1}{N} K_u(y(n)) - \frac{1}{N^2}\sum_{i=1}^{N} K_G(y(i) - y(n)) \qquad (2.65)$$

where $G_g(z) = \frac{\partial}{\partial z}G(z, \sigma^2)$ is the gradient of the Gaussian kernel, $K_u(z) = u(z)\ast G_g(z) = \int u(y)\, G_g(z - y)\, dy$ is the convolution between the uniform pdf and the gradient of the Gaussian kernel, and $K_G(z) = G(z, \sigma^2)\ast G_g(z) = \int G(y, \sigma^2)\, G_g(z - y)\, dy$ is the convolution between the Gaussian kernel and its gradient. As shown in Fisher [Fis97], the convolution $K_G(z)$ turns out to be

$$K_G(z) = -\frac{1}{2^{\frac{3k}{4}+1}\,\pi^{\frac{k}{4}}\,\sigma^{\frac{k}{2}+2}}\, G(z, \sigma^2)^{1/2}\, z \qquad (2.66)$$

If the domain $D$ is symmetric, i.e., $b_i = -a_i = a/2$, $i = 1, \ldots, k$, then the convolution $K_u(z)$ is

$$K_u(z) = \frac{1}{a^k}\begin{bmatrix}\displaystyle\prod_{i\ne 1}\frac{1}{2}\left[\mathrm{erf}\!\left(\frac{z_i + \frac{a}{2}}{\sqrt{2}\,\sigma}\right) - \mathrm{erf}\!\left(\frac{z_i - \frac{a}{2}}{\sqrt{2}\,\sigma}\right)\right]\left[G_1\!\left(z_1 + \frac{a}{2}, \sigma^2\right) - G_1\!\left(z_1 - \frac{a}{2}, \sigma^2\right)\right]\\ \vdots \\ \displaystyle\prod_{i\ne k}\frac{1}{2}\left[\mathrm{erf}\!\left(\frac{z_i + \frac{a}{2}}{\sqrt{2}\,\sigma}\right) - \mathrm{erf}\!\left(\frac{z_i - \frac{a}{2}}{\sqrt{2}\,\sigma}\right)\right]\left[G_k\!\left(z_k + \frac{a}{2}, \sigma^2\right) - G_k\!\left(z_k - \frac{a}{2}, \sigma^2\right)\right]\end{bmatrix} \qquad (2.67)$$
where $z = (z_1, \ldots, z_k)^T$, $G(z, \sigma^2)$ is the same as in (2.60), and $\mathrm{erf}(x) = \frac{1}{\sqrt{2\pi}}\int_{-x}^{x}\exp\!\left(-\frac{t^2}{2}\right) dt$ is the error function.
This method is indirect and still depends on the topology of the network. But it also shows the flexibility gained by using the Parzen window method. It has been used in practice with good results for the MACE [Fis97].
Summarizing the above, we see that there is no direct, efficient nonparametric method to estimate the entropy or mutual information for a given discrete data set which is decoupled from the structure of the learning machine and can be applied to n-D variables. In the next sections, we will show how the quadratic entropy and the quadratic mutual information, rather than Shannon's entropy and mutual information, can be combined with the Gaussian kernel estimation of pdfs to develop the ideas of the "information potential" and the "cross information potential," resulting in an effective and general method for the calculation of the empirical entropy and mutual information.

2.3 Quadratic Entropy and Information Potential
2.3.1 The Development of Information Potential
As mentioned in the previous section, the integration of Shannon's entropy with the Gaussian kernel estimation of the pdf becomes "inordinately difficult" [Vio95]. However, if we choose the quadratic entropy and notice the fact that the integral of the product of two Gaussian functions can still be evaluated as another Gaussian function, as (A.1) shows, then we can come up with a simple method. For a data set described in 2.2.2, we can use the Gaussian kernel method in (2.61) to estimate the pdf of $X$ and then calculate the "entropy 2-norm" as
$$V = \int_{-\infty}^{+\infty} f_X(x)^2\, dx = \int_{-\infty}^{+\infty}\left[\frac{1}{N}\sum_{i=1}^{N} G(x - a(i), \sigma^2)\right]\left[\frac{1}{N}\sum_{j=1}^{N} G(x - a(j), \sigma^2)\right] dx$$
$$= \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\int_{-\infty}^{+\infty} G(x - a(i), \sigma^2)\, G(x - a(j), \sigma^2)\, dx = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} G(a(i) - a(j), 2\sigma^2) \qquad (2.68)$$

So, Renyi's quadratic entropy and Havrda-Charvat's quadratic entropy lead to a much simpler entropy estimator for a set of discrete data points $\{a(i),\ i = 1, \ldots, N\}$:

$$H_{R2}(X \mid \{a\}) = -\log V, \qquad H_{h2}(X \mid \{a\}) = 1 - V, \qquad V = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} G(a(i) - a(j), 2\sigma^2) \qquad (2.69)$$

The combination of the quadratic entropies with the Parzen window method leads to an entropy estimator that computes the interactions among pairs of samples. Notice that there is no approximation in these evaluations except the pdf estimation.
We wrote (2.69) in this way because there is a very interesting physical interpretation for this estimator of entropy. Let us assume that we place physical particles in the locations prescribed by $a(i)$ and $a(j)$. Actually, the Parzen window method is just in the spirit of mass-energy. The integral of the product of two Gaussian kernels, representing some kind of mass density, can be regarded as the interaction between the particles $a(i)$ and $a(j)$, which results in the potential energy $G(a(i) - a(j), 2\sigma^2)$. Notice that it is always positive
and decreases with the squared distance between the particles. We can consider that a potential field exists for each particle in the data space, with a field strength defined by the Gaussian kernel; i.e., an exponential decay with the squared distance. In the real world, physical particles interact with a potential energy inverse to the distance between them, but here the potential energy abides by a different law, which in fact is determined by the kernel used in the pdf estimation. $V$ in (2.69) is the overall potential energy including each pair of data particles. As pointed out previously, these potential energies are related to "information" and thus are called "information potentials" (IP). Accordingly, data samples will be called "information particles" (IPT). Now, the entropy is expressed in terms of the potential energy, and entropy maximization becomes equivalent to the minimization of the information potential. This is again a surprising similarity to statistical mechanics, where the entropy maximization principle has as a corollary the energy minimization principle. It is a pleasant surprise to verify that the nonparametric estimation of entropy ends up with a principle that resembles that of the physical particle world, which was one of the origins of the concept of entropy.
We can also see from (2.68) and (2.69) that the Parzen window method implemented with the Gaussian kernel and coupled with Renyi's or Havrda-Charvat's entropy of higher order (α > 2) will compute each interaction among α-tuples of samples, providing even more information about the detailed structure and distribution of the data set.
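A minimal sketch of the information potential estimator (2.68)-(2.69) follows (my own Python/NumPy rendering of the formulas, not the dissertation's code; the kernel size sigma and the sample sets are illustrative choices).

```python
import numpy as np

def information_potential(data, sigma):
    """V = (1/N^2) * sum_{i,j} G(a(i) - a(j), 2*sigma^2), eqs. (2.68)/(2.69)."""
    a = np.atleast_2d(data)                        # samples a(i), shape (N, k)
    N, k = a.shape
    d = a[:, None, :] - a[None, :, :]              # pairwise differences a(i) - a(j)
    two_var = 2.0 * sigma**2
    norm = (2.0 * np.pi * two_var) ** (k / 2.0)
    v = np.exp(-np.sum(d**2, axis=-1) / (2.0 * two_var)) / norm   # interactions v(i,j)
    return v.sum() / N**2

def renyi_quadratic_entropy(data, sigma):
    """H_R2 = -log V, eq. (2.69)."""
    return -np.log(information_potential(data, sigma))

rng = np.random.default_rng(3)
tight = rng.normal(scale=0.1, size=(200, 1))       # concentrated samples
spread = rng.normal(scale=2.0, size=(200, 1))      # dispersed samples
# The concentrated set has the larger potential and hence the smaller entropy,
# consistent with "entropy maximization = information potential minimization".
print(information_potential(tight, 0.5), renyi_quadratic_entropy(tight, 0.5))
print(information_potential(spread, 0.5), renyi_quadratic_entropy(spread, 0.5))
```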
2.3.2 Information Force (IF)
Just like in mechanics, the derivative of the potential energy is a force, in this case an information-driven force that moves the data samples in the space of the interactions to change the distribution of the data and thus the entropy of the data. Therefore,
$$\frac{\partial}{\partial a(i)}\, G(a(i) - a(j), 2\sigma^2) = G(a(i) - a(j), 2\sigma^2)\,\frac{a(j) - a(i)}{2\sigma^2} \qquad (2.70)$$

can be regarded as the force that a particle in the position of sample $a(j)$ impinges upon $a(i)$, and will be called an information force. If all the data samples $a(i)$ are free to move in a certain region of the space, then the information forces between each pair of samples will drive all the samples to a state with minimum information potential. If we add all the contributions of the information forces from the ensemble of samples on $a(i)$, we have the overall effect of the information potential on sample $a(i)$; i.e.,

$$\frac{\partial V}{\partial a(i)} = -\frac{1}{N^2\sigma^2}\sum_{j=1}^{N} G(a(i) - a(j), 2\sigma^2)\,(a(i) - a(j)) \qquad (2.71)$$

The information force is the realization of the interaction among "information particles." The entropy will change towards the direction (for each information particle) of the information force. Accordingly, entropy maximization or minimization can be implemented in a simple and effective way.

2.3.3 The Calculation of Information Potential and Force
The above has introduced the concepts of the information potential and the information force. Here, the procedure for the calculation of the information potential and the information force will be given according to the formulas above. The procedure itself and the plot here may even help to further understand the ideas of the information potential and the information force.
To calculate the information potential and the information force, two matrices can be defined as in (2.72); their structures are illustrated in Figure 2-7.
$$D = \{d(i,j)\},\quad d(i,j) = a(i) - a(j); \qquad v = \{v(i,j)\},\quad v(i,j) = G(d(i,j), 2\sigma^2) \qquad (2.72)$$

Figure 2-7. The structure of the matrices D and V (each entry $d(i,j)$ is the pairwise difference $a(i) - a(j)$ between samples)

Notice that each element of $D$ is a vector in the space $R^n$ while each element of $v$ is a scalar. It is easy to show from the above that

$$V = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} v(i,j), \qquad f(i) = -\frac{1}{N^2\sigma^2}\sum_{j=1}^{N} v(i,j)\, d(i,j), \quad i = 1, \ldots, N \qquad (2.73)$$

where $V$ is the overall information potential and $f(i)$ is the force that $a(i)$ receives. We can also define the information potential for each particle $a(i)$ as $v(i) = \frac{1}{N}\sum_{j=1}^{N} v(i,j)$. Obviously, $V = \frac{1}{N}\sum_{i=1}^{N} v(i)$.
From this procedure, we can clearly see that the information potential relies on the difference between each pair of data points, and therefore makes full use of the information in their relative positions; i.e., the data distribution.
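A small sketch of the force computation (2.71)/(2.73), built on the same pairwise quantities as (2.72), is given below (again my own Python/NumPy illustration with arbitrary sample data; it simply evaluates the quantities defined above).

```python
import numpy as np

def information_force(data, sigma):
    """Returns V of (2.73) and f(i) = dV/da(i) of (2.71)/(2.73) for every sample."""
    a = np.atleast_2d(data)                        # samples a(i), shape (N, k)
    N, k = a.shape
    d = a[:, None, :] - a[None, :, :]              # difference matrix D, d(i,j) = a(i) - a(j)
    two_var = 2.0 * sigma**2
    norm = (2.0 * np.pi * two_var) ** (k / 2.0)
    v = np.exp(-np.sum(d**2, axis=-1) / (2.0 * two_var)) / norm   # matrix of v(i,j)
    V = v.sum() / N**2                                             # overall potential
    F = -(v[:, :, None] * d).sum(axis=1) / (N**2 * sigma**2)       # f(i) for each particle
    return V, F

rng = np.random.default_rng(4)
samples = rng.normal(size=(100, 2))
V, F = information_force(samples, sigma=0.5)
print(V, F.shape)      # scalar potential and one force vector per information particle
# Descending the potential, a(i) <- a(i) - eta * F[i], spreads the particles apart,
# decreasing V and thus increasing the quadratic entropy H_R2 = -log V.
```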
2.4 Quadratic Mutual Information and Cross Information Potential
2.4.1 QMI and Cross Information Potential (CIP)
For the given data set $\{a(i) = (a_1(i), a_2(i))^T,\ i = 1, \ldots, N\}$ of a variable $X = (x_1, x_2)^T$ described in 2.2.2, the joint and marginal pdfs can be estimated by the Gaussian kernel method as

$$f_{x_1 x_2}(x_1, x_2) = \frac{1}{N}\sum_{i=1}^{N} G(x_1 - a_1(i), \sigma^2)\, G(x_2 - a_2(i), \sigma^2)$$
$$f_{x_1}(x_1) = \frac{1}{N}\sum_{i=1}^{N} G(x_1 - a_1(i), \sigma^2), \qquad f_{x_2}(x_2) = \frac{1}{N}\sum_{i=1}^{N} G(x_2 - a_2(i), \sigma^2) \qquad (2.74)$$

Following the same procedure as in the development of the information potential, we can obtain the three terms in ED-QMI and CS-QMI based only on the given data set:

$$V_J = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} G(a(i) - a(j), 2\sigma^2) = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} G(a_1(i) - a_1(j), 2\sigma^2)\, G(a_2(i) - a_2(j), 2\sigma^2)$$
$$V_M = V_1 V_2, \qquad V_k = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} G(a_k(i) - a_k(j), 2\sigma^2), \quad k = 1, 2$$
$$V_c = \frac{1}{N}\sum_{i=1}^{N}\left[\frac{1}{N}\sum_{j=1}^{N} G(a_1(i) - a_1(j), 2\sigma^2)\right]\left[\frac{1}{N}\sum_{j=1}^{N} G(a_2(i) - a_2(j), 2\sigma^2)\right] \qquad (2.75)$$

If we define matrices similar to (2.72), then we have
$$D = \{d(i,j)\},\ d(i,j) = a(i) - a(j); \qquad D_k = \{d_k(i,j)\},\ d_k(i,j) = a_k(i) - a_k(j),\ k = 1, 2$$
$$v = \{v(i,j)\},\ v(i,j) = G(d(i,j), 2\sigma^2); \qquad v_k = \{v_k(i,j)\},\ v_k(i,j) = G(d_k(i,j), 2\sigma^2),\ k = 1, 2$$
$$v(i) = \frac{1}{N}\sum_{j=1}^{N} v(i,j), \qquad v_k(i) = \frac{1}{N}\sum_{j=1}^{N} v_k(i,j),\ k = 1, 2 \qquad (2.76)$$

where $v(i,j)$ is the information potential in the joint space, and thus is called the joint potential; $v_k(i,j)$ is the information potential in the marginal space, and thus is called the marginal potential; $v(i)$ is the joint information potential energy for the IPT $a(i)$; and $v_k(i)$ is the marginal information potential energy for the marginal IPT $a_k(i)$ in the marginal space indexed by $k$. Based on these quantities, the above three terms can be expressed as

$$V_J = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} v(i,j) = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} v_1(i,j)\, v_2(i,j), \qquad V_M = V_1 V_2$$
$$V_k = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} v_k(i,j),\ k = 1, 2, \qquad V_c = \frac{1}{N}\sum_{i=1}^{N} v_1(i)\, v_2(i) \qquad (2.77)$$

So, ED-QMI and CS-QMI can be expressed as
$$I_{ED}(x_1, x_2) = V_{ED} = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} v_1(i,j)\, v_2(i,j) - \frac{2}{N}\sum_{i=1}^{N} v_1(i)\, v_2(i) + V_1 V_2$$
$$I_{CS}(x_1, x_2) = V_{CS} = \log\frac{\left(\dfrac{1}{N^2}\displaystyle\sum_{i=1}^{N}\sum_{j=1}^{N} v_1(i,j)\, v_2(i,j)\right)(V_1 V_2)}{\left(\dfrac{1}{N}\displaystyle\sum_{i=1}^{N} v_1(i)\, v_2(i)\right)^2} \qquad (2.78)$$

From the above, we can see that both QMIs can be expressed as cross-correlations between the marginal information potentials at different levels: $v_1(i,j)\, v_2(i,j)$, $v_1(i)\, v_2(i)$ and $V_1 V_2$. Thus, the measure $V_{ED}$ is called the Euclidean distance cross information potential (ED-CIP), and the measure $V_{CS}$ is called the Cauchy-Schwartz cross information potential (CS-CIP).
The quadratic mutual information and the corresponding cross information potential can easily be extended to the case with multiple variables, e.g. $X = (x_1, \ldots, x_K)^T$. In this case, we have matrices similar to $D$ and $v$ and all the corresponding IPs and marginal IPs. Then we have the ED-QMI and CS-QMI and their corresponding ED-CIP and CS-CIP as follows:

$$I_{ED}(x_1, \ldots, x_K) = V_{ED} = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\prod_{k=1}^{K} v_k(i,j) - \frac{2}{N}\sum_{i=1}^{N}\prod_{k=1}^{K} v_k(i) + \prod_{k=1}^{K} V_k$$
$$I_{CS}(x_1, \ldots, x_K) = V_{CS} = \log\frac{\left(\dfrac{1}{N^2}\displaystyle\sum_{i=1}^{N}\sum_{j=1}^{N}\prod_{k=1}^{K} v_k(i,j)\right)\left(\displaystyle\prod_{k=1}^{K} V_k\right)}{\left(\dfrac{1}{N}\displaystyle\sum_{i=1}^{N}\prod_{k=1}^{K} v_k(i)\right)^2} \qquad (2.79)$$
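A compact sketch of the sample estimators (2.76)-(2.78) for two variables follows (my own Python/NumPy illustration, not the dissertation's code; the kernel size and the synthetic data are illustrative assumptions).

```python
import numpy as np

def pairwise_kernel(a, sigma):
    """v(i,j) = G(a(i) - a(j), 2*sigma^2), the marginal interaction matrix of (2.76)."""
    a = np.asarray(a, dtype=float)
    if a.ndim == 1:
        a = a[:, None]
    d = a[:, None, :] - a[None, :, :]
    two_var = 2.0 * sigma**2
    k = a.shape[1]
    norm = (2.0 * np.pi * two_var) ** (k / 2.0)
    return np.exp(-np.sum(d**2, axis=-1) / (2.0 * two_var)) / norm

def quadratic_mutual_information(a1, a2, sigma):
    """Sample estimates of ED-QMI and CS-QMI via the cross information potential, eq. (2.78)."""
    v1, v2 = pairwise_kernel(a1, sigma), pairwise_kernel(a2, sigma)
    V_J = np.mean(v1 * v2)                       # joint potential, (2.77)
    V_1, V_2 = np.mean(v1), np.mean(v2)          # marginal potentials
    v1_i, v2_i = v1.mean(axis=1), v2.mean(axis=1)
    V_c = np.mean(v1_i * v2_i)                   # cross potential
    V_ED = V_J - 2.0 * V_c + V_1 * V_2           # ED-QMI estimate
    V_CS = np.log(V_J * V_1 * V_2 / V_c**2)      # CS-QMI estimate
    return V_ED, V_CS

rng = np.random.default_rng(5)
x = rng.normal(size=400)
independent = rng.normal(size=400)               # unrelated to x
dependent = x + 0.1 * rng.normal(size=400)       # strongly related to x
print(quadratic_mutual_information(x, independent, sigma=0.5))  # both small
print(quadratic_mutual_information(x, dependent, sigma=0.5))    # both clearly positive
```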
2.4.2 Cross Information Forces (CIF)
The cross information potential is more complex than the information potential. Three different terms (or potentials) contribute to the cross information potential. So, the force that one data point $a(i)$ receives comes from these three sources. A force in the joint space can be decomposed into marginal components. The marginal force in each marginal space should be considered separately to simplify the analysis. The cases of ED-CIP and CS-CIP are different; they should also be considered separately. Only the cross information potential between two variables will be dealt with here. The case of multiple variables can readily be obtained in a similar way.
First, let's look at the CIF of ED-CIP, $\partial V_{ED}/\partial a_k(i)$ $(k = 1, 2)$. By a derivation procedure similar to that of the information force in the IP field, we can obtain the following:

$$C_k = \{c_k(i,j)\}, \quad c_k(i,j) = v_k(i,j) - v_k(i) - v_k(j) + V_k, \quad k = 1, 2$$
$$f_k(i) = \frac{\partial V_{ED}}{\partial a_k(i)} = -\frac{1}{N^2\sigma^2}\sum_{j=1}^{N} c_l(i,j)\, v_k(i,j)\, d_k(i,j), \qquad i = 1, \ldots, N,\ k = 1, 2,\ l \ne k \qquad (2.80)$$

where $d_k(i,j)$, $v_k(i,j)$, $v_k(i)$ and $V_k$ are defined as before, and $C_k$ are cross matrices which serve as force modifiers.
For the CIF of CS-CIP, similarly, we have

$$f_k(i) = \frac{\partial V_{CS}}{\partial a_k(i)} = \frac{1}{V_J}\frac{\partial V_J}{\partial a_k(i)} - \frac{2}{V_c}\frac{\partial V_c}{\partial a_k(i)} + \frac{1}{V_k}\frac{\partial V_k}{\partial a_k(i)}$$
$$= -\frac{1}{\sigma^2}\left[\frac{\displaystyle\sum_{j=1}^{N} v_1(i,j)\, v_2(i,j)\, d_k(i,j)}{\displaystyle\sum_{i=1}^{N}\sum_{j=1}^{N} v_1(i,j)\, v_2(i,j)} + \frac{\displaystyle\sum_{j=1}^{N} v_k(i,j)\, d_k(i,j)}{\displaystyle\sum_{i=1}^{N}\sum_{j=1}^{N} v_k(i,j)} - \frac{\displaystyle\sum_{j=1}^{N}\left(v_l(i) + v_l(j)\right) v_k(i,j)\, d_k(i,j)}{N\displaystyle\sum_{i=1}^{N} v_1(i)\, v_2(i)}\right] \qquad (2.81)$$
Figure 2-8. Illustration of "real IPT" (points $(a_1(i), a_2(i))^T$) and "virtual IPT" (points $(a_1(i), a_2(j))^T$), together with the marginal IPTs $a_1(i)$ and $a_2(j)$, in the joint space $(x_1, x_2)$

2.4.3 An Explanation of QMI
Another way to look at the CIP comes from the expression of the factorized marginal pdfs. From the above, we have

$$f_{x_1}(x_1)\, f_{x_2}(x_2) = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} G(x_1 - a_1(i), \sigma^2)\, G(x_2 - a_2(j), \sigma^2) \qquad (2.82)$$

This suggests that in the joint space there are $N^2$ "virtual IPTs" $(a_1(i), a_2(j))^T$, $i, j = 1, \ldots, N$, whose pdf estimated by the Parzen window method is exactly the factorized marginal pdf of the "real IPTs." The relation between all types of IPTs is illustrated in Figure 2-8.
From the above description, we can see that the ED-CIP is the square of the Euclidean distance between the real IP field (formed by the real IPTs) and the virtual IP field (formed by the virtual IPTs), and the CS-CIP is related to the angle between the real IP field and the virtual IP field, as Figure 2-5 shows. When the real IPTs are organized such that each virtual IPT has at least one real IPT in the same position, the CIP is zero and the two marginal variables $x_1$
and $x_2$ are statistically independent; when the real IPTs are distributed along a diagonal line, the difference between the distribution of the real IPTs and the virtual IPTs is maximized. Two extreme cases are illustrated in Figure 2-9 and Figure 2-10. It should be noticed that $x_1$ and $x_2$ are not necessarily scalars. Actually, they can be multidimensional variables, and their dimensions can even be different. CIPs are general measures of the statistical relation between two variables (based merely on the given data).
Figure 2-9. Illustration of Independent IPTs
Figure 2-10. Illustration of Highly Correlated Variables
x2
x1 x2
x1
x2
a2 i( )
a1 i( )
a1 i( ) a2 j( ),( )Ta1 i( ) a2 i( ),( )T
real IPT virtual IPT
marginal IPT
a2 j( )
x1
x2
a2 i( )
a1 i( )
a1 i( ) a2 j( ),( )Ta1 i( ) a2 i( ),( )T
real IPT virtual IPT
marginal IPT
a2 j( )
CHAPTER 3
LEARNING FROM EXAMPLES

A learning machine is usually a network. Neural networks are of particular interest in this dissertation. Actually, almost all adaptive systems can be regarded as network models, no matter whether they are linear or nonlinear, feedforward or recurrent. In this sense, the learning machines studied here are neural networks. So, learning, in this circumstance, is a process by which the free parameters of a neural network are adapted through a process of stimulation by the environment in which the network is embedded [Men70]. The environmental stimulation, as pointed out in Chapter 1, is usually in the form of "examples," and thus learning is about how to obtain information from "examples." "Learning from examples" is the topic of this chapter, which will include a review and discussion of learning systems, learning mechanisms, the information-theoretic viewpoint on learning, "learning from examples" by the information potential, and finally a discussion on generalization.

3.1 Learning System
According to the abstract model described in Chapter 1, a learning system is a mapping network. The flexibility of the mapping highly depends on the structure of the system. The structure of several typical network systems will be reviewed in this section.
Network models can basically be divided into two categories: static models and dynamic models. The static model can also be called a memory-less model. In a network,
memory about the signal past is obtained by using delayed connections (connections through delay units). (In the continuous-time case, delay connections become feedback connections. In this dissertation, only discrete-time signals and systems are studied.) Generally
speaking, if there are delay units in a network, then the network will have memory. For
instance, the transversal filter [Hay96, Wid85, Hon84], the general IIR filter [Hay96,
Wid85, Hon84], the time delay neural network (TDNN) [Lan88, Wai89], the gamma neu-
ral network [deV92, Pri93], the general recurrent neural networks [Hay98, Hay94], etc.
are all dynamic network systems with memory or delay connections. If a network has
delay connections, it has to be described by difference equations (in the continuous time
case, differential equations), while a static network can be expressed by algebraic equa-
tions (linear or nonlinear).
There is also another taxonomy for the structure of learning or adaptive systems: for instance, linear models and nonlinear models form another such division. The following will start with the static linear model.
3.1.1 Static Models
E. Linear Model
Possibly, the simplest mapping network structure is the linear model. Mathematically,
it is a linear transformation. As shown in Figure 3-1, the input and output relation of the
network is defined by (3.1).
y = w^T x,\quad y = (y_1, \ldots, y_k)^T \in R^k,\quad x \in R^m,\quad w = (w_1, \ldots, w_k) \in R^{m \times k},\quad w_i \in R^m     (3.1)
where x is the input signal and y is the output signal, and w is the linear transformation matrix whose columns w_i (i = 1, \ldots, k) are vectors. Each output, or group of outputs, defines a subspace of the input signal space. Eigenanalysis (principal component analysis) [Oja82, Dia96, Kun94, Dud73, Dud98] and generalized eigenanalysis [XuD98, Cha97, Dud73, Dud98] seek the signal subspace with maximum signal-to-noise ratio (SNR) or signal-to-signal ratio. For pattern classification, subspace methods such as Fisher discriminant analysis are also very useful tools [Oja82, Dud73, Dud98]. Linear models can also be used for inverse problems such as BSS and ICA [Com94, Cao96, Car98b, Bel95, Dec96, Car97, Yan97]. The linear model is simple, and it is very effective for a wide range of problems. The understanding of the learning behavior of a linear model may also help the understanding of nonlinear systems.
Figure 3-1. Linear Model
F. Multilayer Perceptron (MLP)
The multilayer perceptron is the extension of the perceptron model [Ros58, Ros62, Min69]. The perceptron is similar to the linear model in Figure 3-1 but with a nonlinear function in each output node, e.g. the hard-limit function f(x) = 1 for x \ge 0 and f(x) = -1 for x < 0. The perceptron initiated the mathematical analysis of learning and it is the first machine which learns directly from examples [Vap95]. Although the perceptron demonstrated an amazing learning ability, its performance is still limited by its single-layer structure [Min69]. The MLP extends the perceptron by putting more layers in the network structure as shown in Figure 3-2. For the ease of mathematical analysis, the nonlinear function in each node is usually a continuous differentiable function, e.g. the sigmoid function f(x) = 1 / (1 + e^{-x}). (3.2) gives a typical input-output relation of the network in Figure 3-2:
z_i = f(w_i^T x + b_i),\quad i = 1, \ldots, l;\qquad y_j = f(v_j^T z + a_j),\quad z = (z_1, \ldots, z_l)^T,\quad j = 1, \ldots, k     (3.2)
where b_i and a_j are the biases for the nodes z_i and y_j respectively, and w_i \in R^m and v_j \in R^l are the linear projections for the nodes z_i and y_j respectively. The layer of z nodes is called the hidden layer, which is neither input nor output. MLPs may have more than one hidden layer. The nonlinear function f(\cdot) may be different for different nodes. Each node in an MLP is a simple processing element which is abstracted functionally from a real neuron cell, called the McCulloch-Pitts model [Hay98, Ru86a]. Collective behavior emerges when these simple elements are connected with each other to form a network whose overall function can be very complex [Ru86a].

One of the most appealing properties of the MLP is its universal approximation ability. It has been shown that as long as there are enough hidden nodes, an MLP can approximate any functional mapping [Hec87, Gal88, Hay94, Hay98]. Since a learning system is nothing but a mapping from an abstract point of view, the universal approximation property of
the MLP is a very desirable feature for a learning system. This is one reason why the MLP is so popular. The MLP is a kind of "global" model whose basic building block is a hyperplane, which is the projection represented by the sum of the products at each node. The nonlinear function at each node distorts its hyperplane into a ridge function which also serves as a selector. So, the overall functional surface of an MLP is the combination of these ridge functions. The number of hidden nodes provides the number of ridge functions. Therefore, as long as the number of nodes is large enough, the overall functional surface can approximate any mapping. This is an intuitive understanding of the universal approximation property of the MLP.
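For concreteness, the following Python sketch implements the forward pass of (3.2) for one hidden layer. It is only an illustrative sketch; the array names (W, b, V, a) are arbitrary choices, not the dissertation's notation.

import numpy as np

def sigmoid(u):
    # f(u) = 1 / (1 + e^{-u})
    return 1.0 / (1.0 + np.exp(-u))

def mlp_forward(x, W, b, V, a):
    # One-hidden-layer MLP of (3.2): z_i = f(w_i^T x + b_i), y_j = f(v_j^T z + a_j).
    # W is m x l (columns are the w_i), V is l x k (columns are the v_j).
    z = sigmoid(W.T @ x + b)      # hidden layer, length l
    y = sigmoid(V.T @ z + a)      # output layer, length k
    return y, z

# tiny usage example with random parameters
rng = np.random.default_rng(0)
m, l, k = 3, 5, 2
y, z = mlp_forward(rng.normal(size=m), rng.normal(size=(m, l)),
                   rng.normal(size=l), rng.normal(size=(l, k)), rng.normal(size=k))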
Figure 3-2. Multilayer Perceptron
G. Radial-Basis Function (RBF)
As shown in Figure 3-3, the RBF network has two layers. The hidden layer is the nonlinear layer, whose input-output relation is a radial-basis function, e.g. the Gaussian function z_i = e^{-\|x - \mu_i\|^2 / (2\sigma_i^2)}, where \mu_i is the mean (center) of the Gaussian function and determines the location of the Gaussian function in the input space, and \sigma_i^2 is the variance of the Gaussian function and determines the shape or sharpness of the Gaussian function. The output layer is a linear layer. So the overall input-output relation of the network can be expressed as

z_i = e^{-\|x - \mu_i\|^2 / (2\sigma_i^2)},\quad i = 1, \ldots, l;\qquad y_j = w_j^T z,\quad z = (z_1, \ldots, z_l)^T,\quad j = 1, \ldots, k     (3.3)

where w_j are the linear projections, and \sigma_i^2 and \mu_i are the same as above.
Figure 3-3. Radial-Basis Function Network (RBF Network)
The RBF network is also a universal approximator if the number of hidden nodes is large enough [Pog90, Par91, Hay98]. However, unlike the MLP, the basic building block is not a "global" function but a "local" one such as the Gaussian function. The overall
mapping surface is approximated by the linear combination of such "local" surfaces. Intuitively, we can also imagine that any shape of the mapping surface can be approximated by the linear combination of small pieces of local surfaces if there are enough such basic building blocks. The RBF network is also an optimal regularization function [Pog90, Hay98]. It has been applied as extensively as the MLP in various areas.
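As an illustration of (3.3), here is a minimal Python sketch of a Gaussian RBF forward pass. The variable names (centers, sigmas, Wout) are assumptions made for this sketch rather than the dissertation's notation.

import numpy as np

def rbf_forward(x, centers, sigmas, Wout):
    # Gaussian RBF network of (3.3): z_i = exp(-||x - mu_i||^2 / (2 sigma_i^2)), y = Wout^T z.
    # centers is l x m (one center per row), sigmas has length l, Wout is l x k.
    d2 = np.sum((centers - x) ** 2, axis=1)     # squared distances ||x - mu_i||^2
    z = np.exp(-d2 / (2.0 * sigmas ** 2))       # local Gaussian responses
    return Wout.T @ z                           # linear output layer

# usage with random parameters
rng = np.random.default_rng(1)
y = rbf_forward(rng.normal(size=4), rng.normal(size=(6, 4)),
                np.ones(6), rng.normal(size=(6, 2)))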
3.1.2 Dynamic Models
H. Transversal Filter
The transversal filter, also referred to as a tapped-delay line filter or FIR filter, consists of two parts (as depicted in Figure 3-4): (1) the tapped-delay line, and (2) the linear projection. The input-output relation can be expressed as
y(n) = \sum_{i=0}^{q} w_i x(n-i) = w^T x,\quad w = (w_0, \ldots, w_q)^T,\quad x = (x(n), \ldots, x(n-q))^T     (3.4)
where w_i are the parameters of the filter. Because of its versatility and ease of implementation, the transversal filter has become an essential signal processing structure in a wide variety of applications [Hay96, Hon84].
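A minimal Python sketch of the transversal filter of (3.4) follows; it is only an illustrative direct implementation (a library convolution routine would give the same result).

import numpy as np

def transversal_filter(x, w):
    # FIR filter of (3.4): y(n) = sum_{i=0}^{q} w_i * x(n - i), with x(n) = 0 for n < 0
    q = len(w) - 1
    y = np.zeros(len(x))
    for n in range(len(x)):
        for i in range(q + 1):
            if n - i >= 0:
                y[n] += w[i] * x[n - i]
    return y

# usage: a 3-tap moving average
y = transversal_filter(np.arange(10.0), np.array([1/3, 1/3, 1/3]))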
Figure 3-4. Transversal Filter
Figure 3-5. Gamma Filter
I. Gamma Model
As shown in Figure 3-5, the gamma filter is similar to the transversal filter except that the tapped delay line is replaced by the gamma memory line [deV92, Pri93]. The gamma memory is a delay tap with feedback. The transfer function of one tap of the gamma memory (p = 1) is

G(z) = \frac{\mu z^{-1}}{1 - (1 - \mu) z^{-1}} = \frac{\mu}{z - (1 - \mu)}     (3.5)

The corresponding impulse response is the gamma function with the single parameter \mu:

g(n) = \mu (1 - \mu)^{n-1},\quad n \ge 1     (3.6)

For the p-th tap of the gamma memory line, the transfer function and its impulse response (the gamma function) are

G_p(z) = \left( \frac{\mu}{z - (1 - \mu)} \right)^p,\qquad g_p(n) = \binom{n-1}{p-1} \mu^p (1 - \mu)^{n-p},\quad n \ge p     (3.7)
Compared with the tapped delay line, the gamma memory line is a recursive structure and has an impulse response of infinite length. Therefore, the "memory depth" can be adjusted by the parameter \mu instead of being fixed by the number of taps in the tapped delay line. Compared with the general IIR filter, the analysis of the stability of the gamma memory is simple. When 0 < \mu < 2, the gamma memory line is stable (everywhere in the line).
Also, when \mu = 1, the gamma memory line becomes the tapped delay line. So, the gamma memory line is a generalization of the tapped delay line. The gamma filter is a good compromise between the FIR filter and the IIR filter. It has been widely applied to a variety of signal processing and pattern recognition problems.
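The following Python sketch runs a gamma memory line sample by sample. The state-update form used here (each tap is a leaky integrator fed by the previous tap, which realizes the per-tap transfer function (3.5)) is a standard realization adopted here as an assumption; the weight on the raw input is also an illustrative choice.

import numpy as np

def gamma_filter(x, w, mu):
    # Gamma filter: a cascade of gamma memory taps followed by a linear projection.
    # Each tap obeys x_p(n) = (1 - mu) * x_p(n-1) + mu * x_{p-1}(n-1), i.e. eq. (3.5),
    # and the output is y(n) = sum_p w_p * x_p(n) with x_0(n) = x(n).
    q = len(w) - 1                       # taps 0..q (tap 0 is the raw input)
    taps = np.zeros(q + 1)
    y = np.zeros(len(x))
    for n in range(len(x)):
        # update taps q..1 from the previous time step, then load the new input
        for p in range(q, 0, -1):
            taps[p] = (1.0 - mu) * taps[p] + mu * taps[p - 1]
        taps[0] = x[n]
        y[n] = w @ taps
    return y

# usage: mu = 1 reduces the gamma memory line to an ordinary tapped delay line
y = gamma_filter(np.arange(8.0), np.array([0.5, 0.3, 0.2]), mu=0.6)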
J. The All Pole IIR Filter
Figure 3-6. The All Pole IIR Filter
As shown in Figure 3-6, the all-pole IIR filter is composed of only the delayed feedback and there are no feedforward connections in the network structure. The transfer function of the filter is

H(z) = \frac{1}{1 - \sum_{i=1}^{n} w_i z^{-i}}     (3.8)

Obviously, this is the inverse system of the FIR filter H(z) = 1 - \sum_{i=1}^{n} w_i z^{-i}, which has been used in deconvolution problems [Hay94a]. There is also a counterpart for the two-input, two-output system, which has been used in blind source separation problems [Ngu95, Wan96]. In general, this type of filter may be very useful in inverse or system identification problems.
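A minimal Python sketch of the all-pole recursion of (3.8) follows, together with a tiny deconvolution example; it is illustrative only.

import numpy as np

def all_pole_filter(x, w):
    # All-pole IIR filter of (3.8): y(n) = x(n) + sum_{i=1}^{p} w_i * y(n - i)
    p = len(w)
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = x[n]
        for i in range(1, p + 1):
            if n - i >= 0:
                y[n] += w[i - 1] * y[n - i]
    return y

# usage: cascading the FIR filter 1 - 0.5 z^{-1} with this all-pole filter recovers the input
x = np.random.default_rng(2).normal(size=16)
v = x - 0.5 * np.concatenate(([0.0], x[:-1]))       # FIR: v(n) = x(n) - 0.5 x(n-1)
x_rec = all_pole_filter(v, np.array([0.5]))          # inverse system restores x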
K. TDNN and Gamma Neural Network
In an MLP, each connection is instantaneous and there is no temporal structure in it. If the instantaneous connections are replaced by filters, then each node will have the ability to process time signals. The time delay neural network (TDNN) is formed by replacing the connections in the MLP with transversal filters [Lan88, Wai89]. The gamma neural network is the result of replacing the connections in the MLP with gamma filters [deV92, Pri93]. These types of neural networks extend the ability of the MLP.
Figure 3-7. Multilayer Perceptron with Delayed Connections
L. General Recurrent Neural Network
A general nonlinear dynamic system is the multilayer perceptron with some delayed connections. As Figure 3-7 shows, for instance, the output of node z_l relies on the previous output of node y_k:

z_l(n) = f(w_l^T x(n) + b_l + d\, y_k(n-1))     (3.9)
There may be some other nodes which have similar delayed connections. This type of neural network is powerful but complicated. It is difficult to analyze its adaptation, although its flexibility and potential are high.
3.2 Learning Mechanisms
The central part of a learning mechanism is the criterion. The range of application of a
learning system may be very broad. For instance, a learning system or adaptive signal pro-
cessing system can be used for data compression, encoding or decoding signals, noise or
echo cancellation, source separation, signal enhancement, pattern classification, system
identification and control, etc. However, the criteria to achieve such diverse purposes
can be basically divided into only two types: one is based on the energy measures; the
other is based on information measures. As pointed out in Chapter 2, the energy measures
can be regarded as special cases of information measures. In the following, various energy
measures and information measures will be discussed.
Once the criterion of a system is determined, the task left is to adjust the parameters of
the system so as to optimize the criterion. There are a variety of optimization techniques.
The gradient method is perhaps the simplest but it is a general method [Gil81, Hes80,
Wid85] which is based on the first order approximation of the performance surface. Its on-
line version--the stochastic gradient method [Wid63] is widely used in adaptive and learn-
ing systems. Newton's method [Gil81, Hes80, Wid85] is a more sophisticated method which is based on the second order approximation of the performance surface. Its varied version--the conjugate gradient method [Hes80]--will avoid the calculation of the inverse of the Hessian matrix and thus is computationally more efficient [Hes80]. There are also other techniques which are efficient for specific applications. For instance, the Expecta-
tion and Maximization algorithm for the maximum likelihood estimation or a class of non-
negative function maximization [Dem77, Mcl96, XuD95, XuD96]. The natural gradient
method by means of information geometry is used in the case where the parameter space
is constrained [Ama98]. In the following, various techniques will also be briefly reviewed.
3.2.1 Learning Criteria
• MSE Criterion
The mean squared error (MSE) criterion is one of the most widely used criteria. For the learning system described in Chapter 1, if the given environmental data is \{(x(n), d(n)),\ n = 1, \ldots, N\}, where x(n) is the input signal and d(n) is the desired signal, then the output signal is y(n) = q(x(n), W) and the error signal is e(n) = d(n) - y(n). The MSE criterion can be defined as

J = \frac{1}{2} \sum_{n=1}^{N} e(n)^2 = \frac{1}{2} \sum_{n=1}^{N} (d(n) - y(n))^2     (3.10)

It is basically the squared Euclidean distance between the desired signal d(n) and the output signal y(n) from the geometrical point of view, and the energy of the error signal e(n) from the point of view of the energy and entropy measures. Minimization of the MSE criterion will result in the output signal closest to the desired signal in the Euclidean distance sense. As mentioned in Chapter 2, if we assume the error signal is white Gaussian with zero mean, then the minimization of the MSE is equivalent to the minimization of the entropy of the error signal.
For a multiple-output system, i.e. when the output signal and the desired signal are multidimensional, the error signal is also multidimensional and the definition of the MSE criterion is the same as described in Chapter 2.
• Signal-to-Noise Ratio (SNR)
The signal-to-noise ratio is also a frequently used criterion in the signal processing
area. The purpose of many signal processing systems is to enhance the SNR. A well
known example is the principal component analysis (PCA), where a linear projection
is desired such that the SNR in the output is maximized (when the noise is assumed to
be white Gaussian). For the linear model described above , ,
and , if the input is zero-mean and its covariance matrix is ,
then the output power (short time energy) is . If the
input is --a zero-mean white Gaussian noise with covariance matrix being iden-
tity matrix , then the output power of the noise is . The SNR in the output of the
linear projection will be
(3.11)
From the information-theoretic point of view, the entropy of the output will be
(3.12)
where the input signal is assumed zero-mean Gaussian signal. Then the entropy dif-
ference is
y wTx= y R∈ x R
m∈
w Rm∈ x Rx E xx
T[ ]=
E y2[ ] w
TE xx
T[ ]w wTRxw= =
xnoise
I wTw
Jw
TRxw
wTw
-----------------=
H wTxnoise( ) 1
2--- w
Tw( )log
12--- 2πlog
12---+ +=
H wTx( ) 12--- w
TRxw( )log
12--- 2πlog
12---+ +=
x
J = H(w^T x) - H(w^T x_{noise}) = \frac{1}{2}\log\frac{w^T R_x w}{w^T w}     (3.13)
which is equivalent to the SNR criterion. The solution to this problem is the eigenvector that corresponds to the largest eigenvalue of R_x.

The PCA problem can also be formulated as the minimum reconstruction MSE problem [Kun94]:

J = E[\| w w^T x - x \|^2]     (3.14)
(3.14) can also be regarded as an auto-association problem in a two-layer network with
the constraints that the two layer weights should be dual with each other (i.e. one is the
transpose of the other). The minimization solution to (3.14) is equivalent to the maxi-
mization solution to (3.12) or (3.13).
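As an illustrative numerical check of the claim that maximizing (3.11)/(3.13) yields the principal eigenvector of R_x, the following Python sketch estimates R_x from data and compares the Rayleigh quotient of the top eigenvector with that of a random projection. It is only a sketch, not code from the dissertation.

import numpy as np

def snr_criterion(w, Rx):
    # Rayleigh quotient of (3.11): J = (w^T Rx w) / (w^T w)
    return (w @ Rx @ w) / (w @ w)

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 4)) @ np.diag([3.0, 1.0, 0.5, 0.2])   # zero-mean data, N x m
Rx = X.T @ X / len(X)                                            # covariance estimate
eigvals, eigvecs = np.linalg.eigh(Rx)
w_pca = eigvecs[:, -1]                                           # eigenvector of the largest eigenvalue
print(snr_criterion(w_pca, Rx), snr_criterion(rng.normal(size=4), Rx))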
• Signal-to-Signal Ratio
For the same linear network, if the input signal is switched between two zero-mean signals x_1 and x_2, then the signal-to-signal ratio in the output of the linear projection will be

J = \frac{w^T R_{x_1} w}{w^T R_{x_2} w}     (3.15)

where R_{x_1} is the covariance matrix of x_1 and R_{x_2} is the covariance matrix of x_2. The maximization of this criterion is to enhance the signal x_1 in the output and to attenuate the signal x_2 at the same time. From the information-theoretic point of view, if both signals are Gaussian signals, then the entropy difference in the output will be
J = H(w^T x_1) - H(w^T x_2) = \frac{1}{2}\log\frac{w^T R_{x_1} w}{w^T R_{x_2} w}     (3.16)
which is equivalent to a signal-to-signal ratio. The maximization solution to (3.15) or
(3.16) is the generalized eigenvector with the largest generalized eigenvalue:
R_{x_1} w_{optimal} = \lambda_{max} R_{x_2} w_{optimal}     (3.17)
[Cha97] also shows that when this criterion is applied to classification problems, it can be formulated as a heteroassociation problem with an MSE criterion and a constraint.
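The solution (3.17) can be obtained numerically with a generalized symmetric eigensolver; the short Python sketch below (using scipy.linalg.eigh, an assumption about tooling rather than anything used in the dissertation) maximizes the signal-to-signal ratio of (3.15).

import numpy as np
from scipy.linalg import eigh

def max_signal_to_signal(R1, R2):
    # Generalized eigendecomposition of (3.17): R1 w = lambda R2 w.
    # Returns the projection maximizing (3.15) and its generalized eigenvalue.
    eigvals, eigvecs = eigh(R1, R2)          # ascending generalized eigenvalues
    return eigvecs[:, -1], eigvals[-1]

# usage with two random positive definite covariance matrices
rng = np.random.default_rng(4)
A, B = rng.normal(size=(5, 5)), rng.normal(size=(5, 5))
R1, R2 = A @ A.T + np.eye(5), B @ B.T + np.eye(5)
w, lam = max_signal_to_signal(R1, R2)
print(lam, (w @ R1 @ w) / (w @ R2 @ w))      # the two values agree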
• The Maximum Likelihood
The maximum likelihood estimation has been widely used in parametric model estimation [Dud98, Dud73]. It has also been extensively applied to "learning from examples." For instance, the hidden Markov model has been successfully applied in the speech recognition problem [Rab93, Hua90]. Training of most hidden Markov models is based on maximum likelihood estimation. In general, suppose there is a statistical model p(z, w), where z is a random variable and w is a set of parameters, and the true probability distribution is q(z) but unknown. The problem is to find w so that p(z, w) is the closest to q(z). We can simply apply the information cross-entropy criterion, i.e. the Kullback-Leibler criterion, to the problem:

J(w) = \int q(z) \log\frac{q(z)}{p(z, w)}\, dz = -E[\log p(z, w)] + H_s(z)     (3.18)

where H_s(z) is the Shannon entropy of z, which does not depend on the parameters w, and L(w) = E[\log p(z, w)] is exactly the log likelihood function of p(z, w). So, the minimization of (3.18) is equivalent to the maximization of the log likelihood function L(w). In other words, the maximum likelihood estimation is exactly the same as the
minimum Kullback-Leibler cross-entropy between the true probability distribution
and the model probability distribution [Ama98].
• The Information-Theoretic Measures for BSS and ICA

As introduced in Chapter 2, the maximization of the output entropy and the minimization of the mutual information between the outputs can be used in BSS and ICA problems. We will deal with this case in more detail later.
3.2.2 Optimization Techniques
• The Back-Propagation Algorithm
In general, for a function f: R^m \to R, J = f(w), the gradient \partial J / \partial w is the steepest ascent direction for J, and -\partial J / \partial w is the steepest descent direction for J, and the first order approximation of the function at w = w_n is

J = f(w_n) + \Delta w^T \left. \frac{\partial J}{\partial w} \right|_{w = w_n}     (3.19)

So, for the maximization of the function, the updating of w can be accomplished along the steepest ascent direction, i.e. w_{n+1} = w_n + \mu \left. \frac{\partial J}{\partial w} \right|_{w = w_n}, where \mu is the step size. For the minimization of the function, the updating rule can be along the steepest descent direction, i.e. w_{n+1} = w_n - \mu \left. \frac{\partial J}{\partial w} \right|_{w = w_n} [Wid85]. If the gradient \partial J / \partial w can be expressed as a summation over data samples, such as in the case of the MSE criterion J = \sum_{n=1}^{N} J(n), J(n) = \frac{1}{2}(d(n) - y(n))^2, then each datum can be used to update the parameter w whenever it appears, i.e. w_{n+1} = w_n \pm \mu\, \partial J(n) / \partial w. This is called the stochastic gradient method [Wid63].
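A minimal Python sketch of the batch and stochastic gradient updates for the MSE criterion of a linear model (y(n) = w^T x(n)) follows; the linear model is chosen only to keep the example short, and the function names are illustrative.

import numpy as np

def batch_gradient_step(w, X, d, mu):
    # gradient descent on J = 0.5 * sum_n (d(n) - w^T x(n))^2
    e = d - X @ w
    return w + mu * (X.T @ e)

def stochastic_gradient_step(w, x_n, d_n, mu):
    # LMS-style update using a single datum (x(n), d(n))
    e_n = d_n - w @ x_n
    return w + mu * e_n * x_n

# usage: learn w_true from noisy samples
rng = np.random.default_rng(5)
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(200, 3)); d = X @ w_true + 0.01 * rng.normal(size=200)
w = np.zeros(3)
for n in range(len(X)):
    w = stochastic_gradient_step(w, X[n], d[n], mu=0.05)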
For the MLP network described above, the MSE criterion is still J = \sum_{n=1}^{N} J(n). Let's look at a simple case with only one output node: y = f(v^T z + a), v = (v_1, \ldots, v_l)^T, z = (z_1, \ldots, z_l)^T, z_i = f(w_i^T x + b_i), i = 1, \ldots, l. Then by the chain rule, we have

\frac{\partial J}{\partial v} = \sum_{n=1}^{N} \frac{\partial J(n)}{\partial v} = \sum_{n=1}^{N} \frac{\partial J(n)}{\partial y(n)} \frac{\partial y(n)}{\partial v}     (3.20)

We can see from this equation that the key point here is how to calculate the sensitivity of the network output \partial y(n) / \partial v. The term \partial J(n) / \partial y(n) in the MSE case is the error signal \partial J(n) / \partial y(n) = e(n) = y(n) - d(n). The sensitivity can then be regarded as a mechanism which will propagate the error back to the parameters v or w_i. To be more specific, we have (3.21) if we consider the relation dy/dx = y(1 - y) for the sigmoid function y = f(x) = 1 / (1 + e^{-x}) and apply the chain rule to the problem:

\xi(n) = 1 \cdot y(n)(1 - y(n)),\quad \frac{\partial y(n)}{\partial v} = \xi(n)\, z,\quad \frac{\partial y(n)}{\partial z} = \xi(n)\, v,\quad \zeta(n) = \frac{\partial y(n)}{\partial z} \bullet [z(n)(1 - z(n))],\quad \frac{\partial y(n)}{\partial w_i} = \zeta_i(n)\, x(n)     (3.21)

where \bullet is the operator for component-wise multiplication. The process of (3.21) is a linear process which back-propagates through the "dual network" system back to each parameter and thus is called "back-propagation." If we need to back-propagate an error e(n), then the 1 in \xi(n) of (3.21) will be replaced by e(n), and (3.21) will be called the "error back-propagation." Actually, the "error back-propagation" is nothing but the gradient method implementation with the calculation of the gradient by the
chain rule applied to the network structure. The effectiveness of the "back-propagation" is its locality in calculation by utilizing the topology of the network. It is significant for engineering implementations. For a detailed description, one can refer to Rumelhart et al. [Ru86b, Ru86c].
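The following Python sketch implements (3.20)-(3.21) for the single-output sigmoid MLP discussed above. It is a minimal illustrative implementation with arbitrary variable names, not the dissertation's code.

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def backprop_single_output(x, d, W, b, v, a):
    # Gradients of J(n) = 0.5*(d - y)^2 for y = f(v^T z + a), z_i = f(w_i^T x + b_i).
    # This realizes the error back-propagation of (3.20)-(3.21).
    z = sigmoid(W.T @ x + b)                 # hidden layer
    y = sigmoid(v @ z + a)                   # scalar output
    e = y - d                                # dJ/dy, the error signal
    xi = e * y * (1.0 - y)                   # error through the output nonlinearity
    grad_v, grad_a = xi * z, xi
    zeta = (xi * v) * z * (1.0 - z)          # component-wise, as in (3.21)
    grad_W = np.outer(x, zeta)               # column i is dJ/dw_i
    grad_b = zeta
    return grad_W, grad_b, grad_v, grad_a

# one stochastic gradient step
rng = np.random.default_rng(6)
W, b, v, a = rng.normal(size=(3, 4)), np.zeros(4), rng.normal(size=4), 0.0
gW, gb, gv, ga = backprop_single_output(rng.normal(size=3), 1.0, W, b, v, a)
W, b, v, a = W - 0.1 * gW, b - 0.1 * gb, v - 0.1 * gv, a - 0.1 * ga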
Figure 3-8. The Time Extension of the Recurrent Neural Network in Figure 3-7.
For a dynamic system with delay connections, the whole network can be extended along time with the delay connections linking the nodes between time slices. The recurrent neural network in Figure 3-7 is shown in Figure 3-8, in which the structure in each time slice will only contain the instantaneous connections, and the delay connections will connect the corresponding nodes between time slices. Once a dynamic network is extended in time, the whole structure can be regarded as a large static net-
work and the back-propagation algorithm can be applied as usual. This is the so-called "back-propagation through time" (BPTT) [Wer90, Wil90, Hay98]. There is another algorithm for the training of dynamic networks, which is called "real-time recurrent learning" (RTRL) [Wil89, Hay98]. Both the BPTT and the RTRL are gradient-based methods and both of them use the chain rule to calculate the gradient. The difference is that the BPTT starts the chain rule from the end of a time block to the beginning of it, while the RTRL starts the chain rule from the beginning of a time block to the end of it, resulting in differences in memory complexity and computational complexity [Hay98].
• Newton’s Method
The gradient method is based on the first order approximation of the performance surface and is simple, but its convergence speed may be slow. Newton's method is based on the second order approximation of the performance surface and on the closed-form optimization solution to a quadratic function. First, let's look at the optimization solution to a quadratic function F(x) = \frac{1}{2} x^T A x - h^T x + c, where A \in R^{m \times m} is a symmetric matrix, either positive definite or negative definite, h \in R^m and x \in R^m are vectors, and c is a scalar constant. There is a maximum solution x_0 if A is negative definite, or a minimum solution x_0 if A is positive definite, where in both cases x_0 should satisfy the linear equation \partial F(x) / \partial x = 0, i.e. A x = h, or x_0 = A^{-1} h. For a general cost function J(w), its second order approximation at w = w_n will be

J(w) = J(w_n) + \frac{\partial J(w_n)}{\partial w}^T (w - w_n) + \frac{1}{2}(w - w_n)^T H(w_n)(w - w_n)     (3.22)
where H(w_n) is the Hessian matrix of J(w) at w = w_n. So, the optimization point for (3.22) is w - w_n = -H(w_n)^{-1}\, \partial J(w_n) / \partial w. Thus we have Newton's method as follows [Hes80, Hay98, Wid85]:

w_{n+1} = w_n - H(w_n)^{-1} \frac{\partial J(w_n)}{\partial w}     (3.23)

As pointed out in Haykin [Hay98], there are several problems for Newton's method to be applied to MLP training. For instance, Newton's method involves the calculation of the inverse of the Hessian matrix. It is computationally complex and there is no guarantee that the Hessian matrix is nonsingular and always positive or negative definite. For a nonquadratic performance surface, there is no guarantee of the convergence of Newton's method. To overcome these problems, there appear the Quasi-Newton method [Hay98] and the conjugate gradient method [Hes80, Hay98], etc.
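A minimal Python sketch of the Newton update (3.23) follows. It uses a quadratic test function with an analytic gradient and Hessian, so a single step reaches the optimum, as discussed above; it is illustrative only and ignores the robustness issues just mentioned.

import numpy as np

def newton_step(w, grad, hess):
    # Newton update of (3.23): w_{n+1} = w_n - H(w_n)^{-1} * dJ(w_n)/dw
    return w - np.linalg.solve(hess(w), grad(w))

# for the quadratic J(w) = 0.5 w^T A w - h^T w a single Newton step reaches the optimum
A = np.array([[4.0, 1.0], [1.0, 3.0]])     # symmetric positive definite
h = np.array([1.0, 2.0])
grad = lambda w: A @ w - h
hess = lambda w: A
w1 = newton_step(np.zeros(2), grad, hess)
print(np.allclose(w1, np.linalg.solve(A, h)))   # True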
• Quasi-Newton Method
This method uses an estimate of the inverse Hessian matrix without the calculation of the real inverse. This estimate is guaranteed to be positive definite for a minimization problem or negative definite for a maximization problem. However, the computational complexity is still on the order of O(W^2), where W is the number of parameters [Hay98].
• The Conjugate Gradient Method
The conjugate gradient method is based on the fact that the optimal point of a qua-
dratic function can be obtained by a sequential searches along the so called conjugate
directions rather than the direct calculation of the inverse of the Hessian matrix. There
is a guarantee that the optimal solution can be obtained within steps for a quadratic
function (W is the number of parameters). One method to obtain the conjugate directions is based on the gradient directions; i.e., the modification of the gradient directions may result in one set of conjugate directions, hence the name "conjugate gradient method" [Hes80, Hay98]. The conjugate gradient method can avoid the calculation of the inverse and even the evaluation of the Hessian matrix itself, and thus is computationally efficient. The conjugate gradient method is perhaps the only second-order optimization method which can be applied to large-scale problems [Hay98].
• The Natural Gradient Method
When a parameter space has a certain underlying structure, the ordinary gradient of a function does not represent its steepest direction, but the natural gradient does. The basic point of the natural gradient method is as follows [Ama98]:

For a cost function J(w), if the length of the small incremental vector dw is fixed, i.e. |dw|^2 = \varepsilon^2 where \varepsilon is a small constant, then the steepest descent direction of J(w) is -\partial J(w) / \partial w and the steepest ascent direction is \partial J(w) / \partial w. However, if the length of dw is constrained in such a way that the quadratic form (dw)^T G\, (dw) = \varepsilon^2, where G is the so-called Riemannian metric tensor, which is always positive definite, then the steepest descent direction will be -G^{-1}\, \partial J(w) / \partial w, and the steepest ascent direction will be G^{-1}\, \partial J(w) / \partial w.
• The Expectation and Maximization (EM) Algorithm
The EM algorithm can be generalized and summarized as the following inequality
called the generalized EM inequality [XuD95], which can be described as follows:
For a non-negative function f(D, \theta) = \sum_{i=1}^{l} f_i(D, \theta), f_i(D, \theta) \ge 0, \forall (D, \theta), where D = \{d_i \in R^k\} is the data set and \theta is the parameter set, we have

f(D, \theta_{n+1}) \ge f(D, \theta_n), \quad \text{if}\ \ \theta_{n+1} = \arg\max_{\theta} \sum_{i=1}^{l} f_i(D, \theta_n) \log f_i(D, \theta)     (3.24)

This inequality suggests an iterative method for the maximization of the function f(D, \theta) with respect to the parameters \theta, that is, the generalized EM algorithm (the functions f_i(D, \theta) and f(D, \theta) are not required to be pdfs, as long as they are non-negative functions). First, use the known parameters \theta_n to calculate f_i(D, \theta_n) and thus \sum_{i=1}^{l} f_i(D, \theta_n) \log f_i(D, \theta); this is the so-called expectation step (\sum_{i=1}^{l} f_i(D, \theta_n) \log f_i(D, \theta) can be regarded as a generalized expectation). Second, find the maximum point \theta_{n+1} of the expectation function \sum_{i=1}^{l} f_i(D, \theta_n) \log f_i(D, \theta); this is the so-called maximization step. The process can go on iteratively.

With this inequality, it is not difficult to prove the Baum-Eagon inequality, which is the basis for the training of the well-known hidden Markov model. The Baum-Eagon inequality can be stated as P(y) \ge P(x), where P(x) = P(\{x_{ij}\}) is a polynomial with nonnegative coefficients homogeneous of degree d in its variables \{x_{ij}\}; x = \{x_{ij}\} is a point in the domain PD: x_{ij} \ge 0, \sum_{j=1}^{q_i} x_{ij} = 1, i = 1, \ldots, p, j = 1, \ldots, q_i, and \sum_{j=1}^{q_i} x_{ij}\, \partial P(x) / \partial x_{ij} \ne 0 for all i; y = \{y_{ij}\} is another point in PD satisfying y_{ij} = x_{ij}\, \partial P(x) / \partial x_{ij} \Big/ \sum_{j=1}^{q_i} x_{ij}\, \partial P(x) / \partial x_{ij}. If we regard x as a parameter set, then this inequality also suggests an iterative way to maximize the polynomial P(x): the above y is a better estimate of the parameters (better meaning it makes the polynomial larger) and the process can go on iteratively. The polynomial can also be non-homogeneous but with nonnegative coefficients. This is a general result which has been
applied to train such general models as the multi-channel hidden Markov model [XuD96], where the calculation of the gradient \partial P(x) / \partial x_{ij} is still needed and is accomplished by back-propagation through time. So, the forward and backward algorithm in the training of the hidden Markov model can be regarded as the forward process and back-propagation through time for the hidden Markov network [XuD96]. The details of the EM algorithm can be found in Dempster and McLachlan [Dem77, Mcl96].
3.3 General Point of View
It can be seen from the above that there is a variety of learning criteria. Some of them are based on energy quantities, and some of them are based on information-theoretic measures. In this chapter, a unifying point of view will be given.
3.3.1 InfoMax Principle
In the late 1980s, Linsker gave a rather general point of view about learning or statistical signal processing [Lin88, Lin89]. He pointed out that the transformation of a random vector X observed at the input layer of a neural network to a random vector Y produced at the output layer of the network should be so chosen that the activities of the neurons in the output layer jointly maximize information about the activities in the input layer. To achieve this, the mutual information I(Y, X) between the input vector X and the output vector Y should be used as the cost function or criterion for the learning process of the neural network. This is called the InfoMax principle. The InfoMax principle provides a mathematical framework for self-organization of the learning network that is independent of the rule used for its implementation. This principle can also be viewed as the neural net-
work counterpart of the concept of channel capacity, which defines the Shannon limit on
the rate of information transmission through a communication channel. The InfoMax prin-
ciple is depicted in the following figure:
Figure 3-9. InfoMax Scheme
When the neural network or mapping system is deterministic, the mutual information is determined by the output entropy, as can be seen from I(Y, X) = H(Y) - H(Y | X), where H(Y) is the output entropy and H(Y | X) = 0 is the conditional output entropy when the input is given (since the input-output relation is deterministic, the conditional entropy is zero). So, in this case, the maximization of the mutual information is equivalent to the maximization of the output entropy.
3.3.2 Other Similar Information-Theoretic Schemes
Haykin summarized other information-theoretic learning schemes in [Hay98], which
all use the mutual information as the learning criteria but the schemes are formulated in
different ways. There are three other different scenarios which are described in the follow-
ing. Although the formulations are different, the spirit is the same as the InfoMax princi-
ple [Hay98].
• Maximization of the Mutual Information Between Scalar Outputs
As depicted in Figure 3-10, the objective of this learning scheme is to maximize the mutual information between two scalar outputs y_a and y_b such that the output y_a will convey the most information about y_b and vice versa. An example of this scheme is the spatially coherent feature extractor [Bec89, Bec92, Hay98], where, as depicted in Figure 3-11, the transformation of a pair of vectors X_a and X_b (representing adjacent, nonoverlapping regions of an image) by a neural system should be so chosen that the scalar output y_a of the system due to the input X_a maximizes information about the second scalar output y_b due to X_b.
Figure 3-10. Maximization of the Mutual Information between Scalar Outputs
Figure 3-11. Processing of two Neighboring Regions of an Image
Figure 3-12. Minimization of the Mutual Information between Scalar Outputs
• Minimization of the Mutual Information between Scalar Outputs
Similar to the previous scheme, this scheme tries to make the two scalar outputs as irrelevant as possible. An example of this scheme is the spatially incoherent feature extractor [Ukr92, Hay98]. As depicted in Figure 3-13, the transformation of a pair of input vectors X_a and X_b, representing data derived from corresponding regions in a pair of separate images, by a neural system should be so chosen that the scalar output y_a due to the input X_a minimizes information about the second scalar output y_b due to the input X_b, and vice versa.
Figure 3-13. Spatially Incoherent Feature Extraction
[Figure 3-13: two Gaussian RBF networks process the Horizontal-Horizontal and Horizontal-Vertical radar inputs X_a and X_b; the mutual information I(y_a, y_b) between their scalar outputs y_a and y_b is minimized]
Figure 3-14. Minimization of the Mutual Information among Outputs
• Statistical Independence between Outputs
This scheme requires that all the outputs of the system be independent of each other. The examples of this scheme are the systems for Blind Source Separation and Independent Component Analysis described in the previous chapters, where the systems are usually full-rank linear networks.
Figure 3-15. A General Learning Framework
[Figure 3-15: a learning system Y = q(X, W) maps the input signal X to the output signal Y; an information measure I(Y, D) between Y and the desired signal D is optimized]
3.3.3 A General Scheme
As can be seen from the above, all the existing learning schemes are by no means gen-
eral. The InfoMax principle deals with only the mutual information between the input and
the output, although it motivated the analysis of a learning process from information-theo-
retic angle. The other schemes summarized by Haykin are also some specific cases even
with the limitation of model linearity and Gaussian assumption. These learning schemes
have not considered the case with external teacher signals, i.e. the supervised learning
case. In order to unify all the schemes, a general learning framework is proposed here.
As depicted in Figure 3-15, this general learning scheme is nothing but the abstract and general learning model described in Chapter 1 with the specification of the learning mechanism as the optimization of an information measure based on the response Y of the learning system and the desired or teacher signal D. If the desired signal D is the input signal X and the information measure is the mutual information, then this scheme degenerates to the InfoMax principle. If the desired signal D is one or some of the output signals, then this scheme degenerates to the schemes summarized by Haykin and to the case of BSS and ICA. Even for a supervised learning case, where there is an external teacher signal D, the mutual information between the response Y of the learning system and the desired signal D can be maximized under this scheme. That means, in general, the purpose of learning is to transmit as much information about the desired signal D as possible into the output or response Y of the learning system. The extensively used MSE criterion is also contained in this scheme, where the difference or error signal Y - D is assumed white Gaussian with zero mean, and the minimization of the entropy of the error signal is equivalent to the minimization of the MSE criterion according to Chapter 2.
In this learning scheme, supervised learning can be defined as the case with an external desired signal. In this case, the learning system is organized such that its response best represents the desired signal. If the desired signal is either the input of the system or the output of the system, this scheme becomes unsupervised learning, where the system will self-organize such that either the output signal best represents the input signal, or the outputs are independent of each other or highly related to each other. The following will give two specific cases of this general point of view.
Figure 3-16. Learning as Information Transmission Layer-by-Layer
3.3.4 Learning as Information Transmission Layer-by-Layer
For a layered network, each layer can itself be regarded as a learning system. The
whole system is the concatenation of each layer. From the above general point of view, if
the desired signal is either an external one or the input signal, then each layer should serve
the same purpose for the learning as to transmit as much information about the desired sig-
nal as possible. In this way, the whole learning process is broken down to several small
scale learning processes and each small learning process can proceed sequentially. This is
an alternative learning scheme for a layered network where the back-propagation learning
algorithm has dominated for more than 10 years. The layer-by-layer learning scheme may simplify the whole learning process and shed more light on the essence of the learning process in this case. The scheme is shown in Figure 3-16. Examples of the application of such a learning scheme will be given in Chapter 5.
3.3.5 Information Filtering: Filtering beyond Spectrum
Traditional filtering is based on the spectrum, i.e. an energy quantity. The basic interest of traditional filtering is to find some signal components or signal subspace according to the spectrum. From the information-theoretic point of view, the signal components or signal subspace, linear or nonlinear, should be chosen not in the domain of the spectrum but in the domain of "the signal information structure." A signal may contain various kinds of information. The list of the various kinds of information will be the so-called "information spectrum." It is more desirable to choose signal components or subspaces according to such an "information spectrum" than to choose signal components according to the energy spectrum, which is the traditional way of filtering. The idea of information filtering proposed here will generalize the traditional way of filtering and bring more powerful tools into the signal processing area. Examples of the application of information filtering to pose estimation of SAR (synthetic aperture radar) images will be given in Chapter 5.
3.4 Learning by Information Force
The general point of view is important, but the practical implementation is more challenging. In this section, we will see how the general learning scheme can be implemented or further specified by using the powerful tools of the information potential and the cross information potential. The general learning scheme can be depicted as in Figure 3-17.
Figure 3-17. The General Learning Scheme by Information Potential
In the general learning scheme depicted in Figure 3-17, if the information measure
used is the entropy, then the information potential can be used; if the information measure
is the mutual information, then the cross information potential can be used. So, the infor-
mation potential in Figure 3-17 is a general term which stands for both the narrow sense
information potential and the cross information potential. We may call such a general term the general information potential.
Given a set of environmental data \{(x(n), d(n)),\ n = 1, \ldots, N\}, there will be a response data set \{y(n),\ n = 1, \ldots, N\}, y(n) = q(x(n), w); then the general information potential V(\{y(n)\}) can be calculated according to the formulas in Chapter 2. To optimize V(\{y(n)\}), the gradient method can be used. The gradient of V(\{y(n)\}) with respect to the parameters w of the learning system, and the learning rule of the system, will be

\frac{\partial V(\{y(n)\})}{\partial w} = \sum_{n=1}^{N} \frac{\partial V(\{y(n)\})}{\partial y(n)} \frac{\partial y(n)}{\partial w},\qquad w \leftarrow w \pm \eta\, \frac{\partial V(\{y(n)\})}{\partial w}     (3.25)
As described in Chapter 2, \partial V(\{y(n)\}) / \partial y(n) is the information force that the information particle y(n) receives in the information potential field. As pointed out above, \partial y(n) / \partial w is the sensitivity of the learning network output and it serves as the mechanism of error back-propagation in the error back-propagation algorithm. Here, (3.25) can be interpreted as "information force back-propagation." So, from a physical point of view such as a mass-energy point of view, the learning starts from the information potential field, where each information particle receives the information force from the field, which then transmits through the network to the parameters so as to drive them to a state which will make the information potential optimized. The information force back-propagation is illustrated in Figure 3-18, where the network functions as a "lever" which connects the parameters and the data samples (information particles) and transmits the force that the field impinges on the information particles to the parameters.
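To make the information-force idea concrete, the sketch below computes in Python the information potential of a set of one-dimensional output samples and the force on each sample. It assumes the quadratic information potential V = (1/N^2) \sum_i \sum_j G(y_i - y_j, 2\sigma^2) built from Gaussian kernels; this specific form is taken as an assumption about the Chapter 2 definition, and the code is an illustrative sketch, not the dissertation's implementation.

import numpy as np

def gauss(u, var):
    return np.exp(-u**2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def information_potential_and_forces(y, sigma2):
    # V = (1/N^2) sum_i sum_j G(y_i - y_j, 2*sigma2) and the force dV/dy_i on each sample
    diff = y[:, None] - y[None, :]                     # y_i - y_j
    k = gauss(diff, 2.0 * sigma2)                      # pairwise kernel values
    V = k.mean()                                       # information potential
    # dV/dy_i = (2/N^2) sum_j [-(y_i - y_j)/(2*sigma2)] G(y_i - y_j, 2*sigma2)
    forces = 2.0 * (-diff / (2.0 * sigma2) * k).mean(axis=1) / len(y)
    return V, forces

V, forces = information_potential_and_forces(np.array([0.0, 0.1, 1.5, 2.0]), sigma2=0.25)
print(V, forces)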
Figure 3-18. Illustration of Information Force Back-Propagation
3.5 Discussion of Generalization by Learning
The basic purpose of learning is to generalize. As pointed out in Chapter 1, generalization is nothing but making full use of the information given, neither less nor more. A similar
point of view can be found in Christensen [Chr80: page vii], where he pointed out: "The generalizations should represent all of the information which is available. The generalizations should represent no more information than is available." Ideas of this kind are found in ancient wisdom. The ancient Chinese philosopher Confucius pointed out: "Say 'know' when you know; say 'don't know' when you don't know; that is the real knowledge." Although Confucius' word is about the right attitude that a scholar should take, when we are thinking about machine learning today, this is still the right "attitude" that a machine should take in order to obtain information from its environment.

The information potential provides a powerful tool to achieve the balance of making full use of the given information while avoiding explicit or implicit assumptions that are not given. To be more specific, the information potential does not rely on any external assumption, and its formulation tells us that it examines each pair of data, extracting more detailed information from the data set than the traditional MSE criterion, where only the relative position between each data sample and the mean is considered and the relative position of each pair of data samples is ignored, so that the samples can be treated independently. In this respect, the information potential is similar to the support vector machine [Vap95, Cor95], where a maximum margin is pursued for a linear classifier, and for this purpose, detailed data distribution information is also needed. The support vector machine has been shown to have a very good generalization ability. The experimental results in Chapter 5 will also show that the information potential has a very good generalization ability too, with even better results than the support vector machine.
CHAPTER 4
LEARNING WITH ON-LINE LOCAL RULE: A CASE STUDY ON GENERALIZED EIGENDECOMPOSITION
In this chapter, the issue of learning with on-line local rules will be discussed. As
pointed out in Chapter 1, learning or adaptive evolution of a system can happen whenever
there are data flowing into the system, and thus should be on-line. For a biological neural
network, the strength of a synaptic connection will evolve only with its input and output
activities. For a learning machine, although the features of "on-line" and "locality" may not be necessary in some cases, a system with such features will certainly be much more appealing. The Hebbian rule is the well-known postulated rule for the adaptation of a neurobiological system [Heb49]. Here, it will be shown how the Hebbian rule and the anti-Hebbian rule can be mathematically related to the energy and cross-correlation of a signal, and how these simple rules can be combined together to achieve on-line local adaptation for a problem as intricate as generalized eigendecomposition. We will again see the role of the mass-energy concept.
4.1 Energy, Correlation and Decorrelation for Linear Model
In Chapter 3, a linear model was introduced, where the input-output relation is formulated in (3.1) and the system is illustrated in Figure 3-1. In the following, it will be shown how the energy measure of a linear model can be related to the Hebbian and anti-Hebbian learning rules.
4.1.1 Signal Power, Quadratic Form, Correlation, Hebbian and Anti-Hebbian Learning
In Figure 3-1, the output signal in the i-th node is y_i = w_i^T x. So, given a data set \{x(n),\ n = 1, \ldots, N\}, the power of the output signal y_i is the quadratic form

P = \frac{1}{N} \sum_{n=1}^{N} y_i(n)^2 = w_i^T S w_i,\qquad S = E[x x^T] = \frac{1}{N} \sum_{n=1}^{N} x(n) x(n)^T     (4.1)

where the covariance matrix S of the input signal is estimated from N samples and n is the time index. One of the consequences of the quadratic form of (4.1) is that it can be interpreted as a field in the space of the weights. The change in the power "field" with the projection w_i is shown in Figure 4-1, where the surfaces P = w_i^T S w_i = constant are hyper-ellipsoids. The normal vector of the surface P = constant is S w_i, which is proportional to \nabla_{w_i} P (the gradient of P). This means that the normal vector S w_i is the direction of the steepest ascent of the power P.
Figure 4-1. The power “field” P of the input signal
The Hebbian and the anti-Hebbian learning, although initially motivated by biological considerations [Heb49], happen to be consistent with the normal vector direction. These rules can be summarized as follows:
Hebbian:
\Delta w_i(n) \propto y_i(n)\, x(n)\ \ \text{(Sample-by-Sample Mode)};\qquad \Delta w_i \propto \sum_{n=1}^{N} y_i(n)\, x(n) \propto S w_i\ \ \text{(Batch Mode)}     (4.2)

Anti-Hebbian:
\Delta w_i(n) \propto -y_i(n)\, x(n)\ \ \text{(Sample-by-Sample Mode)};\qquad \Delta w_i \propto -\sum_{n=1}^{N} y_i(n)\, x(n) \propto -S w_i\ \ \text{(Batch Mode)}     (4.3)
where the adjustment of the projection w_i should be proportional to the correlation between the input and output signals for Hebbian learning (or the negative of the correlation for anti-Hebbian learning). So, the direction of Hebbian batch learning is actually the direction of the fastest ascent in the power field of the output signal, while anti-Hebbian batch learning moves the system weights in the direction of the fastest descent of the power field. The sample-by-sample Hebbian and anti-Hebbian learning rules are just the stochastic versions of their corresponding batch-mode learning rules. Hence, these simple rules are able to seek both the directions of the steepest ascent and descent in the input power field using only local information.
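The following Python sketch implements the sample-by-sample Hebbian and anti-Hebbian updates of (4.2) and (4.3) for a single linear node. It is an illustrative sketch; in practice a normalization such as Oja's rule (Section 4.3) is needed to keep the Hebbian weight bounded.

import numpy as np

def hebbian_step(w, x_n, eta):
    # Delta w proportional to y(n) * x(n), eq. (4.2)
    y_n = w @ x_n
    return w + eta * y_n * x_n

def anti_hebbian_step(w, x_n, eta):
    # Delta w proportional to -y(n) * x(n), eq. (4.3)
    y_n = w @ x_n
    return w - eta * y_n * x_n

# usage: Hebbian updates climb the output power field w^T S w
rng = np.random.default_rng(7)
X = rng.normal(size=(500, 3)) @ np.diag([2.0, 1.0, 0.3])
w = rng.normal(size=3)
for x_n in X:
    w = hebbian_step(w, x_n, eta=0.001)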
4.1.2 Lateral Inhibition Connections, Anti-Hebbian Learning and Decorrelation
Lateral inhibition connections adapted with anti-Hebbian learning are known to decorrelate signals. As shown in Figure 4-2, c is the lateral inhibition connection from y_i^+ to y_j^+, with y_i = y_i^+ and y_j = c\, y_i^+ + y_j^+. The cross-correlation between y_i and y_j is given by (4.4) (note that the upper-case C denotes the cross-correlation, and the lower-case c denotes the lateral inhibition connection).
C(y_i, y_j) = \sum_n y_i(n)\, y_j(n) = c \sum_n y_i^+(n)^2 + \sum_n y_i^+(n)\, y_j^+(n)     (4.4)
Figure 4-2. Lateral Inhibition Connection
Assume that the energy of the signal y_i, \sum_n y_i(n)^2, is always greater than 0. Then there always exists a value

c = -\sum_n y_i^+(n)\, y_j^+(n) \Big/ \sum_n y_i(n)^2     (4.5)

which will make C(y_i, y_j) = 0, i.e. decorrelate the signals y_i and y_j. The anti-Hebbian learning requires the adjustment of c to be proportional to the negative of the cross-correlation between the output signals, as (4.6) shows:

\Delta c = -\eta\, y_i(n)\, y_j(n)\ \ \text{(Sample-by-Sample Mode)};\qquad \Delta c = -\eta\, C(y_i, y_j) = -\eta \sum_n y_i(n)\, y_j(n)\ \ \text{(Batch Mode)}     (4.6)

where \eta is the learning step size. Accordingly, we have (4.7) for the batch mode:

\Delta C = \Delta c \sum_n y_i^+(n)^2 = -\eta E C,\qquad \left( E = \sum_n y_i(n)^2 > 0 \right)     (4.7)
It is obvious that 0 is the only fixed stable attractor of the dynamic process dC/dt = -EC. So, the anti-Hebbian learning will converge to decorrelate the signals as long as the learning step size \eta is small enough.

Summarizing the above, we can say that for a linear projection, Hebbian learning tends to maximize the output energy while anti-Hebbian learning tends to minimize the output energy, and for a lateral inhibition connection, anti-Hebbian learning tends to minimize the cross-correlation between the two output signals.
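The sketch below applies the anti-Hebbian update of (4.6) to a single lateral inhibition connection c in Python and checks that the cross-correlation between the two outputs shrinks; it is only an illustrative simulation with arbitrary data.

import numpy as np

rng = np.random.default_rng(8)
N = 2000
yi_plus = rng.normal(size=N)
yj_plus = 0.8 * yi_plus + 0.6 * rng.normal(size=N)   # correlated with yi_plus

c, eta = 0.0, 0.01
for n in range(N):
    yi = yi_plus[n]
    yj = c * yi_plus[n] + yj_plus[n]                  # lateral inhibition output
    c -= eta * yi * yj                                # anti-Hebbian update, eq. (4.6)

yj_out = c * yi_plus + yj_plus
print(np.mean(yi_plus * yj_plus), np.mean(yi_plus * yj_out))   # correlation before / after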
4.2 Eigendecomposition and Generalized Eigendecomposition
Eigendecomposition and generalized eigendecomposition arise naturally in many sig-
nal processing problems. For instance, principal component analysis (PCA) is basically an
eigenvalue problem with wide application in data compression, feature extraction and
other areas [Kun94, Dia96]; as another example, Fisher linear discriminant analysis
(LDA) is a generalized eigendecomposition problem [Dud73, XuD98]; signal detection
and enhancement [Dia96] and even blind source separation [Sou95] can also be related to
or formulated as an eigendecomposition or generalized eigendecomposition. Although the
solutions based on numerical methods have been well studied [Gol93], adaptive, on-line
solutions are more desirable in many cases [Dia96]. Adaptive on-line structures and meth-
ods such as Oja's rule [Oja82] and the APEX rule [Kun94] emerged in the past decade to solve the eigendecomposition problem. However, the study of adaptive on-line methods for generalized eigendecomposition is far from satisfactory. Mao and Jain [Mao95] use a two-step PCA for LDA, which is clumsy and not efficient; Principe and Xu [Pr97a, Pr97b] only discuss the two-class constrained LDA case; Diamantaras and Kung [Dia96] describe the problem as oriented PCA and present the rule only for the largest generalized
eigenvalue and its corresponding eigenvector. More recently, Chatterjee et al. [Cha97] formulate LDA from the point of view of heteroassociation and provide an iterative solution with a proof of convergence for its on-line version, but the method does not use local computations and is still computationally complex. Hence a systematic, on-line local algorithm for the generalized eigendecomposition is not presently available. In this chapter, an on-line local rule to adapt both the forward and lateral connections of a single-layer network is proposed which produces the generalized eigenvalues and the corresponding eigenvectors in descending order. The problem of eigendecomposition and generalized eigendecomposition will be formulated here in a different way, which will lead to the proposed solutions. An information-theoretic problem formulation for the eigendecomposition and the generalized eigendecomposition will be given in the following first, and then the formulation based on the energy measures for eigendecomposition and the generalized eigendecomposition.
4.2.1 The Information-Theoretic Formulation for Eigendecomposition and Generalized Eigendecomposition
As pointed out in Chapter 3, the first component of the PCA can be formulated as
maximizing an entropy difference, and the first component of the generalized eigende-
composition can also be formulated as maximizing an entropy difference. Here, more gen-
eralized formulations will be given.
Suppose there is one zero-mean Gaussian signal x(n) \in R^m, n = 1, \ldots, N, with covariance matrix S = E(x x^T) = \sum_{n=1}^{N} x(n) x(n)^T (the trivial constant scalar 1/N is ignored here for convenience) and one zero-mean white Gaussian noise with covariance matrix equal to the identity matrix I. After the linear transform shown in Figure 3-1, the signal
and the noise will still be a Gaussian signal and noise with covariance matrices w^T S w and w^T w respectively. The entropies of the outputs when the input is the signal and the noise will be the following, according to (2.42) in Chapter 2:

H(w^T x) = \frac{1}{2}\log(w^T S w) + \frac{k}{2}\log 2\pi + \frac{k}{2},\qquad H(w^T noise) = \frac{1}{2}\log(w^T w) + \frac{k}{2}\log 2\pi + \frac{k}{2}     (4.8)

If we are going to find a linear transform such that the information about the signal at the output end, i.e. H(w^T x), is maximized while the information about the noise at the output end, i.e. H(w^T noise), is minimized at the same time, the entropy difference can be used as the maximization criterion:

J = H(w^T x) - H(w^T noise) = \frac{1}{2}\log\frac{w^T S w}{w^T w}     (4.9)

equivalently,

J = \frac{w^T S w}{w^T w}     (4.10)

This problem is not an easy one but it has been studied before. Fortunately, the solution turns out to be the eigenvectors of S with the largest eigenvalues [Wil62, Dud73]:

S w_i = \lambda_i w_i,\qquad i = 1, \ldots, k\ \ (k\ \text{can be from 1 to}\ m)     (4.11)

So, the eigendecomposition can be regarded as finding a linear transform in the case of a Gaussian signal and Gaussian noise such that the entropy difference in the output is maximized; i.e., the output information entropy of the signal is maximized while the output information entropy of the noise is minimized at the same time. One may note that Renyi's entropy will lead to the same result.
Similarly, for the generalized eigendecomposition, suppose there are two zero-mean Gaussian signals x_1(n) and x_2(n), n = 1, \ldots, N, with covariance matrices S_1 = E[x_1 x_1^T] = \sum_{n=1}^{N} x_1(n) x_1(n)^T and S_2 = E[x_2 x_2^T] = \sum_{n=1}^{N} x_2(n) x_2(n)^T respectively (the trivial constant scalar 1/N is ignored for convenience). The outputs after the linear transform w will still be Gaussian signals with zero mean and covariance matrices w^T S_1 w and w^T S_2 w respectively. So the output information entropies for these two signals will be

H(w^T x_1) = \frac{1}{2}\log(w^T S_1 w) + \frac{k}{2}\log 2\pi + \frac{k}{2},\qquad H(w^T x_2) = \frac{1}{2}\log(w^T S_2 w) + \frac{k}{2}\log 2\pi + \frac{k}{2}     (4.12)

If we are looking for a linear transform such that, at the output, the information about the first signal is maximized while the information about the second signal is minimized, then we can use the entropy difference as the maximization criterion. In this case, the entropy difference will be (for both Shannon's entropy and Renyi's entropy)

J = \frac{1}{2}\log\frac{w^T S_1 w}{w^T S_2 w}     (4.13)

equivalently,

J = \frac{w^T S_1 w}{w^T S_2 w}     (4.14)

Again, this is not an easy problem. Fortunately the solution turns out to be the generalized eigenvectors with the largest generalized eigenvalues [Wil62, Dud73]:

S_1 w_i = \lambda_i S_2 w_i,\qquad i = 1, \ldots, k\ \ (k\ \text{can be from 1 to}\ m)     (4.15)

So, in the case of Gaussian signals, the generalized eigendecomposition is the same as finding a linear transform such that the information about the first signal at the output end
is maximized while the information about the second signal at the output end is mini-
mized.
4.2.2 The Formulation of Eigendecomposition and Generalized Eigendecomposition Based on the Energy Measures
Based on the energy criterion, the eigendecomposition can also be formulated as finding k linear projections w_i \in R^m, i = 1, \ldots, k (k from 1 to m) (Figure 3-1) which maximize the criteria in (4.16),

J(w_i) = \frac{w_i^T S w_i}{w_i^T w_i}\qquad \text{subject to}\quad w_i^T w_j^o = 0,\quad j = 1, \ldots, i-1     (4.16)

where w_j^o \in R^m are the projections which maximize J(w_j). Obviously, when i = 1, there is no constraint for the maximization of (4.16). Using Lagrange multipliers we can verify that the solutions (\lambda_i = J(w_i^o)) of the optimization are eigenvectors and eigenvalues which satisfy S w_i^o = \lambda_i w_i^o, where the \lambda_i are the eigenvalues of S in descending order. From Section 4.1, we know that the numerator in (4.16) is the power of the output signal of the projection w_i when the input x(n) is applied. The denominator can actually be regarded as the power of a white noise source applied to the same linear projection in the absence of x(n), since w_i^T w_i = w_i^T I w_i, where I is the identity matrix, i.e. the covariance matrix of the noise. So, the eigendecomposition is actually the optimization of a signal-to-noise ratio (maximizing the signal power with respect to an alternate white noise source applied to the same linear projection), which is an interesting observation for signal processing applications.

The constraints in (4.16) simply require the orthogonality of each pair of projections. Since the w_j^o are eigenvectors of S, equivalent constraints can be written as
110
w_i^T w_j^o λ_j = w_i^T S w_j^o = Σ_n y_i(n) y_j^o(n) = 0    (4.17)

which means exactly the decorrelation between each pair of output signals. This derivation can be summarized by saying that an eigendecomposition finds a set of projections so that the outputs are most correlated with the input while the outputs themselves are decorrelated with each other.

Similarly, the criterion in (4.14) is equivalent to the following criteria [Wil62, Dud73, XuD98].

Let x_l(n) ∈ R^m, n = 1, 2, ..., l = 1, 2, be two zero-mean ergodic stationary random signals. The auto-correlation matrix E[x_l(n) x_l(n)^T] can be estimated by S_l = Σ_n x_l(n) x_l(n)^T. The problem is to find v_i ∈ R^m, i = 1, ..., k (k can be from 1 to m) which maximize

J(v_i) = (v_i^T S_1 v_i)/(v_i^T S_2 v_i)   subject to   v_i^T S v_j^o = 0,   j = 1, ..., i-1    (4.18)

where v_j^o is the j-th optimal projection vector which maximizes J(v_j), S in the constraints can be either S_1 or S_2 or S_1 + S_2, and S_1, S_2 are assumed positive definite. Obviously, when i = 1, there is no constraint for the maximization of (4.18). After v_1^o is obtained, v_i^o (i = 2, ..., k) will be obtained sequentially in a descending order of J(v_i^o). Using Lagrange multipliers we can verify that the optimization solutions (λ_i = J(v_i^o) > 0) are generalized eigenvalues and eigenvectors satisfying S_1 v_i^o = λ_i S_2 v_i^o, which can be used to justify the equivalence of the three alternative choices of S. In fact, v_i^T S_1 v_j^o = v_i^T S_2 v_j^o λ_j and v_i^T (S_1 + S_2) v_j^o = v_i^T S_2 v_j^o (1 + λ_j); thus any of the three choices will result in the others and they are equivalent. This is why the problem is called the generalized eigendecomposition.
Let y_il(n) = v_i^T x_l(n) denote the i-th output when the input is x_l(n); then v_i^T S_l v_i = Σ_n y_il(n)^2 is the energy of the i-th output and v_i^T S_l v_j = Σ_n y_il(n) y_jl(n) is the cross-correlation between the i-th and j-th outputs when the input is x_l(n). This suggests that the criteria in (4.18) are energy ratios of two signals after projection, where the constraints simply require the decorrelation between each pair of output signals. Therefore the problem is formulated as an optimal signal-to-signal ratio with decorrelation constraints.

4.3 The On-line Local Rule for Eigendecomposition

4.3.1 Oja's Rule and the First Projection

As mentioned above, there is no constraint for the optimization of the first projection for the eigendecomposition, and the criterion is to let the output energy (or power) of the signal be as large as possible while letting the energy (or power) of the output of the white noise be as small as possible. By the result in 4.1, we know that the normal vector S w_1 is the steepest ascent direction of the output energy when the input is the signal x(n), while the normal vector -I w_1 = -w_1 is the steepest descent direction of the output energy when the input is the white noise. Thus, we can postulate that the adjustment of w_1 should be a combination of the two normal vectors S w_1 and -w_1:

Δw_1 ∝ S w_1 - a I w_1 = S w_1 - a w_1    (4.19)

where a is a positive scalar which balances the roles of the two normal vectors. If we choose a = J(w_1) = (w_1^T S w_1)/(w_1^T w_1), then (4.19) is the gradient method. The choice a = w_1^T S w_1 will lead to the so-called Oja's rule [Oja82]:

Δw_1 ∝ S w_1 - (w_1^T S w_1) w_1 = Σ_n y_1(n) [x(n) - y_1(n) w_1]    (Batch Mode)
Δw_1 ∝ y_1(n) [x(n) - y_1(n) w_1]    (Sample-by-Sample Mode)    (4.20)
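The sample-by-sample form of (4.20) is easy to simulate. The following sketch is an illustrative NumPy implementation under assumed toy data (independent components with the largest variance on the first coordinate, so the principal eigenvector is known in advance); the step size and number of passes are arbitrary choices for the example, not values from the dissertation.

    import numpy as np

    # illustrative sketch; data and step size are assumptions, not from the dissertation
    rng = np.random.default_rng(1)
    m, N = 5, 2000
    scales = np.array([2.0, 1.0, 0.8, 0.5, 0.3])     # largest variance on coordinate 0
    X = rng.standard_normal((N, m)) * scales          # zero-mean signal x(n)

    w = rng.standard_normal(m)
    eta = 0.005
    for _ in range(3):                                # a few passes over the data
        for x in X:                                   # sample-by-sample mode of (4.20)
            y = w @ x                                 # output y_1(n) = w^T x(n)
            w += eta * y * (x - y * w)                # Hebbian term minus Oja's normalizing term

    # compare with the principal eigenvector of the sample covariance S
    S = X.T @ X / N
    _, vecs = np.linalg.eigh(S)
    v1 = vecs[:, -1]
    print("|cosine with principal eigenvector| =", abs(w @ v1) / np.linalg.norm(w))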
Oja's rule will make w_1 converge to w_1^o, the eigenvector with the largest eigenvalue of S, and also make ||w_1|| converge to 1; i.e., ||w_1|| → 1 [Oja82]. The convergence proof can be found in Oja [Oja82]. In the next section, we present a geometrical explanation of the above rule so that its convergence can be easily understood.

Figure 4-3. Geometrical Explanation to Oja's Rule. Panels: (a) ||w|| = 1, (b) ||w|| > 1, (c) ||w|| < 1; each panel shows the vectors x, yw and x - yw relative to w.

4.3.2 Geometrical Explanation to Oja's Rule

When ||w_1|| = 1, the balancing scalar in Oja's rule is a = w_1^T S w_1 = (w_1^T S w_1)/(w_1^T w_1). So, in this case, the updating term of Oja's rule Δw_1 ∝ S w_1 - a w_1 = Σ_n y_1(n) [x(n) - y_1(n) w_1] is the same as the gradient direction, which is always perpendicular to w (because w_1^T (S w_1 - ((w_1^T S w_1)/(w_1^T w_1)) w_1) = 0). This is also true even for the sample-by-sample case Δw ∝ y [x - y w] (all the indices are ignored for convenience). When ||w|| = 1, obviously w^T (x - y w) = 0; i.e., the direction of the updating vector x - y w is perpendicular to w, as shown in Figure 4-3 (a). So in general, the updating vector x - y w in Oja's rule can be decomposed into two components, one is the gradient component Δw_⊥w and the other, Δw_w, is along the direction of the vector w (as shown in Figure 4-3 (b) and (c)):

Δw ∝ Δw_⊥w + Δw_w    (4.21)
(4.22)

The gradient component Δw_⊥w will force w towards the right direction, i.e. the eigenvector direction, while the vector component Δw_w adjusts the length of w. As shown in Figure 4-3 (b) and (c), when ||w|| > 1, it tends to decrease ||w||; when ||w|| < 1, it tends to increase ||w||. So, it serves as a negative feedback control for ||w|| and the equilibrium point is ||w|| = 1. Therefore, even without an explicit normalization of the norm of w, Oja's rule will still force ||w|| to be 1. Unfortunately, when Oja's rule is used for the minor component (the eigenvector with the smallest eigenvalue, where the criterion in (4.16) is to be minimized), the updating of w becomes anti-Hebbian type. In this case, Δw_w will serve as a positive feedback control for ||w|| and Oja's rule becomes unstable. One simple method to stabilize Oja's rule for minor components is to perform an explicit normalization of the norm of w so that Oja's rule is exactly equivalent to the gradient descent method. In spite of the normalization w = w/||w||, this method is comparable with the other methods in computational complexity because all the methods need to calculate the value of w^T w.

4.3.3 Sanger's Rule and the Other Projections

For the other projections, the difference is the constraint in (4.16). For the i-th projection, we can project the normal vector S w_i onto the subspace orthogonal to all the previous eigenvectors w_j^o to meet the constraint and apply Oja's rule in that subspace to find the optimal signal-to-noise ratio in the subspace. This is called the deflation method. By using the concept of the deflation method, Sanger [San89] proposed the rule in (4.22), which will degenerate to Oja's rule when i = 1:

Δw_i ∝ (I - Σ_{j=1}^{i-1} w_j w_j^T) S w_i - (w_i^T S w_i) w_i    (4.22)
where I - Σ_{j=1}^{i-1} w_j w_j^T is the projection transform onto the subspace perpendicular to all the previous w_j, j = 1, ..., i-1. According to Oja's rule, w_1 will converge to the first eigenvector with the largest corresponding eigenvalue and ||w_1|| → 1. Based on this and the rule in (4.22), w_2 will converge to the second eigenvector with the second largest eigenvalue and ||w_2|| → 1. A similar situation will happen for the rest of the w_i. Therefore, Sanger's rule will sequentially result in the eigenvectors of S in the descending order of their corresponding eigenvalues.

The corresponding batch mode adaptation and sample-by-sample adaptation rules for Sanger's method are

Δw_i ∝ Σ_n y_i(n) [x(n) - Σ_{j=1}^{i-1} y_j(n) w_j - y_i(n) w_i]    (Batch Mode)
Δw_i ∝ y_i(n) [x(n) - Σ_{j=1}^{i-1} y_j(n) w_j - y_i(n) w_i]    (Sample-by-Sample Mode)    (4.23)

Sanger's rule is not local because the updating of w_i involves all the previous projections w_j and their outputs y_j. In a biological neural network, the adaptation of the synapses should be local. In addition, the locality will make the VLSI implementation of an algorithm much easier. We next will introduce the local implementation of Sanger's rule.
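A sketch of the sample-by-sample form of Sanger's rule (4.23) is given below; the data, step size and number of passes are illustrative assumptions, and the final check simply compares each w_i against the eigenvectors of the sample covariance.

    import numpy as np

    # illustrative sketch; data and step size are assumptions, not from the dissertation
    rng = np.random.default_rng(2)
    m, k, N = 6, 3, 3000
    scales = np.array([3.0, 2.0, 1.5, 0.7, 0.5, 0.3])
    X = rng.standard_normal((N, m)) * scales          # independent components -> eigenvectors are the axes

    W = 0.1 * rng.standard_normal((k, m))             # rows w_i, i = 1, ..., k
    eta = 0.002
    for _ in range(5):                                # a few passes over the data
        for x in X:
            y = W @ x                                 # outputs y_i(n) = w_i^T x(n)
            for i in range(k):                        # sample-by-sample form of (4.23)
                resid = x - y[:i] @ W[:i] - y[i] * W[i]   # x - sum_{j<i} y_j w_j - y_i w_i
                W[i] += eta * y[i] * resid

    S = np.cov(X.T, bias=True)
    _, vecs = np.linalg.eigh(S)
    for i in range(k):                                # each w_i should align with the i-th principal axis
        print(i, abs(W[i] @ vecs[:, -1 - i]) / np.linalg.norm(W[i]))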
4.3.4 APEX Model: The Local Implementation of Sanger’s Rule
As stated above, the purpose of eigendecomposition is to find the projections whose outputs are most correlated with the input signals and decorrelated with each other. Starting from this point and considering the results in 4.1, the structure in Figure 4-4 is proposed.
Figure 4-4. Linear Projections with Lateral Inhibitions (forward weights w_1, ..., w_k, outputs y_1, ..., y_k, lateral connections c_12, ..., c_1k, c_2k, common input x)

In Figure 4-4, c_ij are lateral inhibition connections expected to decorrelate the output signals. The input-output relation for the i-th projection is

y_i = w_i^T x + Σ_{j=1}^{i-1} c_ji y_j = (w_i + Σ_{j=1}^{i-1} c_ji w_j)^T x    (4.24)

So, the overall i-th projection is v_i = w_i + Σ_{j=1}^{i-1} c_ji w_j and the input-output relation can be written y_i = v_i^T x. For the simplicity of exposition, we will just consider the second projection v_2. (For the first projection w_1, we already have Oja's rule; suppose it has already converged to the solution--the eigenvector with the largest eigenvalue of S--and is fixed.) The second projection will represent all the other projections (the rule for all the rest can be similarly obtained). For the structure in Figure 4-4, the overall second projection is v_2 = w_2 + c_12 w_1. The problem can be restated as finding the projection v_2 such that the following criterion is maximized.
J(v_2) = (v_2^T S v_2)/(v_2^T v_2),   subject to   v_2^T S w_1 = 0    (4.25)

where w_1 is the solution for the first projection, i.e. the eigenvector with the largest eigenvalue of S, and can be assumed fixed during the adaptation of the second projection. The overall change of v_2 can result from the variation of both the forward projection w_2 and the lateral inhibition connection c_12; i.e., we have

Δv_2 = Δw_2 + (Δc_12) w_1    (4.26)

To make the problem more tractable, we will consider how the overall projection v_2 should change if we fix c_12, and how it should change if we fix w_2. By the basic principle in 4.1 (that is, using the Hebbian rule to increase an output energy and using the anti-Hebbian rule to decrease an output energy), if c_12 is fixed, the overall projection should evolve according to Oja's rule so as to increase the energy v_2^T S v_2 and at the same time decrease v_2^T v_2:

Δv_2 = S v_2 - (v_2^T S v_2) v_2    (4.27)

However, v_2 is a virtual projection and relies on both w_2 and c_12. In this case, when c_12 is fixed, Δv_2 = Δw_2. So, (4.27) can be implemented by (4.28):

Δw_2 = S v_2 - (v_2^T S v_2) v_2    (4.28)

When w_2 is fixed, the adaptation of c_12 should decorrelate the two signals y_1 and y_2. According to the conclusion in 4.1 (i.e. using the anti-Hebbian rule to decrease the cross-correlation between two outputs as in Figure 4-4), the adaptation of c_12 should be

Δc_12 = -Σ_n y_1(n) y_2(n) = -w_1^T S v_2    (4.29)
So, by the principle in 4.1, we can postulate the adaptation rule as (4.28) and (4.29) together:

Δw_2 = S v_2 - (v_2^T S v_2) v_2 = Σ_n y_2(n) x(n) - (Σ_n y_2(n)^2) w_2 - (Σ_n y_2(n)^2) c_12 w_1
Δc_12 = -w_1^T S v_2 = -Σ_n y_1(n) y_2(n)    (4.30)

Surprisingly, we may find out that this rule is actually the same as Sanger's rule if we write down the adaptation for the overall projection as (4.31) and compare it with (4.22).

Δv_2 = Δw_2 + w_1 (Δc_12) = S v_2 - (v_2^T S v_2) v_2 - w_1 w_1^T S v_2 = (I - w_1 w_1^T) S v_2 - (v_2^T S v_2) v_2    (4.31)

However, from (4.30) we can see that the adaptation of w_2 is not local either; i.e., Δw_2 depends not only on its input, output and itself, but also on w_1 and c_12, which are contained in the last term of Δw_2 in (4.30). The last term of Δw_2 in (4.30) means that part of the adaptation of w_2 should be along the direction of w_1. And this can actually be implemented by adapting the lateral inhibition connection c_12; i.e., the last term of Δw_2 in (4.30) can be put into Δc_12 instead of Δw_2. From (4.30), we have

Δv_2 = Δw_2 + w_1 (Δc_12)
     = Σ_n y_2(n) x(n) - (Σ_n y_2(n)^2) w_2 - (Σ_n y_2(n)^2) c_12 w_1 - (Σ_n y_1(n) y_2(n)) w_1
     = Σ_n y_2(n) (x(n) - y_2(n) w_2) - (Σ_n y_2(n) (y_1(n) + c_12 y_2(n))) w_1    (4.32)

To keep the adaptation of v_2 unchanged, we can write new adaptation rules for both w_2 and c_12 as
Δw_2 = Σ_n y_2(n) (x(n) - y_2(n) w_2)
Δc_12 = -Σ_n y_2(n) (y_1(n) + c_12 y_2(n))    (4.33)

where the adaptations of both w_2 and c_12 are "local." (4.33) is actually the adaptation rule of the APEX model [Kun94], and all the above gives an intuitive explanation to the APEX model and also shows that the APEX model is nothing but a local implementation of Sanger's rule.

Generally, the sample-by-sample adaptation for the APEX model is as follows:

Δw_i ∝ y_i(n) [x(n) - y_i(n) w_i]
Δc_ji ∝ -y_i(n) [y_j(n) + y_i(n) c_ji]    (APEX Adaptation)    (4.34)
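The locality of (4.34) is easy to see in code: each update uses only a unit's own output, its inputs and its own weights. The following sketch first trains the first projection with Oja's rule and then the second projection with the APEX updates; all data and step sizes are illustrative assumptions made for the example.

    import numpy as np

    # illustrative sketch; data and step sizes are assumptions, not from the dissertation
    rng = np.random.default_rng(3)
    m, N = 5, 4000
    scales = np.array([3.0, 2.0, 1.0, 0.6, 0.4])
    X = rng.standard_normal((N, m)) * scales

    eta = 0.002
    w1 = rng.standard_normal(m)
    for x in X:                                   # first projection: Oja's rule (4.20)
        y1 = w1 @ x
        w1 += eta * y1 * (x - y1 * w1)

    w2, c12 = 0.1 * rng.standard_normal(m), 0.0
    for _ in range(3):
        for x in X:                               # APEX rule (4.34) for i = 2
            y1 = w1 @ x
            y2 = w2 @ x + c12 * y1                # output with lateral inhibition, (4.24)
            w2 += eta * y2 * (x - y2 * w2)        # Hebbian/Oja part, purely local
            c12 -= eta * y2 * (y1 + y2 * c12)     # anti-Hebbian decorrelating part, purely local

    S = np.cov(X.T, bias=True)
    _, vecs = np.linalg.eigh(S)
    v2 = w2 + c12 * w1                            # overall second projection
    print(abs(v2 @ vecs[:, -2]) / np.linalg.norm(v2))   # alignment with the 2nd eigenvector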
4.4 An Iterative Method for Generalized Eigendecomposition
Chatterjee et al. [Cha97] formulate the LDA as a heteroassociation problem and propose an iterative method for LDA. Since the LDA is a special case of the generalized eigendecomposition, the iterative method can be further generalized for the generalized eigendecomposition.

Using the same notation as in 4.2, the iterative method for the generalized eigendecomposition can be described as

Δv_i = S_1 v_i - (v_i^T S_1 v_i) S_2 v_i - S_2 Σ_{j=1}^{i-1} v_j v_j^T S_1 v_i,   i = 1, ..., k    (4.35)

This method assumes that the covariance matrices have already been calculated, and then the generalized eigenvectors can be iteratively obtained by (4.35). There is another
alternative method which uses some optimal relation in the problem formulation but results in a more complex rule [Cha97]:

Δv_i = S_1 v_i - (v_i^T S_1 v_i) S_2 v_i - S_2 Σ_{j=1}^{i-1} v_j v_j^T S_1 v_i
       + S_1 v_i - (v_i^T S_2 v_i) S_1 v_i - S_1 Σ_{j=1}^{i-1} v_j v_j^T S_2 v_i    (4.36)

For two zero-mean signals x_1(n) and x_2(n), their covariance matrices can be estimated on-line by using

S_1(n) = S_1(n-1) + γ(n) (x_1(n) x_1(n)^T - S_1(n-1))
S_2(n) = S_2(n-1) + γ(n) (x_2(n) x_2(n)^T - S_2(n-1))    (4.37)

where γ(n) is a scalar gain sequence [Cha97]. Based on (4.37), an adaptive on-line algorithm for the generalized eigendecomposition can be the same as (4.35) or (4.36), except that all the terms there are estimated on-line; i.e.,

S_1 = S_1(n),   S_2 = S_2(n),   v_i = v_i(n),   v_j = v_j(n)    (4.38)
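A compact sketch of this adaptive scheme follows: the covariance estimates are updated as in (4.37) and plugged into the first-projection form of (4.35). The gain sequence, step size and test data are illustrative assumptions, and the result is compared against a batch generalized eigendecomposition.

    import numpy as np

    # illustrative sketch; data, gain sequence and step size are assumptions
    rng = np.random.default_rng(4)
    m, N = 4, 20000
    X1 = rng.standard_normal((N, m)) * np.array([2.0, 1.5, 1.0, 0.5])   # signal x1(n)
    X2 = rng.standard_normal((N, m))                                     # signal x2(n), S2 close to I

    S1, S2 = np.eye(m), np.eye(m)                 # running covariance estimates, (4.37)
    v = rng.standard_normal(m)
    eta = 1e-3
    for n in range(N):
        g = 1.0 / (n + 10)                        # decaying gain gamma(n)
        S1 += g * (np.outer(X1[n], X1[n]) - S1)
        S2 += g * (np.outer(X2[n], X2[n]) - S2)
        # first projection of the iterative rule (4.35): dv = S1 v - (v^T S1 v) S2 v
        v += eta * (S1 @ v - (v @ S1 @ v) * (S2 @ v))

    # compare with a batch generalized eigendecomposition
    S1b, S2b = X1.T @ X1 / N, X2.T @ X2 / N
    lam, W = np.linalg.eig(np.linalg.inv(S2b) @ S1b)
    w_top = W[:, np.argmax(lam.real)].real
    print(abs(v @ w_top) / (np.linalg.norm(v) * np.linalg.norm(w_top)))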
The convergence of this adaptive on-line algorithm can be shown by the stochastic approximation theory [Cha97, Dia96], of which the major point is that a stochastic algorithm will converge to the solution of its corresponding deterministic ordinary differential equation (ODE) with probability 1 under certain conditions [Dia96, Cha97]. Formally, we have a stochastic recursive algorithm:

θ_{k+1} = θ_k + β_k f(x_k, θ_k),   k = 0, 1, 2, ...    (4.39)
where x_k ∈ R^m is a sequence of random vectors, β_k is a sequence of step-size parameters, f is a continuous and bounded function, and θ_k ∈ R^P is a sequence of approximations of a desired parameter vector θ^o. If the following assumptions are satisfied for all fixed θ (E is the expectation operator), then the corresponding deterministic ODE for (4.39) is dθ/dt = f̄(θ), and θ_k will converge to the solution of this ODE with probability 1 as k approaches ∞ [Dia96].

• A-1. The step-size sequence satisfies β_k → 0 and Σ_{k=0}^{∞} β_k = ∞.
• A-2. f(·,·) is a bounded and measurable R^P-valued function.
• A-3. For any fixed x, the function f(x,·) is continuous and bounded (uniformly in x).
• A-4. There is a function f̄(θ) = lim_{k→∞} (Σ_{i=k}^{∞} β_i f(x_i, θ))/(Σ_{i=k}^{∞} β_i) = lim_{k→∞} E f(x_k, θ).

4.5 An On-line Local Rule for Generalized Eigendecomposition

As stated in 4.2.2, the generalized eigendecomposition problem can be formulated as the problem of the optimal signal-to-signal ratio with decorrelation constraints. Here, the network structure of the APEX model will be used for this more complicated problem. As shown in Figure 4-5, w_i ∈ R^m are forward linear projecting vectors and c_ij are lateral inhibitive connections used to force decorrelation among the output signals, but the input is switched between the two zero-mean signals x_1(n) and x_2(n) at each time instant n. The overall projection is the combination of the two types of connections, e.g., v_1 = w_1, v_2 = c_12 w_1 + w_2, etc. The i-th output for the input x_l(n) will be y_il(n) = v_i^T x_l(n), etc. The proposed on-line local rule for the network in Figure 4-5 for the generalized eigendecomposition will be discussed in the following sections.
Figure 4-5. Linear Projections with Lateral Inhibitions and Two Inputs (forward weights w_1, ..., w_k, outputs y_1, ..., y_k, lateral connections c_12, ..., c_1k, c_2k; the input is switched between x_1(n) and x_2(n))

4.5.1 The Proposed Learning Rule for the First Projection

In this section, we will first discuss the batch mode rule for the adaptation of the first projection, then the stability analysis for the batch mode rule, and finally the corresponding adaptive on-line rule for the first projection.

A. The Batch Mode Adaptation Rule

Since there is no constraint for the optimization of the first projection v_1, its output y_1 doesn't receive any lateral inhibition, thus v_1 = w_1 as shown in Figure 4-5. The normal vector for the power field w_1^T S_1 w_1 is H_1(w_1) = S_1 w_1 = Σ_n y_11(n) x_1(n), and the normal vector for the power field w_1^T S_2 w_1 is H_2(w_1) = S_2 w_1 = Σ_n y_12(n) x_2(n). To increase w_1^T S_1 w_1 and decrease w_1^T S_2 w_1 at the same time, the adaptation should be
Δw_1 = H_1(w_1) - H_2(w_1) f(w_1),   w_1 = w_1 + η Δw_1    (4.40)

where η is the learning step size, the Hebbian term H_1(w_1) will "enhance" the output signal y_11(n), the anti-Hebbian term -H_2(w_1) will "attenuate" the output signal y_12(n), and the scalar f(w_1) will play the balancing role. If f(w_1) = (w_1^T S_1 w_1)/(w_1^T S_2 w_1) is chosen, then (4.40) is the gradient method. If f(w_1) = w_1^T w_1, then (4.40) becomes the method used in Diamantaras and Kung [Dia96]. Similar to Oja's rule, the balancing scalar f(w_1) can be simplified as f(w_1) = w_1^T P w_1 (P = S_1 or S_2 or (S_1 + S_2)), because in this case the scalar can be simplified as the output energy, e.g. w_1^T S_1 w_1 = Σ_n y_11(n)^2. In the sequel, the case f(w_1) = w_1^T S_1 w_1 will be discussed.

Figure 4-6. The Regions Related to the Variation of the Norm of w. (Labels in the figure: the hyper-ellipsoid w^T S_2 w = 1; the spheres ||w||^2 = 1/λ_max and ||w||^2 = 1/λ_min, where λ_max and λ_min are the maximum and minimum eigenvalues of S_2; and the regions D_1, D_2, D_3, D_4.)
B. The Stability Analysis of the Batch Mode Rule

The stationary points of the adaptation process (4.40) can be obtained by solving the equation H_1(w_1) - H_2(w_1) f(w_1) = (S_1 - f(w_1) S_2) w_1 = 0. Obviously, w_1 = 0 and all the generalized eigenvectors v_i^o which satisfy S_1 v_i^o = f(v_i^o) S_2 v_i^o are stationary points. Notice that in general the length of v_i^o should be specified by f(v_i^o) = λ_i (λ_i are the generalized eigenvalues corresponding to v_i^o). So, the v_i^o are further denoted by v_{λi}^o. In the case of f(w_1) = w_1^T S_1 w_1, we have (v_{λi}^o)^T S_1 v_{λi}^o = λ_i and (v_{λi}^o)^T S_2 v_{λi}^o = 1. We will show that when f(w_1) = w_1^T P w_1, there is only one stable stationary point, which is the solution w_1 = v_{λ1}^o. All the rest are unstable stationary points.

Let's look at the case f(w_1) = w_1^T S_1 w_1; the rest will be similar. First, it can be shown that w_1 = 0 is not stable. To show this, we can calculate the first order approximation of the variation of ||w_1||^2, which is Δ(||w_1||^2) = 2 w_1^T (Δw_1) = 2η (w_1^T S_1 w_1 - w_1^T S_2 w_1 f(w_1)) = 2η w_1^T S_1 w_1 (1 - w_1^T S_2 w_1). Since w_1^T S_1 w_1 ≥ 0, the sign of the variation totally depends on 1 - w_1^T S_2 w_1. As shown in Figure 4-6, when w_1 is located within the region D_2, i.e. w_1^T S_2 w_1 < 1, Δ(||w_1||^2) is positive and ||w_1|| will increase, while when w_1 is located outside the region D_2, i.e. w_1^T S_2 w_1 > 1, Δ(||w_1||^2) is negative and ||w_1|| will decrease. So, the stable stationary points should be located on the hyper-ellipsoid w_1^T S_2 w_1 = 1. Therefore, w_1 = 0 can not be a stable stationary point. This can also be shown by the Lyapunov local asymptotic stability analysis [Kha92]. The behavior of the algorithm described by (4.40) can be characterized by the following differential equation:

dw_1/dt = Φ(w_1) = H_1(w_1) - H_2(w_1) f(w_1) = S_1 w_1 - (w_1^T S_1 w_1) S_2 w_1    (4.41)
Obviously, this is a nonlinear dynamic system. The instability at w_1 = 0 can be determined by the position of the eigenvalues of the linearization matrix A:

A = dΦ(w_1)/dw_1 |_{w_1=0} = [S_1 - 2 S_2 w_1 w_1^T S_1 - (w_1^T S_1 w_1) S_2]_{w_1=0} = S_1    (4.42)

Since S_1 is positive definite, all its eigenvalues will be positive. So, the dynamic process dw_1/dt = A w_1 can not be stable at w_1 = 0; i.e., (4.40) is not stable at w_1 = 0.

Similarly, w_1 = v_{λi}^o, i = 2, ..., m, can be shown to be unstable too. Actually, in these cases, the corresponding linearization matrix A will be

A = [S_1 - 2 S_2 w_1 w_1^T S_1 - (w_1^T S_1 w_1) S_2]_{w_1=v_{λi}^o} = S_1 - 2 λ_i S_2 v_{λi}^o (v_{λi}^o)^T S_2 - λ_i S_2,   i = 2, ..., m    (4.43)

By using (v_{λ1}^o)^T S_2 v_{λi}^o = 0 (i = 2, ..., m), (v_{λ1}^o)^T S_1 v_{λ1}^o = λ_1 and (v_{λ1}^o)^T S_2 v_{λ1}^o = 1, we have

(v_{λ1}^o)^T A v_{λ1}^o = λ_1 - λ_i > 0    (4.44)

The inequality in (4.44) holds because λ_1 is the largest generalized eigenvalue. Similarly, by using (v_{λi}^o)^T S_1 v_{λi}^o = λ_i and (v_{λi}^o)^T S_2 v_{λi}^o = 1, we have

(v_{λi}^o)^T A v_{λi}^o = -2 λ_i < 0    (4.45)

So, the linearization matrices A at w_1 = v_{λi}^o, i = 2, ..., m, are not definite, and thus these stationary points are all saddle points and unstable.

The local stability of w_1 = v_{λ1}^o can be shown by the negativeness of the linearization matrix A at w_1 = v_{λ1}^o:

A = [S_1 - 2 S_2 w_1 w_1^T S_1 - (w_1^T S_1 w_1) S_2]_{w_1=v_{λ1}^o} = S_1 - 2 λ_1 S_2 v_{λ1}^o (v_{λ1}^o)^T S_2 - λ_1 S_2    (4.46)
Actually, it is not difficult to verify (4.47):

(v_{λ1}^o)^T A v_{λ1}^o = λ_1 - 2λ_1 - λ_1 = -2λ_1 < 0
(v_{λi}^o)^T A v_{λi}^o = λ_i - λ_1 < 0,   i = 2, ..., m
(v_{λi}^o)^T A v_{λj}^o = 0,   i ≠ j    (4.47)

Since all the generalized eigenvectors v_{λi}^o, i = 1, ..., m, are linearly independent of each other and they span the whole space, any non-zero vector x ∈ R^m can be written as a linear combination of all the generalized eigenvectors v_{λi}^o with at least one coefficient being non-zero; i.e., x = Σ_{i=1}^{m} a_i v_{λi}^o. Thus by (4.47), we have the quadratic form x^T A x as follows:

x^T A x = Σ_{i=1}^{m} a_i^2 (v_{λi}^o)^T A v_{λi}^o < 0    (4.48)

So, all the eigenvalues of the linearization matrix A are negative and thus w_1 = v_{λ1}^o is stable. When f(w_1) = w_1^T P w_1, P = S_2 or (S_1 + S_2), the stability analysis can be similarly obtained. As shown previously, both the P and the S in the constraint v_i^T S v_j^o = 0 have three choices. For the simplicity of exposition, only P = S = S_1 will be used in the rest of this chapter.

It should be noticed that when w_1 converges to v_{λ1}^o, the scalar value f(w_1) = f(v_{λ1}^o) = λ_1. So, f(w_1) can be the estimate of the largest eigenvalue.

C. The Local On-Line Adaptive Rule

When f(w_1) = w_1^T S_1 w_1 is used, (4.40) will be the same as (4.35), the adaptation rule in Chatterjee et al. [Cha97]. However, here the calculations of the Hebbian term H_1(w_1), the anti-Hebbian term -H_2(w_1) and the balancing scalar f(w_1) are all local, avoiding the
direct matrix multiplications in (4.35) and resulting in a drastic reduction in computation. When the exponential window is used to estimate each term in (4.40), we have

w_1(n+1) = w_1(n) + η(n) Δw_1(n)
Δw_1(n) = H_1(w_1, n) - H_2(w_1, n) f(w_1, n)
H_1(w_1, n) = H_1(w_1, n-1) + α [y_11(n) x_1(n) - H_1(w_1, n-1)]
H_2(w_1, n) = H_2(w_1, n-1) + α [y_12(n) x_2(n) - H_2(w_1, n-1)]
f(w_1, n) = f(w_1, n-1) + α [y_11(n)^2 - f(w_1, n-1)]    (4.49)

where the step size η(n) should decrease with the time index n. The number of multiplications required by (4.49) is 8m + 2 (m is the dimension of the input signals), while the number of multiplications required by the method (4.35) of Chatterjee et al. [Cha97] is 6m^2 + 3m.

The convergence of the stochastic algorithm in (4.49) can be shown by the stochastic approximation theory in the same way as in Chatterjee et al. [Cha97]. The simulation results also show convergence when instantaneous values for the Hebbian and anti-Hebbian terms H_1(w_1) and H_2(w_1) are used; i.e.,

w_1(n+1) = w_1(n) + η(n) Δw_1(n)
Δw_1(n) = H_1(w_1, n) - H_2(w_1, n) f(w_1, n)
H_1(w_1, n) = y_11(n) x_1(n)
H_2(w_1, n) = y_12(n) x_2(n)
f(w_1, n) = f(w_1, n-1) + α [y_11(n)^2 - f(w_1, n-1)]    (4.50)

Notice that in both (4.49) and (4.50), when convergence is achieved, the balancing scalar f(w_1, n) will approach its batch mode version f(w_1) = w_1^T S_1 w_1. As shown above, the batch mode scalar f(w_1) will approach the largest generalized eigenvalue λ_1 when w_1 approaches v_{λ1}^o. So, we can conclude that f(w_1, n) → λ_1 and all the quantities in both (4.49) and (4.50) have been fully utilized.
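A sketch of the instantaneous rule (4.50) follows; the signals, step sizes and window parameter α are illustrative assumptions. Note how every quantity in the update is formed from the current inputs and outputs only.

    import numpy as np

    # illustrative sketch; data, step size and alpha are assumptions, not from the dissertation
    rng = np.random.default_rng(5)
    m, N = 4, 30000
    X1 = rng.standard_normal((N, m)) * np.array([2.0, 1.5, 1.0, 0.5])   # signal 1
    X2 = rng.standard_normal((N, m))                                     # signal 2, S2 close to I

    w1 = rng.standard_normal(m)
    f = 1.0                                        # running estimate of the balancing scalar f(w1, n)
    alpha, eta = 0.01, 5e-4
    for n in range(N):
        y11 = w1 @ X1[n]                           # output for input x1(n)
        y12 = w1 @ X2[n]                           # output for input x2(n)
        H1 = y11 * X1[n]                           # instantaneous Hebbian term of (4.50)
        H2 = y12 * X2[n]                           # instantaneous anti-Hebbian term of (4.50)
        f += alpha * (y11 ** 2 - f)                # exponential-window estimate of f(w1, n)
        w1 += eta * (H1 - f * H2)

    print("f(w1, n), which should approach the largest generalized eigenvalue:", f)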
4.5.2 The Proposed Learning Rules for the Other Connections

In this section, the adaptation rules for both the lateral connections and the feedforward connections of the other projections are discussed. For simplicity, only v_2 = c_12 w_1 + w_2 is considered. The other cases are similar. Suppose w_1 has already reached its final position; i.e., w_1 = v_{λ1}^o with (v_{λ1}^o)^T S_1 v_{λ1}^o = λ_1 and (v_{λ1}^o)^T S_2 v_{λ1}^o = 1. Again, we will first discuss the batch mode rule for both the feedforward connection w_2 and the lateral inhibitive connection c_12, then its stability analysis, and finally the corresponding local on-line adaptive rule.

A. The Batch Mode Adaptation Rule

Similar to 4.3.4, the adaptation rule can be described as two parts: the decorrelation and the optimal signal-to-signal ratio search. The decorrelation between the output signals y_11(n) and y_21(n) can be achieved by the anti-Hebbian learning of the inhibitive connection c_12, and the optimal signal-to-signal ratio search can be achieved by a rule similar to that of the previous section 4.5.1 for the feedforward connection w_2. So, we have

Δc_12 = C(w_1, v_2),   c_12 = c_12 - η_c Δc_12
Δw_2 = H_1(v_2) - H_2(v_2) f(v_2),   w_2 = w_2 + η_w Δw_2    (4.51)

where C(w_1, v_2) = w_1^T S_1 v_2 = Σ_n y_11(n) y_21(n) is the cross-correlation between the two output signals y_11(n) and y_21(n), H_1(v_2) = S_1 v_2 = Σ_n y_21(n) x_1(n) is the Hebbian term which will "enhance" the output signal y_21(n), H_2(v_2) = S_2 v_2 = Σ_n y_22(n) x_2(n) is the anti-Hebbian term which will "attenuate" the output signal y_22(n), f(v_2) = v_2^T S_1 v_2 = Σ_n y_21(n)^2 is the scalar playing a balancing role between the Hebbian term H_1(v_2) and the anti-Hebbian term H_2(v_2), η_c is the step size for the decorrelation process, and η_w is the step size for the feedforward adaptation.
First, let's consider the case where w_2 is fixed. Then, as pointed out in 4.1, the lateral inhibition connection c_12 in (4.51) will decorrelate the output signals y_11(n) and y_21(n). In fact, the variation of the cross-correlation is ΔC = w_1^T S_1 (Δv_2) = w_1^T S_1 (-η_c (Δc_12) w_1) = -η_c (w_1^T S_1 w_1) C, and C(n+1) = C(n) + ΔC = (1 - η_c w_1^T S_1 w_1) C(n). If η_c is small enough such that |1 - η_c w_1^T S_1 w_1| < 1, then lim_{n→∞} C(n) = 0. When the decorrelation is achieved, i.e. C = 0, there will be no adjustment in c_12, namely c_12 will remain the same.

Second, let's consider the case with c_12 fixed. Then we have Δv_2 = Δw_2 = H_1(v_2) - H_2(v_2) f(v_2). By the conclusion in the previous section 4.5.1, we know that Δv_2 is in the direction that increases the signal-to-signal ratio J(v_2).

Combining these two points, intuitively, we can say that as long as the step size η_c for the decorrelation process is large enough relative to the step size η_w for the feedforward process, such that the decorrelation process is faster than the feedforward process, then the optimal signal-to-signal ratio search will basically take place within the subspace that is S_1-orthogonal to the first eigenvector, i.e. v_2^T S_1 v_{λ1}^o = 0, and the whole process will converge to the solution; i.e., v_2 → v_{λ2}^o and f(v_2) → λ_2. However, we should notice that v_2 → v_{λ2}^o does not necessarily mean c_12 → 0, which is the case for the APEX model. Actually c_12 can take any value, but the overall projection will converge.

B. The Stability Analysis of the Batch Mode Rule

The stationary points of (4.51) can be obtained by solving both Δc_12 = 0 and Δw_2 = 0. Obviously, v_2 = 0 and v_2 = v_{λi}^o (i = 2, ..., m) are the stationary points of the dynamic process of (4.51). Based on the results in the previous section 4.5.1, it is not difficult to show that v_2 = 0 and v_2 = v_{λi}^o (i = 3, ..., m) are all unstable. Actually, if the initial
state of v_2 is in the subspace orthogonal to S_1 v_{λ1}^o, i.e. v_2^T S_1 v_{λ1}^o = 0 and v_2^T S_2 v_{λ1}^o = (v_2^T S_1 v_{λ1}^o)/λ_1 = 0, then Δc_12 will be 0 and the adjustment of v_2 will be Δv_2 = Δw_2 = S_1 v_2 - (v_2^T S_1 v_2) S_2 v_2, which is also orthogonal to S_1 v_{λ1}^o, i.e. (Δv_2)^T v_{λ1}^o = 0. So, v_2 + η Δv_2 will also be orthogonal to S_1 v_{λ1}^o. This means that once v_2 is in the subspace which satisfies the decorrelation constraint, it will remain in this subspace under the rule in (4.51). In this case, the adaptation of (4.51) will become Δv_2 = S_1 v_2 - (v_2^T S_1 v_2) S_2 v_2 in the subspace orthogonal to S_1 v_{λ1}^o, which is exactly the same as the case of the first projection except that the search is within the subspace orthogonal to S_1 v_{λ1}^o. According to the result in 4.5.1, we know that the stationary points v_2 = 0 and v_2 = v_{λi}^o (i = 3, ..., m) are all unstable even in the subspace.

To show that v_2 = v_{λ2}^o is stable, we can study the overall process Δv_2 = Φ(v_2) = Δw_2 + (η_c/η_w) w_1 (Δc_12). Its corresponding differential equation is

dv_2/dt = Φ(v_2) = S_1 v_2 - (v_2^T S_1 v_2) S_2 v_2 - (η_c/η_w) w_1 w_1^T S_1 v_2    (4.52)

where w_1 = v_{λ1}^o will remain unchanged after the convergence of the first projection. The linearization matrix A of (4.52) at w_1 = v_{λ1}^o and v_2 = v_{λ2}^o is

A = dΦ(v_2)/dv_2 = [S_1 - (η_c/η_w) w_1 w_1^T S_1 - 2 S_2 v_2 v_2^T S_1 - (v_2^T S_1 v_2) S_2]_{w_1=v_{λ1}^o, v_2=v_{λ2}^o}
  = S_1 - (η_c/η_w) v_{λ1}^o (v_{λ1}^o)^T S_1 - 2 λ_2 S_2 v_{λ2}^o (v_{λ2}^o)^T S_2 - λ_2 S_2    (4.53)

As a comparison, the corresponding linearization matrix B of (4.35) of the method in Chatterjee et al. [Cha97] can be similarly obtained:

B = S_1 - S_2 v_{λ1}^o (v_{λ1}^o)^T S_1 - 2 λ_2 S_2 v_{λ2}^o (v_{λ2}^o)^T S_2 - λ_2 S_2    (4.54)
Figure 4-7. The distribution of the real parts of the eigenvalues of A in 1000 trials for signals with dimension 10

Notice that A is not symmetric. So, the eigenvalues of A may be complex. To show the stability of v_2 = v_{λ2}^o for (4.51), we need to show the negativeness of all the real parts of the eigenvalues of the matrix A. Although there is no rigorous proof that the real parts of all the eigenvalues of A are negative (in this case, it is difficult to show the negativeness because A is not symmetric), Monte Carlo trials show the negativeness as long as the step size η_c is large enough relative to the step size η_w. Figure 4-7 shows the results of 1000 trials for randomly generated signals with dimension 10 and the condition that η_c = η_w. As can be seen from the figure, all the real parts of the eigenvalues of A are negative. To compare the proposed method with the one in Chatterjee et al. [Cha97], the eigenvalues of the linearization matrix for the method in Chatterjee et al. [Cha97] (i.e. B in (4.54)) are also calculated. The mean value of the real parts of the 10 eigenvalues of A and
B are calculated for each trial. The mean values are displayed in Figure 4-8, from which we can see that most of the mean values for A are even lower than the corresponding mean values for B, which suggests that most of the real parts of the eigenvalues of A are even smaller than those of B. This indicates that the dynamic process characterized by dv_2/dt = A v_2 will converge faster than the dynamic process characterized by dv_2/dt = B v_2. This may explain the observation that the proposed method usually has a faster convergence speed than the method in Chatterjee et al. [Cha97] in our simulations. Figure 4-9 further shows the mean difference for A and B, i.e. mean(A) - mean(B), from which we can see that all the values are negative, which means that the means of the real parts of the eigenvalues of A are less than those of B.
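The Monte Carlo check described above is straightforward to reproduce in outline. The following sketch draws random positive definite S_1 and S_2 of dimension 10, forms the linearization matrices of (4.53) and (4.54) with η_c = η_w, and reports the largest real part of their eigenvalues over all trials; the construction of the random covariances is an assumption of the example, not the procedure used in the dissertation.

    import numpy as np

    # illustrative sketch; the random-covariance construction is an assumption
    rng = np.random.default_rng(6)
    m, trials = 10, 1000
    worst_A, worst_B = -np.inf, -np.inf
    for _ in range(trials):
        M1 = rng.standard_normal((m, m)); S1 = M1 @ M1.T + 0.1 * np.eye(m)
        M2 = rng.standard_normal((m, m)); S2 = M2 @ M2.T + 0.1 * np.eye(m)
        lam, V = np.linalg.eig(np.linalg.inv(S2) @ S1)         # generalized eigenpairs
        order = np.argsort(lam.real)[::-1]
        lam, V = lam.real[order], V.real[:, order]
        v1 = V[:, 0] / np.sqrt(V[:, 0] @ S2 @ V[:, 0])         # scaling v^T S2 v = 1
        v2 = V[:, 1] / np.sqrt(V[:, 1] @ S2 @ V[:, 1])
        l2 = lam[1]
        common = 2 * l2 * S2 @ np.outer(v2, v2) @ S2 + l2 * S2
        A = S1 - np.outer(v1, v1 @ S1) - common                # (4.53) with eta_c/eta_w = 1
        B = S1 - S2 @ np.outer(v1, v1 @ S1) - common           # (4.54)
        worst_A = max(worst_A, np.linalg.eigvals(A).real.max())
        worst_B = max(worst_B, np.linalg.eigvals(B).real.max())

    print("largest real part over all trials: A", worst_A, " B", worst_B)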
Figure 4-8. Comparison of the means of the real parts of the eigenvalues of A and B in the same trials as in Figure 4-7 (A: the proposed method; B: the method of (4.35) in Chatterjee et al. [Cha97])
Figure 4-9. The difference of the mean real parts of the eigenvalues of A and B

C. The Local On-Line Adaptive Rule

To get an adaptive on-line algorithm, we can again use the exponential window to estimate the terms in (4.51). Thus, we have

Δc_12(n) = C(w_1, v_2, n)
Δw_2(n) = H_1(v_2, n) - H_2(v_2, n) f(v_2, n)
H_1(v_2, n) = H_1(v_2, n-1) + α [y_21(n) x_1(n) - H_1(v_2, n-1)]
H_2(v_2, n) = H_2(v_2, n-1) + α [y_22(n) x_2(n) - H_2(v_2, n-1)]
f(v_2, n) = f(v_2, n-1) + α [y_21(n)^2 - f(v_2, n-1)]
C(w_1, v_2, n) = C(w_1, v_2, n-1) + α [y_11(n) y_21(n) - C(w_1, v_2, n-1)]    (4.55)

where α is a scalar between 0 and 1. The convergence of (4.55) can also be related to the solution of its corresponding deterministic ordinary differential equation, characterized by (4.51), through the stochastic approximation theory [Dia96, Cha97].
The number of multiplications required by the proposed method for the first two projections at each time instant n is 16m + 9, versus 8m^2 + 8m required by the method in (4.35) of Chatterjee et al. [Cha97]. Simulation results also show convergence when instantaneous values are used for H_1(v_2, n), H_2(v_2, n) and C(w_1, v_2, n); i.e.,

Δc_12(n) = C(w_1, v_2, n)
Δw_2(n) = H_1(v_2, n) - H_2(v_2, n) f(v_2, n)
H_1(v_2, n) = y_21(n) x_1(n)
H_2(v_2, n) = y_22(n) x_2(n)
f(v_2, n) = f(v_2, n-1) + α [y_21(n)^2 - f(v_2, n-1)]
C(w_1, v_2, n) = y_11(n) y_21(n)    (4.56)

4.6 Simulations

Two 3-dimensional zero-mean colored Gaussian signals are generated with 500 samples each. Table 4-1 compares the results of the numerical method with those of the proposed adaptive methods after 15000 on-line iterations. In Experiment 1, all the terms in (4.49) and (4.55) are estimated on-line by an exponential window with α = 0.003, but in Experiment 2, all of H_1, H_2 and C use instantaneous values while f(w_1) and f(v_2) remain the same. As an example, Figure 4-10 (a) shows the adaptation process of Experiment 2. Figure 4-10 (b) compares the convergence speed between the proposed method and the method in Chatterjee et al. [Cha97] for the adaptation of v_2 in batch mode when w_1 = v_{λ1}^o. There are 100 trials (each with the same initial condition). The vertical axis is the minimum number of iterations for convergence (with the best step size obtained by exhaustive search). Convergence is claimed when the difference between J(v_2) and J(v_2^o) is less than 0.01 for 10 consecutive iterations.
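For the second projection, the instantaneous rule (4.56), together with the update convention of (4.51), can be sketched as follows; w_1 is taken here directly from a batch solution to stand in for a converged first projection, and the data and step sizes (with η_c larger than η_w, as recommended above) are illustrative assumptions.

    import numpy as np

    # illustrative sketch; data and step sizes are assumptions, not from the dissertation
    rng = np.random.default_rng(7)
    m, N = 3, 50000
    X1 = rng.standard_normal((N, m)) * np.array([2.0, 1.2, 0.6])   # signal 1
    X2 = rng.standard_normal((N, m))                                # signal 2, S2 close to I

    # stand-in for a converged first projection: batch solution scaled so that w1^T S2 w1 = 1
    S1b, S2b = X1.T @ X1 / N, X2.T @ X2 / N
    lam, V = np.linalg.eig(np.linalg.inv(S2b) @ S1b)
    order = np.argsort(lam.real)[::-1]
    w1 = V.real[:, order[0]]
    w1 /= np.sqrt(w1 @ S2b @ w1)

    w2, c12, f = 0.1 * rng.standard_normal(m), 0.0, 1.0
    alpha, eta_w, eta_c = 0.01, 5e-4, 5e-3           # decorrelation faster than feedforward
    for n in range(N):
        v2 = w2 + c12 * w1                           # overall second projection
        y11, y21 = w1 @ X1[n], v2 @ X1[n]            # outputs for input x1(n)
        y22 = v2 @ X2[n]                             # output for input x2(n)
        f += alpha * (y21 ** 2 - f)                  # balancing scalar f(v2, n)
        w2 += eta_w * (y21 * X1[n] - f * y22 * X2[n])   # (4.56): H1 - f * H2
        c12 -= eta_c * (y11 * y21)                   # (4.56) with the sign convention of (4.51)

    v2 = w2 + c12 * w1
    v2_true = V.real[:, order[1]]
    print("alignment with the 2nd generalized eigenvector:",
          abs(v2 @ v2_true) / (np.linalg.norm(v2) * np.linalg.norm(v2_true)))
    print("f versus the 2nd generalized eigenvalue:", f, lam.real[order[1]])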
Figure 4-10 (c) and (d) respectively show a typical evolution of J(v_2) and C in one of the 100 trials, where the eigenvalues of the linearization matrices are -28.3 + 6.7j, -28.3 - 6.7j, -1.5 for A of the proposed method and -21.5, -1.7, -0.4 for B of the method in Chatterjee et al. [Cha97]. Figure 4-11 shows the process of the batch mode rule in (4.51).

4.7 Conclusion and Discussion

In this chapter, the relationship between the Hebbian rule and the energy of the output of a linear transform, and the relationship between the anti-Hebbian rule and the cross-correlation of two outputs connected by a lateral inhibitive connection, are discussed. We can see that an energy quantity is based on the relative position of each sample to the mean of all samples. Thus, each sample can be treated independently and an on-line adaptation rule is relatively easy to derive, while the information potential and the cross information potential are based on the relative position of each pair of data samples, and an on-line adaptation rule for the information potential or the cross information potential is relatively difficult to obtain.

The information-theoretic formulation and the formulation based on energy quantities for the eigendecomposition and the generalized eigendecomposition are introduced. The energy-based formulation can be regarded as a special case of the information-theoretic formulation when data are Gaussian distributed.

Based on the energy formulation for the eigendecomposition and the relationship between the energy criteria and the Hebbian and anti-Hebbian rules, we can understand Oja's rule, Sanger's rule and the APEX model in an intuitive and effective way. Starting from such an understanding, we propose a structure similar to the APEX model and an on-line local adaptive algorithm for the generalized eigendecomposition. The stability analysis of the proposed algorithm is given and the simulation shows the validity and the efficiency of the proposed algorithm.
Based on the information-theoretic formulation, we can generalize the concept of the eigendecomposition and the generalized eigendecomposition by using the entropy difference in 4.2.1. For non-Gaussian data and nonlinear mappings, the information potential can be used to implement the entropy difference to search for an optimal mapping such that the output of the mapping will convey the most information about the first signal x_1(n) while it will contain the least information about the second signal x_2(n) at the same time. This can be regarded as a special case of "information filtering."

Table 4-1. COMPARISON OF RESULTS. J(v_{λ1}^o) and J(v_{λ2}^o) are the generalized eigenvalues; v_{λ1}^o and v_{λ2}^o are the corresponding normalized eigenvectors.

                 Numerical Method   Experiment 1   Experiment 2
J(v_{λ1}^o)      45.9296570         45.9295867     45.9296253
v_{λ1}^o(1)      -0.1546873         -0.1550365     0.1549409
v_{λ1}^o(2)      -0.8400303         -0.8396349     0.8397703
v_{λ1}^o(3)      0.5200200          0.5205544      -0.5203643
J(v_{λ2}^o)      6.1679926          6.1678943      6.1679234
v_{λ2}^o(1)      -0.2162832         -0.2147684     0.2175495
v_{λ2}^o(2)      0.9668235          0.9672048      -0.9664919
v_{λ2}^o(3)      0.1359184          0.1356071      -0.1362553
Figure 4-10. (a) Evolution of J(v_1) and J(v_2) in Experiment 2. (b) Comparison of convergence speed in terms of the minimum number of iterations over the 100 trials. (c) Typical adaptation curve of J(v_2) for the two methods when the initial condition is the same and the best step size is used. (d) Typical adaptation curve of C in the same trial as (c). In (b), (c) and (d), the solid lines represent the proposed method while the dashed lines represent the method in Chatterjee et al. [Cha97]. (Panel titles: "Adaptation Process," "Comparison of Convergence on 100 Trials," "Comparison of the Evolution of J(v2)," "Comparison of Cross-Correlation.")
Figure 4-11. The Evolution Process of the Batch Mode Rule (curves shown: J(v_1), f(v_1), J(v_2), f(v_2))
CHAPTER 5
APPLICATIONS
5.1 Aspect Angle Estimation for SAR Imagery
5.1.1 Problem Description
The relative direction of a vehicle with respect to the radar sensor in SAR (synthetic
aperture radar) imagery is normally called the aspect angle of the observation, which is an
important piece of information for vehicle recognition. Figure 5-1 shows typical SAR
images of a tank or military personnel carrier with different aspect angles.
Figure 5-1. SAR Images of a Tank with Different Aspect Angles
We are given some training data (both SAR images and the corresponding true aspect angles). The problem is to estimate the aspect angle of the vehicle in a testing SAR image based on the information given in the training data. This is a very typical problem of "learning from examples." As can be seen from Figure 5-1, the poor resolution of SAR combined with speckle and the variability of scattering centers makes the determination of the aspect angle of a vehicle from its SAR image a nontrivial problem. All the data in the experiments are from the MSTAR public release database [Ved97].

5.1.2 Problem Formulation

Let's use X to denote a SAR image. In the MSTAR database [Ved97], a target chip is usually 128-by-128. So, X can usually be regarded as a vector with dimension 128 × 128 = 16384. Or, we can just use the 80 × 80 = 6400 center region of X, since a target is located in the center of each image in the MSTAR database. Let's use A to denote the aspect angle of a target SAR image. Then, the given training data set can be denoted by (x_i, a_i), i = 1, ..., N (the upper case X and A represent random variables and the lower case x and a represent their samples).

In general, for a given image x, the aspect angle estimation problem can be formulated as a maximum a posteriori probability (MAP) problem:

â = argmax_a f_{A|X}(a|x) = argmax_a f_{AX}(x, a)/f_X(x) = argmax_a f_{AX}(x, a)    (5.1)

where â is the estimate of the true aspect angle, f_{A|X}(a|x) is the a posteriori probability density function (pdf) of the aspect angle A given X, f_X(x) is the pdf of the image X, and f_{AX}(x, a) is the joint pdf of the image X and the aspect angle A. So, the key issue here is to
estimate the joint pdf f_{AX}(x, a). However, the very high dimensionality of the image variable X makes it very difficult to obtain a reliable estimation. Dimensionality reduction (or feature extraction) becomes necessary. An "information filter" y = q(x, w) (where w is the parameter set) is needed such that when an image x is the input, its output y can convey the most information about the aspect angle and discard all the other irrelevant information. Such an output is the feature for the aspect angle. Based on this feature variable Y, the aspect angle estimation problem can be reformulated by the same MAP strategy:

â = argmax_a f_{AY}(y, a),   y = q(x, w)    (5.2)

where f_{AY}(y, a) is the joint pdf of the feature Y and the aspect angle A.

The crucial point for this aspect angle estimation scheme is how good the feature Y turns out to be. Actually, the problem of reliable pdf estimation in a high dimensional space is now converted to the problem of building a reliable aspect angle "information filter" based only on the given training data set. To achieve this goal, the mutual information is used and the problem of finding an optimal "information filter" can be formulated as

w_optimal = argmax_w I(Y = q(X, w), A)    (5.3)

that is, to find the optimal parameter set w_optimal such that the mutual information between the feature Y and the angle A is maximized. To implement this idea, the quadratic mutual information I_ED based on the Euclidean distance and its corresponding cross information potential V_ED between the feature Y and the angle A will be used. There will be no assumption made on either the data or the "information filter." The only thing used here will be the training data set itself. In the experiments, it is found that a linear mapping with
two outputs is good enough for the aspect angle information filter (Y = (Y_1, Y_2)^T). The system diagram is shown below.

Figure 5-2. System Diagram for Aspect Angle Information Filter (image X → information filter → feature Y; the angles A and the features feed the cross information potential field, whose information forces are back-propagated through the filter)

One may notice that the joint pdf f_{AY}(y, a) is the natural "by-product" of this scheme. Recall that the cross information potential is based on the Parzen window estimation of the joint pdf f_{AY}(y, a). So, there is no need to further estimate the joint pdf f_{AY}(y, a) by any other method.

Since the angle variable A is a periodic one, e.g. 0 should be the same as 360, all the angles are put on the unit circle; i.e., the following transformation is used.

A_1 = cos(A)
A_2 = sin(A)    (5.4)

So, the actual angle variable used is Λ = (A_1, A_2), a two dimensional variable.
In the experiment, it is also found that the discrimination between two angles with a 180 degree difference is very difficult. Actually, it can be seen from Figure 5-1 that it is difficult to tell where the front and where the back of a vehicle are, although the overall direction of the vehicle is clear to our eyes. Most of the experiments are just to estimate the angle within 180 degrees, e.g. 240 degrees will be treated as 240 - 180 = 60 degrees. Actually, the following transformation is used in this case.

A_1 = cos(2A)
A_2 = sin(2A)    (5.5)

In this case the actual angle variable is Λ = (A_1, A_2). Correspondingly, the estimated angles will be divided by 2.

Since the joint pdf is f_{AY}(y, a) = (1/N) Σ_{i=1}^{N} G(y - y_i, σ_y^2) G(a - a_i, σ_a^2), where σ_y^2 is the variance of the Gaussian kernel for the feature Y, σ_a^2 is the variance of the Gaussian kernel for the actual angle Λ, and all the angle data a_i lie on the unit circle, the search for the optimal angle â = argmax_a f_{AY}(y, a), y = q(x, w), can be implemented by scanning the unit circle in the (A_1, A_2) plane. The real estimated angle can then be taken as â/2 for the case where the 180 degree difference is ignored.
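The scanning search described above can be sketched as follows; the kernel variances are set to 0.1 as in the experiments reported below, while the feature data, grid resolution and function names are assumptions of the example.

    import numpy as np

    # illustrative sketch; function names, data and grid are assumptions
    def gauss(d2, sigma2):
        """Isotropic Gaussian kernel value for squared distance d2."""
        return np.exp(-d2 / (2.0 * sigma2))

    def estimate_angle(y, Y_train, ang_train_deg, sigma_y2=0.1, sigma_a2=0.1, fold180=True):
        """Scan the unit circle in the (A1, A2) plane for argmax_a f_AY(y, a), per (5.4)/(5.5)."""
        k = 2.0 if fold180 else 1.0                       # (5.5) doubles the angle
        a_tr = np.deg2rad(k * ang_train_deg)
        A_tr = np.stack([np.cos(a_tr), np.sin(a_tr)], axis=1)      # training angles on the circle
        grid = np.deg2rad(np.arange(0.0, 360.0, 0.5))
        A_grid = np.stack([np.cos(grid), np.sin(grid)], axis=1)
        # Parzen estimate of the joint pdf at (y, a) for every grid angle
        ky = gauss(np.sum((Y_train - y) ** 2, axis=1), sigma_y2)                       # N values
        ka = gauss(np.sum((A_grid[:, None, :] - A_tr[None, :, :]) ** 2, axis=2), sigma_a2)  # G x N
        f = ka @ ky                                       # joint pdf (up to 1/N) on the grid
        return np.rad2deg(grid[np.argmax(f)]) / k         # divide by 2 when angles were doubled

    # toy usage: features already lie on a circle indexed by the (folded) angle
    angles = np.arange(0.0, 180.0, 3.5)
    Y_train = np.stack([np.cos(np.deg2rad(2 * angles)), np.sin(np.deg2rad(2 * angles))], axis=1)
    y_test = Y_train[10] + 0.05
    print(estimate_angle(y_test, Y_train, angles), "versus true angle", angles[10])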
5.1.3 Experiments of Aspect Angle Estimation

There are three classes of vehicles with some different configurations. In total, there are 7 different vehicle types: BMP2_C21, BMP2_9563, BMP2_9566, BTR70_C71, T72_132, T72_812 and T72_S7.

To use the ED-CIP to implement the mutual information, the kernel sizes σ_y^2 and σ_a^2 have to be determined. The experiments show that the training process and the performance
are not sensitive to them. The typical values are σ_y^2 = 0.1 and σ_a^2 = 0.1. There will be no big performance difference if σ_y^2 = 0.01 or σ_y^2 = 1.0 or σ_a^2 = 1.0 is used. The step size is usually around 1.5 × 10^{-8}. It can be adjusted according to the training process.

Figure 5-3. Training: BMP2_C21 (0-180 degree); Testing: BMP2_C21 (0-180 degree). Error Mean: 3.45 (degree); Error Deviation: 2.58 (degree). (Left: output data (angle feature) distribution in the (Y_1, Y_2) plane, diamonds--training data, triangles--testing data. Right: estimated angles versus the true values (solid line).)

Figure 5-3 shows a typical result. The training data are chosen from BMP2_C21 within the angle range from 0 to 180 degrees, in total 53 images and their corresponding angles with an approximately 3.5 degree difference between each neighboring angle pair. The testing data are from the same vehicle in the same degree range 0-180 but not included in the training data set. The left graph shows the output data distribution for both training and testing data. It can be seen that the training data form a circle, the best way to represent angles. The testing images are first fed into the information filter to obtain the features. The triangles in the left graph of Figure 5-3 indicate these features. The aspect angles are then estimated according to the method described above. The right graph shows
the comparison between the estimated angles (the dots indicated by x) and the true values (solid line); the testing images are sorted according to their true aspect angles.
Figure 5-4. Training: BMP2_C21 (0-360 degree); Testing: BMP2_C21 (0-360 degree) Error Mean: 12.40 (degree); Error Deviation: 20.56 (degree)
Figure 5-5. Training: BMP2_C21 (0-180 degree); Testing: T72_S7 (0-360 degree) Error Mean: 6.18 (degree); Error Deviation: 5.19 (degree)
Figure 5-4 shows the result of training on the same BMP2_C21 vehicle but with the angle range from 0 to 360 degrees. Testing is done on the same BMP2_C21 within the same angle range (0 to 360), but none of the testing data are included in the training data set. As can be seen, the results become worse due to the difficulty of telling the difference between two images with a 180 degree angle difference. The figure also shows that the major error occurs when the 180 degree difference can not be correctly recognized (the big errors in the figure are about 180 degrees).

Figure 5-5 shows the result of training on the personnel carrier BMP2_C21 within the range of 180 degrees but testing on the tank T72_S7 within the same range (0-180 degrees). The tank is quite different from the personnel carrier because the tank has a cannon while the carrier hasn't. The good result indicates the robustness and the good generalization ability of the method. The following two experiments will further give us an overall idea of the performance of the method, and they further confirm the robustness and the good generalization ability of the method. Inspired by the result of the method, we apply the traditional MSE criterion by putting the desired angles on the unit circle in the same way as above. The results are shown below, from which we can see that both methods have a comparable performance but the ED-CIP method converges faster than the MSE method.

In Experiment 1, the training is based on 53 images from BMP2_C21 within the range of 180 degrees. The results are shown in Table 5-1. The testing set "bmp2_c21_t1" means the vehicle bmp2_c21 within the range of 0-180 degrees but not included in the training data set, the set "bmp2_c21_t2" means the vehicle bmp2_c21 within the range of 180-360 degrees but with the 180 degree difference ignored in the estimation, the set "t72_132_tr" means the vehicle t72_132 which will be used for training in Experiment
2, and the set "t72_132_te" means the vehicle t72_132 but not included in the set "t72_132_tr."

Table 5-1. The Result of Experiment 1; Training on bmp2_c21_tr (53 images) (0-180)

Vehicle        ED-CIP: error mean (error deviation)    MSE: error mean (error deviation)
bmp2_c21_tr    0.54 (0.40)                             1.05e-5 (8.293e-6)
bmp2_c21_t1    2.76 (2.37)                             2.48 (2.12)
bmp2_c21_t2    2.63 (2.10)                             2.79 (2.43)
t72_132_tr     7.12 (5.36)                             7.42 (5.12)
t72_132_te     4.75 (3.21)                             4.09 (3.02)
bmp2_9563      4.25 (3.62)                             3.77 (3.16)
bmp2_9566      3.81 (3.16)                             3.60 (2.97)
btr70_c71      3.18 (2.84)                             2.88 (2.47)
t72_s7         6.65 (5.04)                             6.95 (5.27)

Table 5-2. The Result of Experiment 2; Training on bmp2_c21_tr and t72_132_tr (0-180)

Vehicle        ED-CIP: error mean (error deviation)    MSE: error mean (error deviation)
bmp2_c21_tr    1.99 (1.52)                             0.18 (0.14)
bmp2_c21_te    2.96 (2.41)                             0.18 (0.11)
t72_132_tr     1.97 (1.48)                             0.17 (0.13)
t72_132_te     3.01 (2.66)                             0.17 (0.13)
bmp2_9563      2.97 (2.35)                             2.54 (1.90)
bmp2_9566      3.32 (2.44)                             2.80 (2.19)
btr70_c71      2.80 (2.33)                             2.42 (1.83)
t72_s7         3.80 (2.57)                             3.38 (2.40)
In Experiment 2, training is based on the data set "bmp2_c21_tr" and the data set "t72_132_tr." The experimental results are shown in Table 5-2, from which we can see the improvement of the performance when more vehicles and more data are included in the training process.

More experimental results can be found in the paper [XuD98] and the reports of the DARPA project on Image Understanding (the reports can be found at the web site http://www.cnel.ufl.edu/~atr/). From the experimental results, we can see that the error mean is around 3 degrees. This is reasonable because the angles of the training data are approximately 3 degrees apart between neighboring angles.

Figure 5-6. Occlusion Test with Background Noise. The images corresponding to (a), (b), (c), (d), (e) and (f) are shown in Figure 5-7. (Left: output data (angle feature) distribution, diamonds--training data, triangles--testing data. Right: estimated angles and the true values (solid line).)
Figure 5-7. The occluded images corresponding to the points in Figure 5-6
5.1.4 Occlusion Test on Aspect Angle Estimation

To further test the robustness and the generalization ability of the method, occlusion tests are conducted, where the testing input SAR images are contaminated by background noise or the vehicle image is occluded by the SAR image of trees.

Figure 5-6 shows the result of the "Occlusion Test," where a squared window with background noise enlarges gradually until the whole image is occluded and replaced by the background noise, as shown in Figure 5-1 and Figure 5-7. Figure 5-7 shows the occluded images corresponding to the points in Figure 5-6. We can see that even when most of the target is occluded, the estimation is still good, which simply verifies the robustness and the generalization ability of the method. When the occluding square enlarges, the output point (feature point) moves away from the circle, but the direction is essentially perpendicular to the circle, which means the nearest point on the circle is essentially unchanged and the estimation of the angle basically remains the same.
Figure 5-8. SAR Image of Trees. The squared region was cut for the occlusion purpose
Figure 5-8 is a SAR image of trees. One region was cut to occlude the target images to
see how robust the method is and how good the generalization can be made by the method.
As shown in Figure 5-10 and Figure 5-11, the cut region of trees is slid over the target
image from the lower right corner to the upper left corner. The occlusion is made by aver-
aging the overlapped target pixels and tree pixels. Figure 5-10 shows two particular occlusions; in the right one, most of the target is occluded but the estimation is still good. Figure 5-9 shows the overall results when sliding the occlusion square region.
One may notice that the result gets better when the whole image is overlapped by the tree
image. The explanation is that the occlusion is the average of both the target pixels and the
tree pixels in this case, and the center region of the tree image has small pixel values while
the center region of the target image has large pixel values, therefore, when the whole tar-
get image is overlapped by the tree image, the occlusion of the target (the center region of
the target image) becomes even lighter.
Figure 5-9. Occlusion Test with SAR Image of Trees. The images corresponding to the points (a) and (b) are shown in Figure 5-10. The images corresponding to the points (c)
and (d) are shown in Figure 5-11.
Figure 5-10. Occlusion with SAR Image of Trees. Output data distribution (Diamond: training data; Triangle: testing data). Upper images are the occluded images. Lower images show the occluded regions. The true angle is 101.19. (a) Estimated Angle: 100.6; (b) Estimated Angle: 105.2.

Figure 5-11. Occlusion with SAR Image of Trees. Output data distribution (Diamond: training data; Triangle: testing data). Upper images are the occluded images. Lower images show the occluded regions. The true angle is 101.19. (c) Estimated Angle: 160.6; (d) Estimated Angle: 99.6.
5.2 Automatic Target Recognition (ATR)

In this section, we will see how important the mutual information is for the performance of pattern recognition, and how the cross information potential can be applied to automatic target recognition in SAR imagery.

First, let's look at the lower bound on the recognition error specified by Fano's inequality [Fis97]:

P(ĉ ≠ c) ≥ (H_s(c|y) - 1) / log(Θ(c))    (5.6)

where c is a variable for the identity of the classes, y is a feature variable based on which a classification will be conducted, Θ(c) denotes the number of classes, and H_s(c|y) is Shannon's conditional entropy of c given y. Fano's inequality means the classification error is lower bounded by a quantity which is determined by the conditional entropy of the class identity given the recognition feature y. By a simple manipulation (using H_s(c|y) = H_s(c) - I(c, y)), we get

P(ĉ ≠ c) ≥ (H_s(c) - I(c, y) - 1) / log(Θ(c))    (5.7)

which means that to minimize the lower bound of the error probability, the mutual information between the class identity c and the feature y should be maximized.
the class identity. We are given a set of training images and their corresponding clas
tities . A classifier need to be established based only on this trai
data set such that when given a target image , it can classify the image. Again, the
lem can be formulated as a MAP problem:
P c c≠( )Hs c y( ) 1–
Θ c( )( )log----------------------------≥
c y
Θ c( ) Hs c y( )
c y
y
P c c≠( )Hs c( ) I c y,( )– 1–
Θ c( )( )log--------------------------------------------≥
c y
X C
xi ci,( ) i 1 … N, ,=
x
153
ĉ = argmax_c P_{C|X}(c|x) = argmax_c f_{CX}(x, c)    (5.8)

where P_{C|X}(c|x) is the a posteriori probability of the class identity C given the image X, and f_{CX}(x, c) is the joint pdf of the image X and the class identity C. So, similarly, the key issue here is to estimate the joint pdf f_{CX}(x, c). However, the very high dimensionality of the image variable X makes it very difficult to obtain a reliable estimation. Dimensionality reduction (or feature extraction) again is necessary. An "information filter" y = q(x, w) (where w is the parameter set) is needed such that when an image x is its input, its output y can convey the most information about the class identity and discard all the other irrelevant information. Such an output is the feature for classification. Based on the classification feature y, the classification problem can be reformulated by the same MAP strategy:

ĉ = argmax_c f_{CY}(y, c),   y = q(x, w)    (5.9)

where f_{CY}(y, c) is the joint pdf of the classification feature Y and the class identity C.

Similar to the aspect angle estimation problem, the crucial point for this classification scheme is how good the classification feature is. Actually, the problem of reliable pdf estimation in a high dimensional space is now converted to the problem of building a reliable "information filter" for classification based only on the given training data set. To achieve this goal, the information measure of the mutual information is used, as also suggested by Fano's inequality, and the problem of finding an optimal "information filter" can be formulated as

w_optimal = argmax_w I(Y = q(X, w), C)    (5.10)

that is, to find the optimal parameter set w_optimal such that the mutual information between the classification feature Y and the class identity C is maximized. To implement this idea,
the quadratic mutual information based on Euclidean distance and its corresponding cross information potential will be used again. There will be no assumption made on either the data or the "information filter." The only thing used here will be the training data set itself. In the experiments, it is found that a linear mapping with 3 outputs for the 3 classes is good enough for the classification of such high dimensional images (80 by 80). The system diagram is shown in Figure 5-12.

Figure 5-12. System Diagram for the Classification Information Filter. The image X is mapped to the feature y; the cross information potential field between y and the class identity C produces information forces that are back-propagated to adjust the mapping.

The joint pdf f_{CY}(y, c) is still the natural "by-product" of this scheme. Actually, the cross information potential is based on the Parzen window estimation of the joint pdf:

f_{CY}(y, c) = \frac{1}{N} \sum_{i=1}^{N} G(y - y_i, \sigma_y^2) \delta(c - c_i)     (5.11)

where \sigma_y^2 is the variance of the Gaussian kernel function for the feature variable y, and \delta(c - c_i) is the Kronecker delta function; i.e.,
\delta(c - c_i) = \{ 1, \; c = c_i; \;\; 0, \; \text{otherwise} \}     (5.12)

So, there is no need to estimate the joint pdf f_{CY}(y, c) again by any other method. The ED-QMI information force in this particular case can be interpreted as repulsion among the "information particles" (IPTs) with different class identities, and attraction with each other among the IPTs within the same class.

Based on the joint pdf f_{CY}(y, c), the Bayes classifier can be built up:

c = \arg\max_c f_{CY}(y, c), \quad y = q(x, w)     (5.13)

Since the class identity variable C is discrete, the search for the maximum in (5.13) can simply be implemented by comparing each value of f_{CY}(y, c).
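To make (5.11)-(5.13) concrete, here is a minimal numpy sketch of the Parzen-window joint pdf estimate and the resulting Bayes decision. It is an illustration only: the function names, the scalar/product-kernel handling and the default kernel size are assumptions, not the exact code used for the experiments.

```python
import numpy as np

def gaussian_kernel(u, sigma2):
    # One-dimensional Gaussian kernel G(u, sigma^2) used as the Parzen window.
    return np.exp(-u**2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

def joint_pdf(y, c, y_train, c_train, sigma2=0.1):
    # Parzen estimate of f_CY(y, c) as in (5.11): only training samples
    # whose label equals c contribute (the Kronecker delta in (5.12)).
    diff = y - y_train[c_train == c]
    if diff.ndim == 1:                                   # scalar features
        k = gaussian_kernel(diff, sigma2)
    else:                                                # vector features: product kernel
        k = np.prod(gaussian_kernel(diff, sigma2), axis=1)
    return k.sum() / len(y_train)

def classify(y, y_train, c_train, sigma2=0.1):
    # Bayes classifier (5.13): compare f_CY(y, c) for each discrete class.
    classes = np.unique(c_train)
    scores = [joint_pdf(y, c, y_train, c_train, sigma2) for c in classes]
    return classes[int(np.argmax(scores))]
```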
5.2.2 Experiment and Result
The experiment is conducted on the MSTAR database [Ved97]. There are three classes (vehicles): BMP2, BTR70 and T72. For each one, there are some different configurations (sub-classes) as shown below. There are also 2 types of confuser.

BMP2---------BMP2_C21, BMP2_9563, BMP2_9566.
BTR70--------BTR70_C71.
T72-----------T72_132, T72_S7, T72_812.
Confuser-------2S1, D7.

The training data set is composed of 3 types of vehicle: BMP2_C21, BTR70_C71 and T72_132 with depression angle 17 degrees. All the testing data have 15 degrees depression angle. The classifier is built within the range of 0-30 degrees aspect angle. The final goal is to combine the result of aspect angle estimation with the target recognition such that, with
the aspect angle information, the difficult overall recognition task (with all aspect angles)
can be divided and conquered. Since a SAR image of a target is based on the reflection of
the target, different aspect angles may result in quite different characteristics for SAR
imagery. So, organizing classifiers according to aspect angle information is a good strat-
egy.
Figure 5-13 shows the images for training. The classification feature extractor has three outputs. For illustration purposes, 2 outputs are used in Figure 5-14, Figure 5-15 and Figure 5-16 to show the output data distribution. Figure 5-14 shows the initial state with the 3 classes mixed up. Figure 5-15 shows the result after several iterations, where the classes are starting to separate. Figure 5-16 shows the output data distribution at the final stage of the training, where the 3 classes are clearly separated and each class tends to shrink to one point.
Figure 5-13. The SAR Images of Three Vehicles for Training Classifier (0-30 degree)
Figure 5-14. Initial Output Data Distribution for Classification. Left graph: lines are an illustration of "information forces;" right graph: detailed distribution.

Figure 5-15. Intermediate Output Data Distribution for Classification. Left graph: lines are an illustration of "information forces;" right graph: detailed distribution.
Figure 5-16. Output Data Distribution at Final Stage for Classification. Left graph: lines are an illustration of "information forces;" right graph: detailed distribution.

Table 5-3 shows the classification result. With a limited number of training data, the classifier still shows a very good generalization ability. By setting a threshold to allow 10% rejection, a detection test is further conducted on all these data and the data for two other confusers. A good result is shown in Table 5-4.
Table 5-3. Confusion Matrix for Classification by ED-CIP
BMP2 BTR70 T72
BMP2_C21 18 0 0
BMP2_9563 11 0 0
BMP2_9566 15 0 0
BTR70_C71 0 17 0
T72_132 0 0 18
T72_812 0 2 9
T72_S7 0 0 15
The results in Table 5-3 and Table 5-4 are obtained by using kernel size \sigma_y^2 = 0.1 and step size 5.0 \times 10^{-5}. As a comparison, Table 5-5 and Table 5-6 give the corresponding results of the support vector machine (more detailed results are presented in the 1998 Image Understanding Workshop [Pri98]), from which we can see that the classification result of ED-CIP is even better than that of the support vector machine.
Table 5-4. Confusion Matrix for Detection (with detection probability=0.9) (ED-CIP)
BMP2 BTR70 T72 Reject
BMP2_C21 18 0 0 0
BMP2_9563 11 0 0 2
BMP2_9566 15 0 0 2
BTR70_C71 0 17 0 0
T72_132 0 0 18 0
T72_812 0 2 9 7
T72_S7 0 0 15 0
2S1 0 3 0 24
D7 0 1 0 14
Table 5-5. Confusion Matrix for Classification by Support Vector Machine (SVM)
BMP2 BTR70 T72
BMP2_C21 18 0 0
BMP2_9563 11 0 0
BMP2_9566 15 0 0
BTR70_C71 0 17 0
T72_132 0 0 18
T72_812 5 2 4
T72_S7 0 0 15
Table 5-6. Confusion Matrix for Detection (with detection probability=0.9) (SVM)
BMP2 BTR70 T72 Reject
BMP2_C21 18 0 0 0
BMP2_9563 11 0 0 2
BMP2_9566 15 0 0 2
BTR70_C71 0 17 0 0
T72_132 0 0 18 0
T72_812 0 1 2 8
T72_S7 0 0 12 3
2S1 0 0 0 27
D7 0 0 0 16

5.3 Training MLP Layer-by-Layer with CIP

During the first neural network era, which ended in the 1970s, there was only Rosenblatt's algorithm [Ros58, Ros62] to train the one-layer perceptron, and there was no known algorithm to train MLPs. However, the much higher computational power of the MLP compared with the perceptron was recognized in that period of time [Min69]. In the late 1980s, the back-propagation algorithm was introduced to train MLPs, contributing to the revival of neural computation. Ever since, the back-propagation algorithm has been used almost exclusively to train MLPs, to the point that some researchers even confuse the network topology with the training algorithm by calling MLPs "back-propagation networks." It has been widely accepted that training the hidden layers requires back-propagation of errors from the output layers.

As pointed out in Chapter 3, Linsker's InfoMax can be further extended to a more general case. The MLP network can be regarded as a communication channel or "information
filter" for each layer. The goal of the training of such a network is to transmit as much information about the desired signal as possible at the output of each layer. As shown in (3.16), this can be implemented by maximizing the mutual information between the output of each layer and the desired signal. Notice that we are not using the back-propagation of errors across layers. The network is incrementally trained in a strictly feedforward way, from the input layer to the output layer. This may seem impossible since we are not using the information of the top layer to train the input layer. The training in this way simply guarantees that the maximum possible information about the desired signal is transferred from the input layer to each layer. The cross information potential can provide an explicit immediate response to each network layer without the need to backpropagate from the output layer.

To test the method, the "frequency doubler" problem is selected, which is representative of nonlinear temporal processing. The input signal is a sinewave and the desired output signal is still a sinewave but with the frequency doubled. A focused TDNN with one hidden layer is used: there is one input node with 5 delay taps, two nodes in the hidden layer with tanh nonlinearities, and one linear output node (as shown in Figure 5-17). The ED-QMI or ED-CIP is used for training. The hidden layer is trained first, followed by the output layer. The training curves are shown in Figure 5-18. The outputs of the hidden nodes and the output node after training are shown in Figure 5-19, which tells us that the frequency of the final output is doubled. The kernel sizes for the training of both the hidden layer and the output layer are \sigma_y^2 = 0.01 for the output of each layer and \sigma_d^2 = 0.01 for the desired signal.
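For reference, the ED-QMI criterion used above can be estimated directly from samples through pairwise Gaussian interactions. The sketch below is a plausible reading of such an estimator (the joint/marginal/cross decomposition, the names and the default kernel size are assumptions), using the fact from Appendix A that integrating the product of two Gaussian kernels yields a Gaussian with the summed variances.

```python
import numpy as np

def gaussian(u, sigma2):
    # Scalar Gaussian kernel G(u, sigma^2).
    return np.exp(-u**2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

def ed_qmi(y, d, sigma2=0.01):
    # Sample estimator of the Euclidean-distance QMI between two scalar
    # sample sets y and d (e.g., a layer output and the desired signal).
    # Pairwise interactions use variance 2*sigma2 because integrating the
    # product of two Parzen kernels doubles the variance (Appendix A).
    Ky = gaussian(y[:, None] - y[None, :], 2.0 * sigma2)   # N x N interactions in y
    Kd = gaussian(d[:, None] - d[None, :], 2.0 * sigma2)   # N x N interactions in d
    v_joint = np.mean(Ky * Kd)                             # joint-field potential
    v_marginal = np.mean(Ky) * np.mean(Kd)                 # product-of-marginals potential
    v_cross = np.mean(Ky.mean(axis=1) * Kd.mean(axis=1))   # cross potential
    return v_joint + v_marginal - 2.0 * v_cross
```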
This problem can also be solved with the MSE criterion and the BP algorithm, and the error may even be smaller. So, the point here is not to use CIP as a substitute for BP in MLP training. It is an illustration that the BP algorithm is not the only possible way to train networks with hidden layers.

From the experimental results, we can see that even without the involvement of the output layer, CIP can still guide the hidden layer to learn what is needed. The plot of the two hidden node outputs already reveals the doubled frequency, which means the hidden nodes represent the desired output as well as possible given the transformation of the input. The output layer simply selects what is needed. These results, on the other hand, further confirm the validity of the proposed CIP method.

From the training curves, we can see sharp increases in CIP, which suggest that the step size should be varied and adapted during the training process. How to choose the kernel size of the Gaussian function in the CIP method is still an open problem. For these results, it was determined experimentally.
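Since the kernel size is chosen experimentally here, one hedged practical note: a classical default from the density estimation literature [Sil86] could serve as a starting point. The sketch below is such a heuristic, not the rule used in this dissertation.

```python
import numpy as np

def silverman_sigma2(y):
    # Silverman's rule-of-thumb bandwidth for a 1-D Gaussian kernel,
    # returned as a variance; purely a heuristic starting point.
    h = 1.06 * np.std(y) * len(y) ** (-1.0 / 5.0)
    return h**2
```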
Figure 5-17. TDNN as a Frequency Doubler. The input signal X feeds a tapped delay line (five z^{-1} elements) into the hidden layer Y and the linear output node Z; the desired signal is the frequency-doubled sinewave.
Figure 5-18. Training Curves, CIP vs. iterations, for the hidden layer and the output layer.

Figure 5-19. The outputs of the nodes after training: the first hidden node, the second hidden node, the two hidden node outputs plotted together, and the output of the network.
5.4 Blind Source Separation and Independent Component Analysis

5.4.1 Problem Description and Formulation

Blind source separation is a specific case of ICA. The observed data X = AS is a linear mixture (A \in R^{m \times m} is non-singular) of independent source signals S = (S_1, ..., S_m)^T (the S_i are independent of each other). There is no further information about the sources and the mixing matrix. This is why it is called "blind." The problem is to find a projection W \in R^{m \times m}, Y = WX, so that Y = S up to a permutation and scaling. Comon [Com94] and Cao and Liu [Cao96], among others, have already shown that this result will be obtained for a linear mixture when the outputs are independent of each other.

Based on the IP or CIP criteria, the problem can be re-stated as finding a projection W \in R^{m \times m}, Y = WX, so that the IP is minimized (maximum quadratic entropy) or the CIP is minimized (minimum QMI). The system diagram is shown in Figure 5-20. The different cases will be discussed in the following sections.

Figure 5-20. The System Diagram for BSS with IP or CIP. The mixture x is projected to the output y; the IP or CIP field produces information forces that are back-propagated to adjust the demixing matrix.
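The adaptation of W in Figure 5-20 proceeds by back-propagating information forces. For readers who want a quick numerical stand-in, the toy sketch below (an assumption, not the algorithm used here) minimizes the ED-QMI between the two outputs with finite-difference gradients, reusing the ed_qmi sketch given earlier and keeping the rows of W at unit norm to exclude the degenerate solution W = 0.

```python
import numpy as np

def demix_by_qmi(x, steps=500, lr=0.05, sigma2=0.1, eps=1e-4):
    # x: 2 x N array of (whitened) mixtures; returns a 2x2 demixing matrix.
    W = np.eye(2)
    def cost(W):
        y = W @ x
        return ed_qmi(y[0], y[1], sigma2)   # independence measure of the outputs
    for _ in range(steps):
        g = np.zeros_like(W)
        base = cost(W)
        for i in range(2):
            for j in range(2):
                Wp = W.copy()
                Wp[i, j] += eps
                g[i, j] = (cost(Wp) - base) / eps   # finite-difference gradient
        W -= lr * g
        W /= np.linalg.norm(W, axis=1, keepdims=True)  # unit-norm rows
    return W
```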
5.4.2 Blind Source Separation with CS-QMI (CS-CIP)

As introduced in Chapter 2, CS-QMI can be used as an independence measure. Its corresponding cross information potential, CS-CIP, will be used here for blind source separation. For ease of illustration, only the 2-source-2-sensor problem is tested. Two experiments are presented here.
Figure 5-21. Data Distribution for Experiment 1: source distribution, mixed signal distribution, and recovered signal distribution.

Figure 5-22. Training Curve for Experiment 1. SNR (dB) vs. iterations.
Experiment 1 tests the performance of the method on a very sparse data set. Two different colored Gaussian noise segments are used as sources, with 30 data points for each segment. The data distributions for the source signals, mixed signals and recovered signals are plotted in Figure 5-21. Figure 5-22 is the training curve, which shows how the SNR of the demixing-mixing product matrix WA changes with iteration (the SNR approaches 36.73 dB). Both figures show that the method works well.
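The training curves report the SNR of the demixing-mixing product matrix WA. The text does not restate the exact formula, so the sketch below uses one plausible definition (dominant entry of each row as signal, the rest as crosstalk); the name and the formula are assumptions.

```python
import numpy as np

def demixing_snr_db(W, A):
    # Per-output SNR of P = WA: for perfect separation P is a scaled
    # permutation, so off-dominant entries of each row measure leakage.
    P = W @ A
    snrs = []
    for row in P:
        p2 = row**2
        signal = p2.max()
        leakage = p2.sum() - signal
        snrs.append(10.0 * np.log10(signal / max(leakage, 1e-12)))
    return float(np.mean(snrs))
```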
Figure 5-23. Two Speech Signals from TIMIT Database as Two Source Signals
Figure 5-24. Training Curve for Speech Signals. SNR (dB) vs Iterations
Experiment 2 uses two speech signals from the TIMIT database as source signals (shown in Figure 5-23). The mixing matrix is [1, 3.5; 0.8, 2.6], where the two mixing directions [1, 3.5] and [0.8, 2.6] are similar. Whitening is first done on the mixed signals. An on-line implementation is tried in this experiment, in which a short-time window slides over the speech data. In each window position, the speech data within the window are used to calculate the CS-CIP, the related forces and the back-propagated forces to adjust the de-mixing matrix. As the window slides, all speech data contribute to the de-mixing, and the contributions are accumulated. The training curve (SNR vs. sliding index, with the SNR approaching 49.15 dB) is shown in Figure 5-24, which tells us that the method converges fast and works very well. We can even say that it can track slow changes of the mixing. Although whitening is done before the CIP method, we believe that the whitening process can also be incorporated into this method. ED-QMI (ED-CIP) can also be used, and similar results have been obtained.

For blind source separation, the result is not sensitive to the kernel size for the cross information potential. A very large range of kernel sizes will work, e.g. from 0.01 to 100.
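Whitening, used here as a preprocessing step, is standard; a minimal sketch (assuming the usual symmetric, eigendecomposition-based whitening) is:

```python
import numpy as np

def whiten(x):
    # x: m x N array of mixed signals. Decorrelates and normalizes the
    # channels so the residual mixing is approximately a rotation.
    x = x - x.mean(axis=1, keepdims=True)
    d, E = np.linalg.eigh(np.cov(x))         # covariance eigendecomposition
    V = E @ np.diag(1.0 / np.sqrt(d)) @ E.T  # symmetric whitening matrix
    return V @ x
```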
5.4.3 Blind Source Separation by Maximizing Quadratic Entropy
Bell and Sejnowski [Bel95] have shown that a linear network with a nonlinear function at each output node can separate a linear mixture of independent signals by maximizing the output entropy. Here, quadratic entropy and the corresponding information potential will be used to implement the maximum entropy idea for BSS. Again, for ease of exposition, only the 2-source-2-sensor problem is tested. The source signals are the same speech signals
from the TIMIT database as above. The mixing matrix is [1 0.8; 3.5 2.78], near singular. It becomes [-0.5248 0.5273; 0.5876 0.467] after whitening, which is near orthogonal. The signal scattering plots are shown in Figure 5-25 for both the source and mixed signals.

Two narrow line-shaped distribution areas can be visually spotted in Figure 5-25, which correspond to the mixing directions. Usually, if such lines are clear, the BSS will be relatively easy. To test the IP method, a "bad" segment with only 600 samples is chosen, where no obvious line-shaped narrow distribution area can be seen (as shown in Figure 5-26). Figure 5-27 shows the mixed signals of this "bad" segment. All the experiments are done only on this "bad" segment.

The parameters used are the Gaussian kernel size \sigma^2, the initial step size s, and the decaying factor of the step size \alpha; the step size decays according to s(n) = s(n-1)\alpha, where n is the time index. Data points in the same "bad" segment are used for training. All results are for iterations from 0 to 10000; 'tanh' functions are used in the output space.

Figure 5-25. Signals Scattering Plots: source signals and mixed signals (after whitening).
Figure 5-26. A "bad" Segment of Source Signals: waveforms and scattering plot.

Figure 5-27. The Mixed Signals for the "bad" Segment (after whitening): waveforms and scattering plot (lines indicate mixing directions).

Figure 5-28. The Experiment Result with \sigma^2 = 0.01, s = 0.4, \alpha = 0.9999. Output signals' scattering plot (DDD: desired demixing direction; ADD: actual demixing direction) and training curve, demixing SNR (dB) vs. iterations (approaching 27.0956 dB).
Figure 5-29. The Experiment Result with \sigma^2 = 0.02, s = 0.4, \alpha = 1.0. Output signals' scattering plot (DDD: desired demixing direction; ADD: actual demixing direction) and training curve, demixing SNR (dB) vs. iterations (approaching 24.7210 dB).

Figure 5-30. The Experiment Result with \sigma^2 = 0.02, s = 0.2, \alpha = 1.0. Output signals' scattering plot (DDD: desired demixing direction; ADD: actual demixing direction) and training curve, demixing SNR (dB) vs. iterations (approaching 24.6759 dB).
Figure 5-31. The Experiment Result with \sigma^2 = 0.01, s = 1.0, \alpha = 1.0. Output signals' scattering plot (DDD: desired demixing direction; ADD: actual demixing direction) and training curve, demixing SNR (dB) vs. iterations (approaching 20.7904 dB).

5.4.4 Blind Source Separation with ED-QMI (ED-CIP) and MiniMax Method

For simplicity of exposition, and without changing the essence of the problem, we'll discuss only the case with 2 sources and 2 sensors. Equation (5.14) below is the mixing model, where only x_1(t), x_2(t) are observed. The source signals s_1(t), s_2(t) are statistically independent and unknown. The mixing directions M_1 and M_2 are different and also unknown. The problem is to find a demixing system (5.15) to recover the source signals up to a permutation and scaling. Equivalently, the problem is to find statistically orthogonal (independent) directions W_1 and W_2, rather than geometrically orthogonal (uncorrelated) directions as in PCA [Com94, Cao96, Car98a]. Nevertheless, geometrical orthogonality exists between demixing and mixing directions, e.g. either W_1 \perp M_1 or W_1 \perp M_2. Wu et al. [WuH98] have shown that even when there are more sources than sensors, i.e., there are no statistically orthogonal demixing directions, the mixing directions can still be identified as long as there are some signal segments with some sources being zero or near zero.
Looking for the mixing directions is therefore more essential than searching for demixing directions, and the non-stationary nature of the sources plays an important role.

[x_1(t); x_2(t)] = [m_{11}, m_{12}; m_{21}, m_{22}] [s_1(t); s_2(t)] = M_1 s_1(t) + M_2 s_2(t)     (5.14)

[y_1(t); y_2(t)] = [w_{11}, w_{12}; w_{21}, w_{22}]^T [x_1(t); x_2(t)] = [W_1^T x(t); W_2^T x(t)]     (5.15)

where W_1 and W_2 are the columns of the demixing matrix. From (5.14), if s_2 is zero or near zero, the distribution of the observed signals in the (x_1, x_2) plane will be along the direction of M_1, forming a "narrow band" data distribution, which is good for finding the mixing direction M_1. If s_1 and s_2 are comparable in energy, the mixing directions will be smeared, which is considered "bad." Figure 5-25 and Figure 5-26 give two opposite examples. Since there are "good" and "bad" data segments, we seek a technique to choose "good" ones while discarding "bad" segments. It should be pointed out that this issue is rarely addressed in the BSS literature. Most methods treat data equally and simply apply a criterion to achieve the independence of the demixing system outputs. Minimizing ED-CIP can be used for this purpose. In addition, ED-CIP can be used to distinguish "good" segments from "bad" ones.

Wu et al. [WuH98] utilize the non-stationarity of speech signals and the eigen-spread of different speech segments to choose "good" segments. However, how to decompose signals in the frequency domain to find "good" frequency bands remains obscure. It is well known that an instantaneous mixture will have the same mixture in all frequency bands, while a convolutive mixture will in general have different mixtures in different frequency bands (therefore, BSS for a convolutive mixture is a much more difficult problem than BSS for an instantaneous mixture). For an instantaneous mixture, different frequency
bands may reveal the same mixing direction. So, it is necessary to find "good" frequency bands in which the mixing directions are easier to find. For a convolutive mixture, treating different frequency bands differently may also be important, but we'll only discuss the problem related to the instantaneous mixture here.

Let h(t, \pi) denote the impulse response of an FIR filter with parameters \pi. Applying this filter to the observed signals, new observed signals are obtained:

[x_1'(t); x_2'(t)] = h(t, \pi) * [x_1(t); x_2(t)] = M_1 (h * s_1) + M_2 (h * s_2)     (5.16)

Obviously, the mixing directions remain unchanged. The problem here is how to choose \pi so that only one source signal dominates the dynamic range, so that the corresponding mixing direction is clearly revealed.

First, let's consider the case when the mixing matrix M = kR, where k is a positive scalar, R is a rotation transform (orthonormal matrix), and the mixing directions are near 45° or 135°. Obviously, when there is only one source, x_1' and x_2' are linearly dependent. So, the necessary condition to judge a "good" segment is the high dependence between x_1' and x_2'. But a more important problem is whether the high dependence between x_1' and x_2' can guarantee that there is only one dominating filtered source signal. The answer is yes. On one hand, since the source signals are independent, as long as the filter length is short enough (frequency band large enough), the filtered source signals will scatter in a wide region or in a narrow one along the natural bases (otherwise, the source signals are not independent). On the other hand, the mixing is a rotation by about 45 or 135 degrees or equivalent angles, and a narrow band distribution along these directions means high dependence between the two variables. So, if a narrow distribution in the (x_1', x_2') plane appears,
it must be the result of only one dominating source signal. To maximize the dependence between x_1' and x_2' based on the data set \{x'(\pi, i), i = 1, ..., N\}, where \pi are the parameters of the filter and N is the number of the filtered samples, ED-CIP can be used:

\pi_{optimal} = \arg\max_{\|\pi\| = 1} V_{ED}(\{x'(\pi, i), i = 1, ..., N\})     (5.17)

where \|\pi\| = 1 means the FIR filter is constrained to unit norm.

One narrow distribution can only be associated with one mixing direction. Once a desired filter with parameters \pi_1 and outputs x_1' is obtained, the remaining problem is how to obtain the second, the third, etc., so that the narrow distribution associated with another mixing direction will appear. One idea is to let the outputs of the filter be highly dependent on each other and at the same time be independent of all the outputs of the previous filters, e.g.

\pi_{2,optimal} = \arg\max_{\|\pi_2\| = 1} [\mu V_c(x_2') - (1 - \mu) V_c(x_2', x_1')]

where \mu is a weight that can change from 0 to 0.5 or to 1. After several "good" data sets x_1', ..., x_n' are obtained, the demixing can be found by minimizing the ED-CIP of the outputs y_i = W x_i', i = 1, ..., n, of the demixing on all chosen data sets:

W_{optimal} = \arg\min_W [V_c(y_1) + \cdots + V_c(y_n)]     (5.18)

This is why the method is called the "Mini-Max" method.

If the mixing is not a rotation, whitening can be done so that the mixing matrix will be close to the M = kR mentioned above. If the mixing directions (after whitening) are far from the 45 or 135 degree directions, a rotation transform can be further introduced before the filters. The parameters of the rotation will be trained by the same criterion and will converge to the state where the overall mixing direction is near the 45 or 135 degree direction. So the procedure
will be: 1) whitening; 2) training the parameters of a rotation transform; 3) training the parameters of the filters.

Since mixing directions can be identified easily by a narrow scattering (distribution), this method is also expected to enhance the demixing performance when the observation is corrupted by noise; i.e., x = M_1 s_1 + M_2 s_2 + Noise.

The same "bad" segment and mixing matrix as in the previous section will be used here (shown in Figure 5-26). Whitening is first done, and the mixed signals after whitening are shown in Figure 5-27. White Gaussian noise (SNR=0dB) is added into the mixed signals to make an even worse segment (shown in Figure 5-32). From Figure 5-27, we can see that the mixing directions are difficult to find. The case in Figure 5-32 is even worse due to the noise.

Figure 5-32. The "bad" Segment in Figure 5-27 + Noise (SNR=0dB): waveforms and scattering plot (lines indicate mixing directions).

By directly minimizing the ED-CIP of the outputs of a demixing system, the results in Figure 5-33 are obtained, from which we can see the average demixing performance converges to 32.18 dB for the case without noise, and 15.20 dB for the case with noise. Based only on the limited number of data points in the "bad" segment (the first 400 data points are
used), the Mini-ED-CIP method can still achieve good performance (compare these results with the results of the IP method in the previous section). This further verifies the validity of ED-CIP. By applying the Max-ED-CIP method to train the FIR filters, we get the results shown in Figure 5-34 and Figure 5-35, where frequency bands with only one dominating source signal are found, and the scattering distributions of the outputs of those filters match the mixing directions. Mini-Max-ED-CIP is further applied to these results to find the demixing system, obtaining an improved 38.50 dB average demixing performance for the case without noise, and 24.39 dB for the case with noise (Figure 5-36 and Table 5-7).
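Two small mechanical pieces of the Mini-Max procedure can be sketched directly from (5.16) and the unit-norm constraint in (5.17); the function names and the renormalization-after-update strategy are assumptions.

```python
import numpy as np

def project_unit_norm(pi):
    # Enforce the constraint ||pi|| = 1 on the FIR filter parameters (5.17).
    return pi / np.linalg.norm(pi)

def filter_pair(x, pi):
    # Apply the same FIR filter h(t, pi) to both observed channels, which
    # leaves the mixing directions unchanged (cf. (5.16)).
    return np.vstack([np.convolve(x[0], pi, mode="valid"),
                      np.convolve(x[1], pi, mode="valid")])
```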
In this section, it was pointed out that finding the mixing directions is more essential than obtaining the demixing directions. Maximizing ED-CIP can help to obtain frequency bands in which the mixing directions are easier to find. The Mini-Max-ED-CIP method can improve the demixing performance over the Mini-ED-CIP method. Although the experiments presented here are specific ones, they further confirm the effectiveness of the ED-CIP method. The work on Mini-Max-ED-CIP is preliminary, but it suggests the other extreme (maximizing mutual information) for BSS, compared with all the current methods (minimizing mutual information). As ancient philosophy suggests, two opposite extremes can often exchange. It is worthwhile to explore this direction for BSS and even blind deconvolution.
Table 5-7. Demixing Performance Comparison
The case without noise The case with noise
Mini-CIP 32.18 dB 15.20 dB
Mini-Max-CIP 38.50 dB 24.39 dB
Figure 5-33. Performance by Minimizing ED-CIP: the case without noise (demixing SNR approaching 32.18 dB) and the case with noise (demixing SNR approaching 15.20 dB).

Figure 5-34. The results of filters FIR 1 and FIR 2 obtained by Max-ED-CIP (the case without noise). (a) Distribution of the outputs of FIR 1; (b) source signals filtered by FIR 1, ratio of the two signals: from -0.87 dB to 13.21 dB; (c) distribution of the outputs of FIR 2; (d) source signals filtered by FIR 2, ratio of the two signals: from -0.87 dB to -19.86 dB.
Figure 5-35. The results of filters FIR 3 and FIR 4 obtained by Max-ED-CIP (with noise). (a) Distribution of the outputs of FIR 3; (b) source signals filtered by FIR 3, ratio of the two signals: from -0.87 dB to 13.02 dB; (c) distribution of the outputs of FIR 4; (d) source signals filtered by FIR 4, ratio of the two signals: from -0.87 dB to -13.84 dB.

Figure 5-36. The Performance by Mini-Max ED-CIP: the case without noise (demixing SNR approaching 38.50 dB) and the case with noise (demixing SNR approaching 24.39 dB).
CHAPTER 6
CONCLUSIONS AND FUTURE WORK

In this chapter, we would like to summarize the issues addressed in this dissertation and the contributions we made towards their solutions. The initial goal was to establish a general nonparametric method for information entropy and mutual information estimation based only on data samples, without any other assumption. From a physical point of view, the world is a "mass-energy" system. It turns out that entropy and mutual information can also be viewed from this point of view. Based on other general measures for entropy, such as Renyi's entropy, we interpret entropy as a rescaled norm of a pdf function and proposed the idea of the quadratic mutual information. Based on these general measures, the concepts of "information potential" and "cross information potential" are proposed. The ordinary energy definition for a signal and the proposed IP and CIP are put together to give a unifying point of view about these fundamental measures, which are crucial for signal processing and adaptive learning in general. With such a fundamental tool, a general information-theoretic learning framework is given which contains all the currently existing information-theoretic learning as special cases. More importantly, we not only give a general learning principle, but also give an effective implementation of this general principle. We break the barrier of model linearity and the Gaussian assumption on data, which are the major limitations of most existing methods. In Chapter 4, a case study on learning with an on-line local rule is presented. We establish the link between the power field, which is a
special case of the information potential field, and the famous biological learning rules: the Hebbian and the anti-Hebbian rules. Based on this basic understanding, we developed an on-line local learning algorithm for the generalized eigendecomposition for signals. Simulations and experiments of these methods are conducted on several problems, such as aspect angle estimation for SAR imagery, target recognition, layer-by-layer training of multilayer neural networks, and blind source separation. The results further confirm the proposed methodology.

The major problem left is the further theoretical justification of the quadratic mutual information. The basis for the QMI as an independence measure is strong. We further provide some intuitive arguments that it is also appropriate as a dependence measure, and we apply the criteria successfully to solve several problems. However, there is still no rigorous theoretical proof that the QMI is appropriate for mutual information maximization.

The problem of on-line learning with IP or CIP is mentioned in Chapter 4. Since IP or CIP examines such detailed information as the relative position of each pair of data samples, it is very difficult to design an on-line algorithm for IP and CIP. The on-line rule for an energy measure is relatively easy to obtain because it only examines the relative position of each data sample to the mean point. Thus, each data point is relatively independent of the others, while IP or CIP needs to take care of the relation of each data sample to all the others. One solution to this problem may come from the use of a mixture model where the means of subclasses of all data are used. Then only the relative position between each data sample and each subclass mean needs to be considered. Each mean may act just like a "heavy" data point with more "mass" than an ordinary data sample. These "heavy" data points may serve as a kind of memory in a learning process. The IP or CIP may then
become the IP or CIP of each sample in the IP or CIP field of these "heavy mean particles." Based on this scheme, an on-line algorithm may be developed. The Gaussian mixture model and the EM algorithm mentioned in Chapter 3 may be the powerful tools to obtain such "heavy information particles."

The computational complexity of the IP or CIP method is of the order O(N^2), where N is the number of data samples. With the "heavy information particles" (suppose there are M such "particles," M \ll N, and M may be fixed), the computational complexity may be reduced to the order O(MN). So, it may be very significant to further study this possibility.

In terms of algorithmic implementation, how to choose the kernel size for IP and CIP was not discussed in the previous chapters. We empirically chose the kernel size during our experiments. It has been observed that the CIP is not sensitive to the kernel size, but the kernel size may be crucial for the IP. Further study of this issue, or even a method to select the optimal kernel size, is important for the IP and the CIP methods.

The IP and the CIP methods are general. They may find many applications in practical problems. To find more applications will also be an important work in the future.
APPENDIX A
THE INTEGRATION OF THE PRODUCT OF GAUSSIAN KERNELS

Let G(y, \Sigma) = \frac{1}{(2\pi)^{k/2} |\Sigma|^{1/2}} \exp(-\frac{1}{2} y^T \Sigma^{-1} y) be the Gaussian function in the k-dimensional space, where \Sigma is the covariance matrix and y \in R^k. Let a_i \in R^k and a_j \in R^k be two data points in the space, and \Sigma_1 and \Sigma_2 be two covariance matrices for two Gaussian kernels in the space. Then we have

\int_{-\infty}^{+\infty} G(y - a_i, \Sigma_1) G(y - a_j, \Sigma_2) dy = G((a_i - a_j), (\Sigma_1 + \Sigma_2))     (A.1)

Similarly, the integration of the product of three Gaussian kernels can also be obtained. The following is the proof of (A.1).

Proof:

1. Let d = a_i - a_j; then (A.1) becomes

\int_{-\infty}^{+\infty} G(y - d, \Sigma_1) G(y, \Sigma_2) dy = G(d, (\Sigma_1 + \Sigma_2))     (A.2)

2. Let c = (\Sigma_1^{-1} + \Sigma_2^{-1})^{-1} \Sigma_1^{-1} d; then we have

(y - d)^T \Sigma_1^{-1} (y - d) + y^T \Sigma_2^{-1} y = (y - c)^T (\Sigma_1^{-1} + \Sigma_2^{-1})(y - c) + d^T (\Sigma_1 + \Sigma_2)^{-1} d     (A.3)

Actually, by the matrix inversion lemma [Gol93],

(A + C B C^T)^{-1} = A^{-1} - A^{-1} C (B^{-1} + C^T A^{-1} C)^{-1} C^T A^{-1}     (A.4)

and letting A = \Sigma_1, B = \Sigma_2 and C = I (the identity matrix), we have

\Sigma_1^{-1} - \Sigma_1^{-1} (\Sigma_2^{-1} + \Sigma_1^{-1})^{-1} \Sigma_1^{-1} = (\Sigma_1 + \Sigma_2)^{-1}     (A.5)

Since \Sigma_1 and \Sigma_2 are symmetric, we have

(y - d)^T \Sigma_1^{-1} (y - d) + y^T \Sigma_2^{-1} y
  = y^T \Sigma_1^{-1} y - 2 d^T \Sigma_1^{-1} y + d^T \Sigma_1^{-1} d + y^T \Sigma_2^{-1} y
  = y^T (\Sigma_2^{-1} + \Sigma_1^{-1}) y - 2 c^T (\Sigma_2^{-1} + \Sigma_1^{-1}) y + c^T (\Sigma_2^{-1} + \Sigma_1^{-1}) c - c^T (\Sigma_2^{-1} + \Sigma_1^{-1}) c + d^T \Sigma_1^{-1} d
  = (y - c)^T (\Sigma_2^{-1} + \Sigma_1^{-1}) (y - c) - d^T \Sigma_1^{-1} (\Sigma_1^{-1} + \Sigma_2^{-1})^{-1} \Sigma_1^{-1} d + d^T \Sigma_1^{-1} d
  = (y - c)^T (\Sigma_1^{-1} + \Sigma_2^{-1}) (y - c) + d^T [\Sigma_1^{-1} - \Sigma_1^{-1} (\Sigma_1^{-1} + \Sigma_2^{-1})^{-1} \Sigma_1^{-1}] d
  = (y - c)^T (\Sigma_1^{-1} + \Sigma_2^{-1}) (y - c) + d^T (\Sigma_1 + \Sigma_2)^{-1} d     (A.6)

3. Since |A| |B| = |AB| and |A^{-1}| = |A|^{-1} (if A^{-1} exists), we have

\frac{|\Sigma_1 + \Sigma_2| \, |(\Sigma_1^{-1} + \Sigma_2^{-1})^{-1}|}{|\Sigma_1| |\Sigma_2|} = |\Sigma_1^{-1}| |\Sigma_2^{-1}| |\Sigma_1 + \Sigma_2| |(\Sigma_1^{-1} + \Sigma_2^{-1})^{-1}| = |(\Sigma_2^{-1} + \Sigma_1^{-1})(\Sigma_1^{-1} + \Sigma_2^{-1})^{-1}| = 1     (A.7)

4. Based on (A.3) and (A.7), we have

G(y - d, \Sigma_1) G(y, \Sigma_2) = G(y - c, (\Sigma_1^{-1} + \Sigma_2^{-1})^{-1}) G(d, (\Sigma_1 + \Sigma_2))     (A.8)

Actually, by applying (A.3) and (A.7), we have

G(y - d, \Sigma_1) G(y, \Sigma_2)
  = \frac{1}{(2\pi)^k |\Sigma_1|^{1/2} |\Sigma_2|^{1/2}} \exp\{-\frac{1}{2} [(y - d)^T \Sigma_1^{-1} (y - d) + y^T \Sigma_2^{-1} y]\}
  = \frac{1}{(2\pi)^k |\Sigma_1|^{1/2} |\Sigma_2|^{1/2}} \exp\{-\frac{1}{2} [(y - c)^T (\Sigma_1^{-1} + \Sigma_2^{-1}) (y - c) + d^T (\Sigma_1 + \Sigma_2)^{-1} d]\}
  = \left[\frac{|\Sigma_1 + \Sigma_2| \, |(\Sigma_1^{-1} + \Sigma_2^{-1})^{-1}|}{|\Sigma_1| |\Sigma_2|}\right]^{1/2} G(y - c, (\Sigma_1^{-1} + \Sigma_2^{-1})^{-1}) G(d, (\Sigma_1 + \Sigma_2))
  = G(y - c, (\Sigma_1^{-1} + \Sigma_2^{-1})^{-1}) G(d, (\Sigma_1 + \Sigma_2))     (A.9)

5. Since G(\cdot, \cdot) is a Gaussian pdf function and its integration equals 1, we have

\int_{-\infty}^{+\infty} G(y - d, \Sigma_1) G(y, \Sigma_2) dy = G(d, (\Sigma_1 + \Sigma_2)) \int_{-\infty}^{+\infty} G(y - c, (\Sigma_1^{-1} + \Sigma_2^{-1})^{-1}) dy = G(d, (\Sigma_1 + \Sigma_2))     (A.10)

So (A.2) is proved and, equivalently, (A.1) is proved.
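As a quick numerical sanity check of (A.1) in one dimension (the values below are illustrative):

```python
import numpy as np

def gauss(u, var):
    # Scalar Gaussian kernel G(u, var).
    return np.exp(-0.5 * u**2 / var) / np.sqrt(2.0 * np.pi * var)

ai, aj = 0.7, -0.4            # two data points
v1, v2 = 0.5, 1.3             # two kernel variances
y, dy = np.linspace(-30.0, 30.0, 600001, retstep=True)
lhs = np.sum(gauss(y - ai, v1) * gauss(y - aj, v2)) * dy  # Riemann sum of the product
rhs = gauss(ai - aj, v1 + v2)                             # right-hand side of (A.1)
print(lhs, rhs)               # the two values agree to numerical precision
```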
APPENDIX B
SHANNON ENTROPY OF A MULTI-DIMENSIONAL GAUSSIAN VARIABLE

For a Gaussian random variable X \in R^k with pdf f_X(x) = \frac{1}{(2\pi)^{k/2} |\Sigma|^{1/2}} \exp(-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)), where \mu is the mean and \Sigma is the covariance matrix, Shannon's information entropy is

H_s(X) = \frac{1}{2} \log |\Sigma| + \frac{k}{2} \log 2\pi + \frac{k}{2}     (B.1)

Proof:

H_s(X) = E[-\log f_X(x)]
  = E[\frac{k}{2} \log 2\pi + \frac{1}{2} \log |\Sigma| + \frac{1}{2} (X - \mu)^T \Sigma^{-1} (X - \mu)]
  = \frac{1}{2} \log |\Sigma| + \frac{k}{2} \log 2\pi + \frac{1}{2} E[tr((X - \mu)^T \Sigma^{-1} (X - \mu))]
  = \frac{1}{2} \log |\Sigma| + \frac{k}{2} \log 2\pi + \frac{1}{2} E[tr((X - \mu)(X - \mu)^T \Sigma^{-1})]
  = \frac{1}{2} \log |\Sigma| + \frac{k}{2} \log 2\pi + \frac{1}{2} tr(E[(X - \mu)(X - \mu)^T] \Sigma^{-1})
  = \frac{1}{2} \log |\Sigma| + \frac{k}{2} \log 2\pi + \frac{1}{2} tr(I)
  = \frac{1}{2} \log |\Sigma| + \frac{k}{2} \log 2\pi + \frac{k}{2}     (B.2)

where tr[\cdot] is the trace operator.
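A Monte Carlo spot-check of (B.1) with an illustrative covariance:

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[2.0, 0.3], [0.3, 0.5]])
k = Sigma.shape[0]
closed_form = 0.5 * np.log(np.linalg.det(Sigma)) + (k / 2) * np.log(2 * np.pi) + k / 2
x = rng.multivariate_normal(np.zeros(k), Sigma, size=200000)
inv = np.linalg.inv(Sigma)
# log f_X(x) evaluated at the samples; H_s is the mean of -log f_X.
logf = (-0.5 * np.einsum('ij,jk,ik->i', x, inv, x)
        - 0.5 * np.log(np.linalg.det(Sigma)) - (k / 2) * np.log(2 * np.pi))
print(closed_form, -logf.mean())   # the two estimates agree closely
```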
APPENDIX C
RENYI ENTROPY OF A MULTI-DIMENSIONAL GAUSSIAN VARIABLE

For a Gaussian random variable X \in R^k with pdf f_X(x) = \frac{1}{(2\pi)^{k/2} |\Sigma|^{1/2}} \exp(-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)), where \mu is the mean and \Sigma is the covariance matrix, Renyi's information entropy is

H_{R\alpha}(X) = \frac{1}{2} \log |\Sigma| + \frac{k}{2} \log 2\pi + \frac{k}{2} \frac{\log \alpha}{\alpha - 1}     (C.1)

Proof: using (A.1), we have

\int_{-\infty}^{+\infty} f_X(x)^{\alpha} dx = \int_{-\infty}^{+\infty} G(x - \mu, \Sigma)^{\alpha/2} G(x - \mu, \Sigma)^{\alpha/2} dx
  = (2\pi)^{k(1 - \alpha/2)} (\frac{2}{\alpha})^k |\Sigma|^{(1 - \alpha/2)} \int_{-\infty}^{+\infty} G(x - \mu, \frac{2}{\alpha}\Sigma) G(x - \mu, \frac{2}{\alpha}\Sigma) dx
  = (2\pi)^{k(1 - \alpha/2)} (\frac{2}{\alpha})^k |\Sigma|^{(1 - \alpha/2)} G(0, \frac{4}{\alpha}\Sigma)
  = \frac{(2\pi)^{k(1 - \alpha/2)} (\frac{2}{\alpha})^k |\Sigma|^{(1 - \alpha/2)}}{(2\pi)^{k/2} |\frac{4}{\alpha}\Sigma|^{1/2}}
  = (2\pi)^{\frac{k}{2}(1 - \alpha)} \alpha^{-\frac{k}{2}} |\Sigma|^{\frac{1}{2}(1 - \alpha)}     (C.2)

H_{R\alpha}(X) = \frac{1}{1 - \alpha} \log \int_{-\infty}^{+\infty} f_X(x)^{\alpha} dx = \frac{1}{1 - \alpha} \log [(2\pi)^{\frac{k}{2}(1 - \alpha)} \alpha^{-\frac{k}{2}} |\Sigma|^{\frac{1}{2}(1 - \alpha)}]
  = \frac{1}{2} \log |\Sigma| + \frac{k}{2} \log 2\pi + \frac{k}{2} \frac{\log \alpha}{\alpha - 1}     (C.3)
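Note that as \alpha \to 1, \log \alpha / (\alpha - 1) \to 1, so (C.1) reduces to Shannon's entropy (B.1), as expected.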
APPENDIX D
H-C ENTROPY OF A MULTI-DIMENSIONAL GAUSSIAN VARIABLE

For a Gaussian random variable X \in R^k with pdf f_X(x) = \frac{1}{(2\pi)^{k/2} |\Sigma|^{1/2}} \exp(-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)), where \mu is the mean and \Sigma is the covariance matrix, Havrda-Charvat's information entropy is

H_{h\alpha}(X) = \frac{1}{1 - \alpha} [(2\pi)^{\frac{k}{2}(1 - \alpha)} \alpha^{-\frac{k}{2}} |\Sigma|^{\frac{1}{2}(1 - \alpha)} - 1]     (D.1)

Proof: using (C.2), we have

H_{h\alpha}(X) = \frac{1}{1 - \alpha} [\int_{-\infty}^{+\infty} f_X(x)^{\alpha} dx - 1] = \frac{1}{1 - \alpha} [(2\pi)^{\frac{k}{2}(1 - \alpha)} \alpha^{-\frac{k}{2}} |\Sigma|^{\frac{1}{2}(1 - \alpha)} - 1]     (D.2)
REFERENCES

[Ace92] A. Acero, Acoustical and Environmental Robustness in Automatic Speech Recognition, Kluwer Academic Publishers, Boston, 1992.

[Acz75] J. Aczel and Z. Daroczy, On Measures of Information and Their Characterizations, Academic Press, New York, 1975.

[Ama98] S. Amari, "Natural Gradient Works Efficiently in Learning," Neural Computation, Vol.10, No.2, pp.251-276, February, 1998.

[Att54] F. Attneave, "Some Informational Aspects of Visual Perception," Psychological Review, Vol.61, pp.183-193, 1954.

[Bat94] R. Battiti, "Using Mutual Information for Selecting Features in Supervised Neural Net Learning," IEEE Transactions on Neural Networks, Vol.5, No.4, pp.537-550, July, 1994.

[Bec89] S. Becker and G. E. Hinton, "Spatial Coherence as an Internal Teacher for a Neural Network," Technical Report GRG-TR-89-7, Department of Computer Science, University of Toronto, Ontario, 1989.

[Bec92] S. Becker and G. E. Hinton, "A Self-Organizing Neural Network That Discovers Surfaces in Random-dot Stereograms," Nature (London), Vol.355, pp.161-163, 1992.

[Bel95] A. J. Bell and T. J. Sejnowski, "An Information-Maximization Approach to Blind Separation and Blind Deconvolution," Neural Computation, Vol.7, No.6, pp.1129-1159, November, 1995.

[Car97] J.-F. Cardoso, "Infomax and Maximum Likelihood for Blind Source Separation," IEEE Signal Processing Letters, Vol.4, No.4, pp.112-114, April, 1997.

[Car98a] J.-F. Cardoso, "Multidimensional Independent Component Analysis," Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, pp.1941-1944, Seattle, 1998.

[Car98b] J.-F. Cardoso, "Blind Signal Separation: A Review," Proceedings of the IEEE, 1998, to appear.
[Cao96] X.-R. Cao and R.-W. Liu, "General Approach to Blind Source Separation," IEEE Transactions on Signal Processing, Vol.44, pp.562-571, March, 1996.

[Cha87] D. Chandler, Introduction to Modern Statistical Mechanics, Oxford University Press, New York, 1987.

[Cha97] C. Chatterjee, V. P. Roychowdhury, J. Ramos and M. D. Zoltowski, "Self-Organizing Algorithms for Generalized Eigen-decomposition," IEEE Transactions on Neural Networks, Vol.8, No.6, pp.1518-1530, November, 1997.

[Chr80] R. Christensen, Entropy MiniMax Sourcebook, Vol.3, Computer Implementation, First Edition, Entropy Limited, Lincoln, MA, 1980.

[Chr81] R. Christensen, Entropy MiniMax Sourcebook, Vol.1, General Description, First Edition, Entropy Limited, Lincoln, MA, 1981.

[Com94] P. Comon, "Independent Component Analysis, A New Concept?" Signal Processing, Vol.36, pp.287-314, April, 1994, Special Issue on Higher-Order Statistics.

[Cor95] C. Cortes and V. Vapnik, "Support-Vector Networks," Machine Learning, Vol.20, No.3, pp.273-297, 1995.

[Dec96] G. Deco and D. Obradovic, An Information-Theoretic Approach to Neural Computing, Springer, New York, 1996.

[Dem77] A. P. Dempster, N. M. Laird and D. B. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm (with Discussion)," Journal of the Royal Statistical Society B, Vol.39, pp.1-38, 1977.

[Dev85] L. Devroye and L. Gyorfi, Nonparametric Density Estimation: The L1 View, Wiley, New York, 1985.

[deV92] B. deVries and J. C. Principe, "The Gamma Model--A New Neural Model for Temporal Processing," Neural Networks, Vol.5, pp.565-576, 1992.

[Dia96] K. I. Diamantaras and S. Y. Kung, Principal Component Neural Networks: Theory and Applications, John Wiley & Sons, Inc., New York, 1996.

[Dud73] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis, John Wiley & Sons, New York, 1973.

[Dud98] R. Duda, P. E. Hart and D. G. Stork, Pattern Classification and Scene Analysis, Preliminary Preprint Version, to be published by John Wiley & Sons, Inc.
[Fis97] J. W. Fisher, "Nonlinear Extensions to the Minimum Average Correlation Energy Filter," Ph.D. dissertation, Department of Electrical and Computer Engineering, University of Florida, Gainesville, 1997.

[Gal88] A. R. Gallant and H. White, "There Exists a Neural Network That Does Not Make Avoidable Mistakes," IEEE International Conference on Neural Networks, Vol.1, pp.657-664, San Diego, 1988.

[Gil81] P. E. Gill, W. Murray and M. H. Wright, Practical Optimization, Academic Press, New York, 1981.

[Gol93] G. Golub and C. Van Loan, Matrix Computations, Second Edition, Johns Hopkins University Press, Baltimore, 1993.

[Hak88] H. Haken, Information and Self-Organization: A Macroscopic Approach to Complex Systems, Springer-Verlag, New York, 1988.

[Har28] R. V. Hartley, "Transmission of Information," Bell System Technical Journal, Vol.7, pp.535-563, 1928.

[Har34] G. H. Hardy, J. E. Littlewood and G. Polya, Inequalities, University Press, Cambridge, 1934.

[Hav67] J. Havrda and F. Charvat, "Quantification Methods of Classification Processes: Concept of Structural Entropy," Kybernetika, Vol.3, pp.30-35, 1967.

[Hay94] S. Haykin, Neural Networks: A Comprehensive Foundation, Macmillan Publishing Company, New York, 1994.

[Hay94a] S. Haykin, Blind Deconvolution, Prentice Hall, Englewood Cliffs, New Jersey, 1994.

[Hay96] S. Haykin, Adaptive Filter Theory, Third Edition, Prentice Hall, Englewood Cliffs, NJ, 1996.

[Hay98] S. Haykin, Neural Networks: A Comprehensive Foundation, Second Edition, Prentice Hall, Englewood Cliffs, NJ, 1998.

[Heb49] D. O. Hebb, The Organization of Behavior: A Neuropsychological Theory, Wiley, New York, 1949.

[Hec87] R. Hecht-Nielsen, "Kolmogorov's Mapping Neural Network Existence Theorem," 1st IEEE International Conference on Neural Networks, Vol.3, pp.11-14, San Diego, 1987.

[Hes80] M. Hestenes, Conjugate Direction Methods in Optimization, Springer-Verlag, New York, 1980.
[Hon84] M. L. Honig and D. G. Messerschmitt, Adaptive Filters: Structures, Algorithms, and Applications, Kluwer Academic Publishers, Boston, 1984.

[Hua90] X. D. Huang, Y. Ariki and M. A. Jack, Hidden Markov Models for Speech Recognition, University Press, Edinburgh, 1990.

[Jay57] E. T. Jaynes, "Information Theory and Statistical Mechanics, I, II," Physical Review, Vol.106, pp.620-630, and Vol.108, pp.171-190, 1957.

[Jum86] G. Jumarie, Subjectivity, Information, Systems: Introduction to a Theory of Relativistic Cybernetics, Gordon and Breach Science Publishers, New York, 1986.

[Jum90] G. Jumarie, Relative Information: Theories and Applications, Springer-Verlag, New York, 1990.

[Kap92] J. N. Kapur and H. K. Kesavan, Entropy Optimization Principles with Applications, Academic Press, Inc., New York, 1992.

[Kap94] J. N. Kapur, Measures of Information and Their Applications, John Wiley & Sons, New York, 1994.

[Kha92] H. K. Khalil, Nonlinear Systems, Macmillan, New York, 1992.

[Kol94] J. E. Kolassa, Series Approximation Methods in Statistics, Springer-Verlag, New York, 1994.

[Kub75] L. Kubat and J. Zeman (Eds.), Entropy and Information in Science and Philosophy, Elsevier Scientific Publishing Company, Amsterdam, 1975.

[Kul68] S. Kullback, Information Theory and Statistics, Dover Publications, Inc., New York, 1968.

[Kun94] S. Y. Kung, K. I. Diamantaras and J. S. Taur, "Adaptive Principal Component Extraction (APEX) and Applications," IEEE Transactions on Signal Processing, Vol.42, No.5, pp.1202-1217, May, 1994.

[Lan88] K. J. Lang and G. E. Hinton, "The Development of the Time-Delay Neural Network Architecture for Speech Recognition," Technical Report CMU-CS-88-152, Carnegie-Mellon University, Pittsburgh, PA, 1988.

[Lin88] R. Linsker, "Self-Organization in a Perceptual Network," Computer, Vol.21, pp.105-117, 1988.
[Lin89] R. Linsker, "An Application of the Principle of Maximum Information Preservation to Linear Systems," in Advances in Neural Information Processing Systems (edited by D. S. Touretzky), pp.186-194, Morgan Kaufmann, San Mateo, 1989.

[Mao95] J. Mao and A. K. Jain, "Artificial Neural Networks for Feature Extraction and Multivariate Data Projection," IEEE Transactions on Neural Networks, Vol.6, No.2, pp.296-317, March, 1995.

[Mcl88] G. J. McLachlan and K. E. Basford, Mixture Models: Inference and Applications to Clustering, Marcel Dekker, Inc., New York, 1988.

[Mcl96] G. J. McLachlan and T. Krishnan, The EM Algorithm and Extensions, John Wiley & Sons, Inc., New York, 1996.

[Men70] J. M. Mendel and R. W. McLaren, "Reinforcement-Learning Control and Pattern Recognition Systems," in Adaptive, Learning, and Pattern Recognition Systems: Theory and Applications, Vol.66 (edited by J. M. Mendel and K. S. Fu), pp.287-318, Academic Press, New York, 1970.

[Min69] M. L. Minsky and S. A. Papert, Perceptrons, MIT Press, Cambridge, MA, 1969.

[Ngu95] H. L. Nguyen and C. Jutten, "Blind Sources Separation for Convolutive Mixtures," Signal Processing, Vol.45, No.2, pp.209-229, August, 1995.

[Nob88] B. Noble and J. W. Daniel, Applied Linear Algebra, Prentice-Hall, Englewood Cliffs, NJ, 1988.

[Nyq24] H. Nyquist, "Certain Factors Affecting Telegraph Speed," Bell System Technical Journal, Vol.3, pp.332-333, 1924.

[Oja82] E. Oja, "A Simplified Neuron Model as a Principal Component Analyzer," Journal of Mathematical Biology, Vol.15, pp.267-273, 1982.

[Oja83] E. Oja, Subspace Methods of Pattern Recognition, John Wiley, New York, 1983.

[Pap91] A. Papoulis, Probability, Random Variables, and Stochastic Processes, Third Edition, McGraw-Hill, Inc., New York, 1991.

[Par62] E. Parzen, "On the Estimation of a Probability Density Function and the Mode," Ann. Math. Stat., Vol.33, pp.1065-1076, 1962.

[Par91] J. Park and I. W. Sandberg, "Universal Approximation Using Radial-Basis-Function Networks," Neural Computation, Vol.3, pp.246-257, 1991.

[Pha96] D. T. Pham, "Blind Separation of Instantaneous Mixture of Sources via an Independent Component Analysis," IEEE Transactions on Signal Processing, Vol.44, No.11, pp.2768-2779, November, 1996.

[Plu88] M. D. Plumbley and F. Fallside, "An Information-Theoretic Approach to Unsupervised Connectionist Models," in Proceedings of the 1988 Connectionist Models Summer School (edited by D. Touretzky, G. Hinton and T. Sejnowski), pp.239-245, Morgan Kaufmann, San Mateo, CA, 1988.

[Pog90] T. Poggio and F. Girosi, "Networks for Approximation and Learning," Proceedings of the IEEE, Vol.78, pp.1481-1497, 1990.

[Pri93] J. C. Principe, B. deVries and P. Guedes de Oliveira, "The Gamma Filters: A New Class of Adaptive IIR Filters with Restricted Feedback," IEEE Transactions on Signal Processing, Vol.41, No.2, pp.649-656, 1993.

[Pri97a] J. C. Principe, D. Xu and C. Wang, "Generalized Oja's Rule for Linear Discriminant Analysis with Fisher Criterion," Proceedings of the 1997 IEEE International Conference on Acoustics, Speech and Signal Processing, pp.3401-3404, Munich, Germany, 1997.

[Pri97b] J. C. Principe and D. Xu, "Classification with Linear Networks Using an On-Line Constrained LDA Algorithm," Proceedings of the 1997 IEEE Workshop on Neural Networks for Signal Processing VII, pp.286-295, Amelia Island, FL, 1997.

[Pri98] J. C. Principe, Q. Zhao and D. Xu, "A Novel ATR Classifier Exploiting Pose Information," Proceedings of the 1998 Image Understanding Workshop, Vol.2, pp.833-838, Monterey, California, 1998.

[Rab93] L. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, Prentice Hall, Englewood Cliffs, NJ, 1993.

[Ren60] A. Renyi, "Some Fundamental Questions of Information Theory," in Selected Papers of Alfred Renyi, Vol.2, pp.526-552, Akademiai Kiado, Budapest, 1976.

[Ren61] A. Renyi, "On Measures of Entropy and Information," in Selected Papers of Alfred Renyi, Vol.2, pp.565-580, Akademiai Kiado, Budapest, 1976.

[Ros58] F. Rosenblatt, "The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain," Psychological Review, Vol.65, pp.386-408, 1958.

[Ros62] F. Rosenblatt, Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms, Spartan Books, Washington, DC, 1962.

[Ru86a] D. E. Rumelhart and J. L. McClelland, eds., Parallel Distributed Processing: Explorations in the Microstructure of Cognition, MIT Press, Cambridge, MA, 1986.
[Ru86b] D. E. Rumelhart, G. E. Hinton and R. J. Williams, "Learning Representations by Back-Propagating Errors," Nature (London), Vol.323, pp.533-536, 1986.

[Ru86c] D. E. Rumelhart, G. E. Hinton and R. J. Williams, "Learning Internal Representations by Error Propagation," in Parallel Distributed Processing, Vol.1, Chapter 8, MIT Press, Cambridge, MA, 1986.

[Sha48] C. E. Shannon, "A Mathematical Theory of Communication," Bell System Technical Journal, Vol.27, pp.379-423, pp.623-653, 1948.

[Sha62] C. E. Shannon and W. Weaver, The Mathematical Theory of Communication, University of Illinois Press, Urbana, 1962.

[Sil86] B. W. Silverman, Density Estimation for Statistics and Data Analysis, Chapman and Hall, New York, 1986.

[Tri71] M. Tribus and E. C. McIrvine, "Energy and Information," Scientific American, Vol.225, September, 1971.

[Ukr92] A. Ukrainec and S. Haykin, "Enhancement of Radar Images Using Mutual Information Based Unsupervised Neural Networks," Canadian Conference on Electrical and Computer Engineering, pp.MA6.9.1-MA6.9.4, Toronto, Canada, 1992.

[Vap95] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, 1995.

[Ved97] Veda Incorporated, MSTAR data set, 1997.

[Vio95] P. Viola, N. Schraudolph and T. Sejnowski, "Empirical Entropy Manipulation for Real-World Problems," Proceedings of the Neural Information Processing Systems (NIPS 8) Conference, pp.851-857, Denver, Colorado, 1995.

[Wai89] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano and K. J. Lang, "Phoneme Recognition Using Time-Delay Neural Networks," IEEE Transactions on Acoustics, Speech and Signal Processing, Vol.ASSP-37, pp.328-339, 1989.

[Wan96] C. Wang, H. Wu and J. Principe, "Correlation Estimation Using Teacher Forcing Hebbian Learning and Its Application," Proceedings of the 1996 IEEE International Conference on Neural Networks, pp.282-287, Washington, DC, June, 1996.

[Weg72] E. J. Wegman, "Nonparametric Probability Density Estimation: I. A Summary of Available Methods," Technometrics, Vol.14, No.3, August, 1972.

[Wer90] P. J. Werbos, "Backpropagation Through Time: What It Does and How to Do It," Proceedings of the IEEE, Vol.78, pp.1550-1560, 1990.
[Wid63] B. Widrow, A Statistical Theory of Adaptation, Pergamon Press, Oxford, 1963.

[Wid85] B. Widrow, Adaptive Signal Processing, Prentice-Hall, Englewood Cliffs, New Jersey, 1985.

[Wil62] S. S. Wilks, Mathematical Statistics, John Wiley & Sons, Inc., New York, 1962.

[Wil89] R. J. Williams and D. Zipser, "A Learning Algorithm for Continually Running Fully Recurrent Neural Networks," Neural Computation, Vol.1, pp.270-280, 1989.

[Wil90] R. J. Williams and J. Peng, "An Efficient Gradient-Based Algorithm for On-Line Training of Recurrent Network Trajectories," Neural Computation, Vol.2, pp.490-501, 1990.

[WuH98] H.-C. Wu, J. Principe and D. Xu, "Exploring the Tempo-Frequency Micro-Structure of Speech for Blind Source Separation," Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, Vol.2, pp.1145-1148, 1998.

[XuD95] D. Xu, "EM Algorithm and Baum-Eagon Inequality: Some Generalization and Specification," Technical Report, CNEL, Department of Electrical and Computer Engineering, University of Florida, Gainesville, November, 1995.

[XuD96] D. Xu, C. Fancourt and C. Wang, "Multi-Channel HMM," Proceedings of the 1996 International Conference on Acoustics, Speech and Signal Processing, Vol.2, pp.841-844, Atlanta, GA, 1996.

[XuD98a] D. Xu, J. Fisher and J. C. Principe, "A Mutual Information Approach to Pose Estimation," Algorithms for Synthetic Aperture Radar Imagery V, SPIE 98, Vol.3370, pp.218-229, Orlando, FL, 1998.

[XuD98] D. Xu, J. C. Principe and H.-C. Wu, "Generalized Eigendecomposition with an On-Line Local Algorithm," IEEE Signal Processing Letters, Vol.5, No.11, pp.298-301, November, 1998.

[XuL97] L. Xu, C.-C. Cheung, H. H. Yang and S. Amari, "Independent Component Analysis by the Information-Theoretic Approach with Mixture of Densities," Proceedings of the 1997 International Conference on Neural Networks (ICNN'97), pp.1821-1826, Houston, TX, 1997.

[Yan97] H. H. Yang and S. I. Amari, "Adaptive On-Line Learning Algorithms for Blind Separation: Maximum Entropy and Minimum Mutual Information," Neural Computation, Vol.9, No.7, pp.1457-1482, October, 1997.

[Yan98] H. H. Yang, S. I. Amari and A. Cichocki, "Information-Theoretic Approach to BSS in Non-Linear Mixture," Signal Processing, Vol.64, No.3, pp.291-300, February, 1998.

[You87] P. Young, The Nature of Information, Praeger, New York, 1987.
BIOGRAPHICAL SKETCH
Dongxin Xu was born on January 26, 1963, in Jiangsu, China. He earned his bachelor's degree in electrical engineering from Xi'an Jiaotong University, China, in 1984. In 1987, he received his Master of Science degree in computer science from the Institute of Automation, Chinese Academy of Sciences, Beijing, China. After that, he did research on speech signal processing, speech recognition, pattern recognition, artificial intelligence and neural networks in the National Laboratory of Pattern Recognition in China for 7 years. Since 1995, he has been a Ph.D. student in the Department of Electrical and Computer Engineering, University of Florida. He has worked in the Computational Neuro-Engineering Laboratory on various topics in signal processing. His main research interests are adaptive systems; speech coding, enhancement and recognition; image processing; digital communication; and statistical signal processing.