DESIGN AND ANALYSIS OF GENERATIVE MODELS FOR
BRAIN MACHINE INTERFACES
By
SHALOM DARMANJIAN
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2009
© 2009 Shalom Darmanjian
This dissertation is dedicated to my family.
ACKNOWLEDGMENTS
Although I still have far to go, I would not even be 1/100th my current distance
without my adviser Dr. Principe. His passion for knowledge inspires me to constantly
improve and learn. I have grown as a student, researcher, and person because of him.
These small words cannot express the large debt of gratitude I owe him.
Thank you to my committee members for their guidance and patience: Dr. Harris,
Dr. Rangarajan, Dr. Sanchez. I especially thank Dr. Slatton for his time under the
circumstances. I truly wish you and your family well.
I am also grateful for the environment Dr. Principe has fostered in CNEL. The
CNEL students past and present have provided great opportunities for discussions,
laughter and growth. Although there are many students in CNEL that have impacted
me, Jeremy Anderson has been there since undergrad taking the ride with me (ups and
downs). I also appreciate the four musketeers along the BMI ride with me: Dr. Antonio
Paiva, Dr. Aysegul Gunduz, Dr. Yiwen Wang and Dr. Jack DiGiovanna. Thank you all
for the helpful discussions and collaboration through the years. Thanks to the new batch
of CNEL students for their discussions and laughter, Sohan Seth, Alex Singh, Erion
Hasanbelliu, Luis Giraldo, Memming Park. Thank you to Julie for keeping CNEL running
smoothly. Thank you also to Marcus (even if he is a republican). Thank you to Shannon
for years of help and advice with the graduate department. A special thanks to a
longtime enemy Giovanni "eleven cents!" Montrone for years of constructive pessimism
and encouraging words. He has been there since the beginning and hopefully till the
end.
The years during my PhD were also brightened by some special ladies, Sarah,
Melissa, and Grisel. Whether opening my eyes to vegetarian dishes, tattoos or Las
Vegas Casinos, I appreciate the time we spent together. You helped to lighten my stress
and expose me to different worlds. Thank you.
Finally, to my sister and nephew. Your love, support and sacrifice kept me going. I
am truly indebted and will always be there for you. You mean everything to me and I’m
very happy to call you my family.
TABLE OF CONTENTS
page
ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
CHAPTER
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.1 Overview of BMIs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.2 Experimental Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.2.1 Monkey Food Grasping Task . . . . . . . . . . . . . . . . . . . . . 17
1.2.2 Monkey Cursor Control . . . . . . . . . . . . . . . . . . . . . . . . 18
1.2.3 Rat Lever Experiments . . . . . . . . . . . . . . . . . . . . . . . . 19
1.3 Review of Modeling Paradigms for BMIs . . . . . . . . . . . . . . . . . . 20
1.4 Dissertation Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2 GENERATIVE MODELS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.3 Background on Graphical Models . . . . . . . . . . . . . . . . . . . . . . 28
2.3.1 Hidden Markov Models . . . . . . . . . . . . . . . . . . . . . . . . 30
2.3.2 Moving Beyond Simple HMMs . . . . . . . . . . . . . . . . . . . . 32
3 BRAIN MACHINE INTERFACE MODELING: THEORETICAL . . . . . . . . . . 35
3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2 Independently Coupled HMMs . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3 Boosted Mixtures of HMMs . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.1 Boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.2 Modeling Framework . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4 Linked Mixtures of HMMs . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.4.1 Modeling Framework . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.4.2 Training with Expectation Maximization . . . . . . . . . . . . . . . 48
3.4.3 Updating Variational Parameter . . . . . . . . . . . . . . . . . . . 51
3.5 Dependently Coupled HMMs . . . . . . . . . . . . . . . . . . . . . . . . 52
3.5.1 Modeling Framework . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.5.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4 BRAIN MACHINE INTERFACE MODELING: RESULTS . . . . . . . . . . . . . 57
4.1 Monkey Food Grasping Task . . . . . . . . . . . . . . . . . . . . . . . . 59
4.1.1 Boosted and Linked Mixture of HMMs . . . . . . . . . . . . . . . . 59
4.1.2 Dependently Coupled HMMs . . . . . . . . . . . . . . . . . . . . . 64
4.2 Rat Single Lever Task . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.3 Monkey Cursor Control Task . . . . . . . . . . . . . . . . . . . . . . . . 72
4.3.1 Population Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.3.1.1 A-priori class labeling based on population vectors . . . . 74
4.3.1.2 Simple naive classifiers . . . . . . . . . . . . . . . . . . . 75
4.3.2 Results for the Cursor Control Monkey Experiment . . . . . . . . . 76
5 GENERATIVE CLUSTERING . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.1 Generative Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.2 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.2.1 Simulated Data Generation . . . . . . . . . . . . . . . . . . . . . . 86
5.2.2 Independent Neural Simulation Results . . . . . . . . . . . . . . . 90
5.2.3 Dependent Neural Simulation Results . . . . . . . . . . . . . . . . 111
5.3 Experimental Animal Data . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.3.1 Rat Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.3.2 Monkey Experiments . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.3.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
6 CONCLUSION AND FUTURE WORK . . . . . . . . . . . . . . . . . . . . . . . 131
6.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
6.1.1 Towards Clustering Model Structures . . . . . . . . . . . . . . . . 135
6.1.2 Preliminary Results . . . . . . . . . . . . . . . . . . . . . . . . . . 137
6.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
A WIENER FILTER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
B PARTIAL DERIVATIVES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
C SELF ORGANIZING MAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
BIOGRAPHICAL SKETCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
LIST OF TABLES
Table page
3-1 Classification performance of example single-channel HMM chains . . . . . . . 36
4-1 Classification results (BM-HMM selected channels) . . . . . . . . . . . . . . . . 59
4-2 Classification results (LM-HMM selected channels) . . . . . . . . . . . . . . . . 59
4-3 Classification results (random BM-HMM selected channels) . . . . . . . . . . . 62
4-4 Classification results (random LM-HMM selected channels) . . . . . . . . . . . 62
4-5 Correlation coefficient using DC-HMM on 3D monkey data . . . . . . . . . . . . 65
4-6 NMSE on 3D monkey data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4-7 Classification results (BM-HMM selected channels) . . . . . . . . . . . . . . . . 69
4-8 Classification results (LM-HMM selected channels) . . . . . . . . . . . . . . . . 69
4-9 Classification results (random BM-HMM selected channels) . . . . . . . . . . . 71
4-10 Classification results (random LM-HMM selected channels) . . . . . . . . . . . 71
4-11 Correlation coefficient using different BMI models . . . . . . . . . . . . . . . . . 78
4-12 Correlation coefficient using DC-HMM on 2D monkey data . . . . . . . . . . . . 80
5-1 Correlation coefficient using LM-HMM on 2D monkey data . . . . . . . . . . . . 123
5-2 Correlation coefficient using DC-HMM on 3D monkey data . . . . . . . . . . . . 124
5-3 Correlation coefficient using DC-HMM on cursor control data . . . . . . . . . . 126
6-1 Correlation coefficient using DC-HMM on 3D monkey data . . . . . . . . . . . . 139
6-2 Correlation coefficient using DC-HMM on 2D monkey data . . . . . . . . . . . . 142
LIST OF FIGURES
Figure page
1-1 BMI overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1-2 Discretized example of continuous trajectory . . . . . . . . . . . . . . . . . . 19
2-1 Various HMM structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3-1 Probabilistic ratios of 14 neurons . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3-2 Zoomed in version of the probabilistic ratios . . . . . . . . . . . . . . . . . . . . 37
3-3 IC-HMM graphical model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3-4 LM-HMM graphical model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3-5 DC-HMM trellis structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4-1 Multiple model methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4-2 Correlation coefficients between channels (monkey moving) . . . . . . . . . . . 60
4-3 Correlation coefficients between channels (monkey at rest) . . . . . . . . . . . 61
4-4 Correlation coefficient between channels (randomly selected for monkey) . . . 62
4-5 Monkey expert adding experiment . . . . . . . . . . . . . . . . . . . . . . . . . 63
4-6 Parallel peri-event histogram for monkey neural data . . . . . . . . . . . . . . . 64
4-7 Supervised monkey food grasping task reconstruction (position) . . . . . . . . 66
4-8 3D monkey food grasping true trajectory . . . . . . . . . . . . . . . . . . . . . . 67
4-9 Hidden state space transitions between neural channels (for move and rest) . . 67
4-10 Coupling coefficient between neural channels (3D monkey experiment) . . . . 68
4-11 Parallel peri-event histogram for rat neural data . . . . . . . . . . . . . . . . . . 70
4-12 Rat expert adding experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4-13 Neural tuning depth of four simulated neurons . . . . . . . . . . . . . . . . . . . 73
4-14 Histogram of 30 angular velocity bins . . . . . . . . . . . . . . . . . . . . . . . . 74
4-15 2D angular velocities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4-16 A. Parallel tuning curves B. Winning neurons for particular angles . . . . . . . . 76
4-17 Histogram of 10 angular bins . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4-18 A. Parallel tuning curves B. Winning neurons for particular angles . . . . . . . 77
4-19 True trajectory and reconstructed trajectory (DC-HMM) . . . . . . . . . . . . . 81
4-20 Hidden state transitions per class (cursor control monkey experiment) . . . . . 81
4-21 Coupling coefficient between neurons per class (cursor control monkey experiment) 82
5-1 Bipartite graph of exemplars (x) and models . . . . . . . . . . . . . . . . . . . . 85
5-2 Neural tuning depth of four simulated neurons . . . . . . . . . . . . . . . . . . . 89
5-3 LM-HMM cluster iterations (two classes, k=2) . . . . . . . . . . . . . . . . . . . 90
5-4 Tuning preference for two classes (initialized) . . . . . . . . . . . . . . . . . . . 91
5-5 Tuned classes after clustering (two classes) . . . . . . . . . . . . . . . . . . . . 92
5-6 LM-HMM cluster iterations (four classes, k=4) . . . . . . . . . . . . . . . . . . . 93
5-7 Tuning preference for four classes (initialized) . . . . . . . . . . . . . . . . . . . 94
5-8 Tuned classes after clustering (four classes) . . . . . . . . . . . . . . . . . . . . 95
5-9 LM-HMM cluster iterations (two classes, k=4) . . . . . . . . . . . . . . . . . . . 96
5-10 Classification degradation with increased random firings . . . . . . . . . . . . . 96
5-11 Neural tuning depth with high random firing rate . . . . . . . . . . . . . . . . . 97
5-12 Surrogate data set destroying spatial information . . . . . . . . . . . . . . . . . 98
5-13 Tuned preference after clustering (spatial surrogate) . . . . . . . . . . . . . . . 98
5-14 Surrogate data set destroying temporal information . . . . . . . . . . . . . . . . 99
5-15 Tuned preference after clustering (temporal surrogate) . . . . . . . . . . . . . . 99
5-16 DC-HMM clustering results (class=2, K=2) . . . . . . . . . . . . . . . . . . . . . 100
5-17 DC-HMM clustering hidden state transitions (class=2, K=2) . . . . . . . . . . . 101
5-18 DC-HMM clustering coupling coefficient (class=2, K=2) . . . . . . . . . . . . . 101
5-19 DC-HMM clustering log-likelihood reduction during each round (class=2, K=2) 102
5-20 DC-HMM clustering simulated neurons (class=4, K=4) . . . . . . . . . . . . . . 103
5-21 DC-HMM clustering hidden state space transitions between neurons . . . . . . 104
5-22 DC-HMM clustering coupling coefficient between neurons (per class) . . . . . 105
5-23 SOM clustering on independent neural data (2 classes) . . . . . . . . . . . . . 107
5-24 SOM clustering on independent neural data with noise (2 classes) . . . . . . . 108
5-25 SOM clustering on independent neural data, spatial surrogate (2 classes) . . . 108
5-26 Neural selection by SOM on spatial surrogate data (2 classes) . . . . . . . . . 109
5-27 SOM clustering on independent neural data, temporal surrogate (2 classes) . . 110
5-28 Output from four simulated dependent neurons with 100 noise channels (class=2) 111
5-29 Neural tuning for dependent neuron simulation . . . . . . . . . . . . . . . . . . 112
5-30 LM-HMM clustering simulated dependent neurons (class=2, K=2) . . . . . . . 113
5-31 DC-HMM clustering simulated dependent neurons (class=2, K=2) . . . . . . . 113
5-32 SOM clustering on dependent neural data (2 classes) . . . . . . . . . . . . . . 114
5-33 Rat clustering experiment, one lever, two classes . . . . . . . . . . . . . . . . . 116
5-34 Rat clustering experiment zoomed, one lever, two classes . . . . . . . . . . . . 117
5-35 Rat clustering experiment, two lever, two classes . . . . . . . . . . . . . . . . . 117
5-36 Rat clustering experiment, two lever, three classes . . . . . . . . . . . . . . . . 118
5-37 Rat clustering experiment, two lever, four classes . . . . . . . . . . . . . . . . . 119
5-38 LM-HMM cluster iterations (Ivy 2D dataset, k=4) . . . . . . . . . . . . . . . . . 121
5-39 Reconstruction using unsupervised LM-HMM clusters (blue) vs. real trajectory (red) 122
5-40 DC-HMM clustering on monkey food grasping task (2 classes) . . . . . . . . . 125
5-41 Coupling coefficient from DC-HMM clustering on monkey food grasping task (2 classes) 125
5-42 Coupling coefficient from DC-HMM clustering on monkey cursor control task (4 classes) 127
5-43 Average firing rate per class (4 classes, 6 neurons) . . . . . . . . . . . . . . . 127
5-44 Average velocity per class (4 classes, 6 neurons) . . . . . . . . . . . . . . . . 128
6-1 Bipartite graph of exemplars (x) and models . . . . . . . . . . . . . . . . . . . . 134
6-2 Hidden state transitions DC-HMM (simulation data, 2 classes) . . . . . . . . . 138
6-3 Hidden state transitions DC-HMM (simulation data, 2 classes) . . . . . . . . . 139
6-4 Histogram of state models for the DC-HMM (food grasping task) . . . . . . . . 140
6-5 State models for the DC-HMM (food grasping task) . . . . . . . . . . . . . . . 141
6-6 Alphas computed per state per channel DC-HMM (food grasping task) . . . . 141
6-7 Alphas across state-space of the DC-HMM (cursor control task) . . . . . . . . 142
A-1 Topology of the linear filter for three output variables . . . . . . . . . . . . . . . 147
C-1 Self-Organizing-Map architecture with 2D output . . . . . . . . . . . . . . . . . 151
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
DESIGN AND ANALYSIS OF GENERATIVE MODELS FOR
BRAIN MACHINE INTERFACES
By
Shalom Darmanjian
December 2009
Chair: Jose Principe
Major: Electrical and Computer Engineering
Brain machine interfaces (BMIs) have the potential to restore movement to patients
experiencing paralysis. Although great progress has been made towards BMIs, there
is still much work to be done. This dissertation addresses some of the problems
associated with the signal-processing side of BMIs.

Since neural communication within the brain is still not fully understood, probabilistic
modeling is argued to be the best approach for BMIs. Specifically, generative models
with hidden variables are proposed to help model the multiple interacting processes
(both hidden and observable). Some of the advantages of the generative models over
conventional BMI signal-processing algorithms are also confirmed, including the
modeling of inhibited neurons and the ability to partition the neural input space. The
partitioning of the neural input space is based on the hypothesis that animals transition
between neural state structures during goal seeking. These neural structures are
analogous to the motion primitives, or 'movemes', exhibited in the kinematics. This
leads to a paradigm shift similar to a divide-and-conquer methodology, but with
generative models. The generative models are also used to cluster the neural input
space, which is appropriate since desired kinematic data is not available from paralyzed
patients. Most BMI algorithms ignore this very important point.
The results are justified by the improvement in trajectory reconstruction.
Specifically, the correlation coefficient of the trajectory reconstruction serves as a
metric for comparison against other BMI methods. Additionally, simulations are used to
show the models' ability to cluster unknown data with underlying dependencies; this is
necessary since there is no ground truth in real neural data.
CHAPTER 1
INTRODUCTION
Humans learn motor control by physically interacting with the external world. After
learning this control, simple physical tasks are often taken for granted in everyday
life. Drinking a cup of coffee or eating follows automatically from our desires without
consciously planning these simple physical tasks. Additionally, this unique ability to
translate desires into physical movements underlies tool utilization, which separates our
species from other animals. One theory even postulates that without the ability to touch
for socialization, our species would not be on the current evolutionary path [1, 2].
During even simple movements, the human central nervous system must translate
the generated firings of millions of neurons while also communicating with the peripheral
nervous system [3]. This complex biological system also continuously manages
chemical and electrical information from the different cortices and paleocortex structures
to control unconscious and conscious actions taken with the body [3]. The brain must
handle all of these actions while continuously processing visual, tactile and other internal
sensory feedback [3, 4].
Unfortunately, thousands of people have suffered tragic accidents or debilitating
diseases that have either partially or fully removed their ability to interact effectively with
the external world [5]. Some devices exist to aid these patients, but they often fall short
of what is needed to live a normal life. Essentially, the idea behind motor Brain Machine
Interfaces (BMIs) is to bridge the gap between the brain and the external world to
provide these patients with effective interaction with the world.
1.1 Overview of BMIs
A BMI is a system that directly retrieves neuronal firing patterns from dozens to
hundreds of neurons in the brain and then translates this information into desired actions
in the external world. The level of invasiveness in acquiring these firing patterns is
directly related to the level of resolution provided by the recording methodology. An
Figure 1-1. BMI overview
electroencephalographic (EEG) system is one method that coarsely records from multiple
neurons through the scalp (non-invasive), while electrocorticographic (ECoG) grids and
microelectrode arrays, although invasive, provide a finer resolution of individual neurons
(or single units). In turn, this finer resolution has allowed for significant advances in the
field of BMIs to take place recently [6, 7]. Consequently, Microelectrode Array data is the
only type of data used throughout this dissertation.
In order to acquire Microelectrode Array Data, multiple electrode grid arrays
are chronically implanted in one or more cortices [8] and record analog voltages
from multiple neurons near each individual electrode. The neural signals (i.e. analog
voltages) then progress through three processing steps (in a typical BMI). First, the
amplified analog voltages recorded from one or more neurons are digitally converted
and passed to a spike detecting/sorting algorithm. A spike detecting and sorting
algorithm essentially identifies if a particular neuron exhibited a firing/voltage pattern
on a corresponding electrode (thereby classifying it as a spike). During the second BMI
processing step, the identified discrete spikes are processed by a signal processing
algorithm [8]. Finally, subsequent trajectory/lever calculations are sent to a robot arm or
display device. All of these processing steps occur as an animal engages in a behavioral
experiment (lever press, food grasping, finger tracing, or joystick control).
This dissertation focuses on the signal processing of the neural spikes (the second
processing step described above). Early approaches to modeling BMIs used simple
population vectors, Wiener filters and artificial neural networks [8, 9] between the neural
data and arm trajectory to learn a functional relationship during training. During testing,
only the neural data is used to reconstruct the trajectory the animal intended.
Most of these modeling approaches use binned spikes, commonly referred to as rate
coding, as the model input. The neurons are usually assumed to be independent for
modeling purposes, and the encoding methodology is often not even considered [8].
Please see [10, 11] to gain a more detailed understanding
of what encompasses a BMI.
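As a concrete sketch of this rate-coding front end, spike timestamps can be binned into firing counts per channel. The function below is illustrative (not from this dissertation), using the 100ms bin width adopted throughout:

```python
import numpy as np

def bin_spikes(spike_times, n_channels, duration_s, bin_ms=100):
    """Convert per-channel spike timestamps (in seconds) into a
    time-bins x channels matrix of firing counts (rate coding)."""
    n_bins = int(duration_s * 1000 / bin_ms)
    counts = np.zeros((n_bins, n_channels), dtype=int)
    for ch, times in enumerate(spike_times):
        # map each spike time to its 100ms bin index
        idx = (np.asarray(times) * 1000 // bin_ms).astype(int)
        idx = idx[idx < n_bins]
        np.add.at(counts, (idx, ch), 1)
    return counts

# two channels, 1 second of data, 100ms bins
spikes = [[0.01, 0.02, 0.45], [0.30, 0.31, 0.32, 0.95]]
X = bin_spikes(spikes, n_channels=2, duration_s=1.0)
print(X.shape)   # (10, 2)
```

The resulting matrix has the same time-bins x channels shape as the data sets described in Section 1.2 (e.g., 23000x104 for the food grasping task).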
1.2 Experimental Data
This section discusses the type of neural data and kinematic data that will be used
throughout the dissertation. Consequently, the properties of the data help to determine
the appropriate type of models to use.
1.2.1 Monkey Food Grasping Task
For this experiment, an owl monkey uses the right hand to grasp food from four
locations on a tray and then brings the food to its mouth. The recorded neural data is a
sparse time series of discrete firing counts that were extracted from the dorsal premotor
cortex (PMd), primary motor cortex (MI), and posterior parietal cortex (PP) [4, 12]. Each
firing count represents the number of neural firings in a 100ms span of time, which
is consistent with methods used within the neurological community [6, 13, 14]. The
corresponding kinematic data (i.e. hand location) is down-sampled to 10 Hz to match
the 100ms neural spike bins. This particular monkey data set contains
104 neural channels recorded for 38.33 minutes. This recording corresponds to a
dataset of 23000x104 time bins.
Within the neurological community, the question of whether the motor cortex
encodes the arm’s velocity, position, or other kinematic encodings (joint angle, muscle
activity, muscle synergy, etc), continues to be debated [6, 8, 13]. With this particular
experimental data, the monkey’s arm is motionless in space for a brief amount of time
while reaching for food or placing food in its mouth. During this time of "active holding"
it is unknown if the brain is encoding information to contract the muscles in the holding
position.
Our work as well as other research has shown this active holding encoding is likely
[15]. Due to this belief, the active holding is included as part of the movement class
for the classifiers in this dissertation [16]. An example of this type of data is shown in
Figure 1-2, along with the superimposed gray-scale colors representing the monkey’s
arm movement on the three Cartesian coordinate axes (x, y, and z). Note that the
movement and rest classes are labeled by hand from the 10Hz (100ms) trajectory data.
1.2.2 Monkey Cursor Control
During this experiment, an adult Macaca mulatta monkey performs a manipulandum
behavioral task (cursor control) [4, 8]. The monkey used a hand-held manipulandum
(joystick) to move the cursor (smaller circle) so that it intersects the target. Upon
intersecting the target with the cursor, the monkey received a juice reward. While the
monkey performed the motor task, the hand position and velocity for each coordinate
direction (X and Y ) were recorded in real time along with the corresponding neural
activity. Using microwire electrode arrays chronically implanted in the dorsal premotor
cortex (PMd), supplementary motor area (SMA), primary motor cortex (M1, both
hemispheres) and primary somatosensory cortex (S1), the firing times of up to 185 cells
were simultaneously collected.
Figure 1-2. Discretized example of continuous trajectory
For the monkey neural data, each firing count again represents the number of
neural firings in a 100ms span of time. This particular monkey data set contains
185 neural channels recorded for 43.33 minutes. The subsequent time recording
corresponds to a dataset of 26000x185 time bins.
1.2.3 Rat Lever Experiments
There are two rat data sets used for the rat lever experiments. For the "single-lever"
data set, thirty-two microwire electrodes were implanted unilaterally in the forelimb
region of the primary motor cortex of a male Sprague-Dawley rat [17]. The task requires
the rat to press a single lever for a minimum of 0.5s to achieve a water reward once an LED
visual stimulus is observed. Essentially this go-no-go experiment includes neural data
along with lever presses. This particular data set contains 16 neurons yielding 13000x16
time bins, with each time bin being the spike-sorted count per 100ms for a particular
neuron (sixteen in this data set).
For the "two-lever" rat data set, two 16-microelectrode arrays were implanted in the
forelimb regions of each hemisphere [18]. The task requires the rat to press one of two levers
for a minimum of 0.5s to achieve a water reward (also after an LED visual stimulus).
Essentially this go-no-go experiment also provides neural data along with lever presses.
This particular data set contains 42 neurons yielding 19000x42 time bins (100ms each).
For both rat experiments, the rat is free to move around the cage, and this movement is
not measured.
1.3 Review of Modeling Paradigms for BMIs
The modeling approaches to BMIs divide into three categories: supervised,
co-adaptive, and unsupervised, with the majority of BMI modeling algorithms being
supervised. Additionally, the supervised algorithms further split into linear
modeling, non-linear modeling, and state-space (or generative) modeling, with the
majority of BMI algorithms falling under supervised linear modeling.
Supervised linear modeling is traceable to the 1980s and 1990s, when neural action
potential recordings were taking place with multi-electrode arrays [6, 8, 9]. Essentially,
the action potentials (called spikes for short), collected with micro-electrode arrays,
are sorted by neuron, counted in time windows (called bins), and fed into a linear
model (Wiener filter) with a predefined tap delay line depending on the experiment
and experimenter [19, 20]. During training, the linear model has access to the desired
kinematic data, as a desired response, along with the neural data recording. Once
a functional mapping is learned between the neural data and kinematic training set,
for testing the linear model is only provided neural data in which the kinematic data is
reconstructed [19].
The Wiener filter is exploited in two ways in this dissertation. First, it serves as a
baseline linear classifier (with a threshold) against which the discussed models are compared.
Second, Wiener filters are used to reconstruct the trajectories from neural data switched
by the generative models. Since the Wiener filter is critical to this dissertation and
is at the core of many BMI systems today, its details are presented in Appendix A.
Along with the supervised linear modeling, researchers have also engaged in
non-linear supervised learning algorithms for BMIs [8, 20]. Very similar to the paradigm
of the linear modeling, the neural data and kinematic data are fed to a non-linear model
that subsequently finds the relationship between the desired kinematic data and the
neural data. During testing, only neural data is provided to the models in order to produce
kinematic reconstructions.
With respect to state-space models, Kalman filters have been used to reconstruct
the trajectory of a behaving monkey’s hand [21]. Specifically, a generative model is
used for encoding the kinematic state of the hand. For decoding, the algorithm predicts
the state estimates of the hand and then updates this estimate with new neural data
to produce an a posteriori state estimate. Our group at UF and others found that
the reconstruction was slightly smoother than what input-output models were able to
produce [21].
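The predict/update cycle described above can be sketched as a standard Kalman recursion; the matrices, dimensions, and toy tracking data below are illustrative and are not the encoding model of [21]:

```python
import numpy as np

def kalman_decode(Y, A, H, Q, R, x0, P0):
    """One forward pass of Kalman predict/update: A drives the
    kinematic state, H maps the state to the expected observation."""
    x, P = x0, P0
    states = []
    for y in Y:
        # predict the kinematic state forward in time
        x = A @ x
        P = A @ P @ A.T + Q
        # update the prediction with the new observation
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ (y - H @ x)
        P = (np.eye(len(x)) - K @ H) @ P
        states.append(x.copy())
    return np.array(states)

# toy position-velocity tracking problem (dimensions are illustrative)
A = np.array([[1.0, 0.1], [0.0, 1.0]])   # constant-velocity dynamics
H = np.array([[1.0, 0.0]])               # observe a position-like signal
Q, R = 0.01 * np.eye(2), np.array([[0.1]])
truth = np.cumsum(0.1 * np.ones(100))
Y = truth[:, None] + 0.1 * np.random.default_rng(1).normal(size=(100, 1))
est = kalman_decode(Y, A, H, Q, R, x0=np.zeros(2), P0=np.eye(2))
print(est.shape)   # (100, 2)
```

The recursive smoothing of the posterior is what produces the slightly smoother reconstructions noted above.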
Unfortunately, there are problems with all of the aforementioned models. First,
training with desired data is problematic since paralyzed patients are not able to provide
kinematic data. Second, these models do not capture neurons that fire infrequently
during movement. These neurons are known to exist in the brain and can provide useful
information. But the information is lost with feed-forward filters since a neuron that fires
very little will receive less weighting. Third, there are millions of other neurons not being
recorded or modeled. Incorporating this missing information into the model would be
beneficial. Lastly, all of these models must generalize over a wide range of movements.
Normally, generalization is good for a model of the same task, but generalization across
tasks is problematic. For example, a data set consisting of 3D food grasping movements
and circular 2D movements would require the model to generalize over both kinematic
sets (producing poorer results).
The lack of a desired signal therefore necessitates unsupervised or
co-adaptive solutions. Recent co-adaptive solutions have relied on the test subject to
train their own brain for goal oriented tasks. One researcher even uses technicians to
supply an artificial desired signal as the input-output models learn a functional mapping
[22]. DiGiovanna et al. used reinforcement learning to co-adapt their model and the rat's
behavior for goal oriented tasks [23].
With respect to unsupervised learning for BMIs, Si et al. use PCA as a feature
extraction preprocessing for SVM and Bayesian classifiers [24]. Their main goal is to
classify actions which move the rat (on a mechanical cart) towards a reward location.
Other research demonstrated that neural state structures exist as animal subjects
engage in movement [25]. This research also demonstrated that by finding partitions
in the neural input space that correspond to specific motor states (hand moving or
at rest), trajectory reconstruction is improved when the models are constructed with
homogeneous data from only one partition [26]. Specifically, a ’switching’ generative
model partitions the animal’s neural firings and corresponding arm movements into
different motion primitives [25]. This improvement is primarily due to the ability of
the generative models to distinguish between neural states, thereby allowing
the continuous filters to specialize in a particular part of the input space rather than
generalize over the full space. The work also demonstrated that modeling is improved
when spatial dependencies are exploited between neural channels [27, 28].
Other generative-model work on BMIs (including graphical models like HMMs) [29–31]
exploits hidden state variables to decode the states occupied or transitioned by the behaving
animal. Specifically, Shenoy et al. found that HMMs can provide representations of
movement preparation and execution in the hidden state sequences [31]. Unfortunately,
this work also requires supervision, or training data that must first be divided
by human intervention (i.e., a user). Therefore, to move beyond supervised data, which is
not acquirable from paraplegics, a clustering algorithm is necessary.
Although there is little BMI research into clustering neural data with graphical
models, there have been efforts to use HMMs for clustering. The use of HMMs for
clustering appears to have first been mentioned by Juang and Rabiner [32] and
subsequently used in the context of discovering subfamilies of protein sequences by
Krogh et al. [33]. Other clustering work focused on single HMM chains for understanding
the transition matrices [34].
The work described in this dissertation moves significantly beyond prior work in two
ways. First, the clustering model finds unsupervised hierarchical dependencies between
HMM chains (one per neuron) while also clustering the neural data. Essentially, the algorithm
jointly refines the model parameters and structures as the clustering iterations occur.
The hope is that the clustering methodology will serve as a front end for a goal-oriented
BMI (with the clusters representing a specific goal, like 'forward') or for a co-adaptive
algorithm that needs reliable clustering of the neural input. Second, the paradigm is
changed to encompass a multiple-model approach to improve performance.
1.4 Dissertation Objectives
The dissertation focuses on finding neural state structures, also called neural
assemblies, which exist as humans/animals engage in movement. Although there has
been work to decompose kinematics into elemental components like motion primitives or
’movemes’, there has been little work on finding the reciprocal sub-structures in neural
data [13, 35]. Since there are no known ground-truths for these types of structures
or definitive partitions, secondary evidence must be provided. Specifically, we argue
that the best way to show that our methodologies discover these beneficial
structures is by how much trajectory prediction improves. Trajectory improvement
is accomplished by first using our models like a ’switch’ to partition the primate’s neural
firings and corresponding arm movements into different primitives. Similar to a divide
and conquer strategy, by switching or delegating these isolated neural/trajectory data
to different local linear models, prediction of final kinematic trajectories is markedly
improved. The models will also address the broad spectrum of neural dependencies,
from independent to implicit and explicit dependencies. After establishing the benefit of
partitioning the input with supervised generative models, two unsupervised clustering
strategies are presented. The first methodology clusters the data samples using the
likelihood as a distance metric and refines the model parameters as if they are centroids
similar to k-means. The second methodology clusters the actual parameters of the
models for the neural state structures in order to refine the parameters. The hypothesis
is that by observing the neural state evolution, neurophysiologic understanding can be
gained about the neural interactions. Simulated neural data will also be used on the
models in order to control and observe different aspects of the clustering results.
Chapter 2 provides background on graphical models since they form the core
framework of the dissertation. The chapter will provide details of related work on these
generative probabilistic models. The chapter will also cover in detail Hidden Markov
Models (HMMs) since they are one particular graphical model used for modeling in this
dissertation.
Chapter 3 discusses the hierarchical decomposition of neural data and outlines some
models that capture its temporal and spatial structure. In particular, the chapter details the
independently coupled Hidden Markov Model (IC-HMM) and how it models neural data.
The chapter also discusses the Boosted Mixture of HMMs (BM-HMM) and the
implicit hierarchical relationship between neural channels that this model forms.
Then the Linked Mixture of HMMs (LM-HMM) is discussed, along with how this model explicitly develops
dependencies between neural channels. Finally, the chapter ends with a presentation
of the Dependently Coupled HMMs (DC-HMM) model. This model explicitly models
dependencies through time across the channels.
Chapter 4 will present the results from the broad spectrum of models discussed in
Chapter 3, using real neural data from the three animal experiments (discussed earlier).
The results are compared against other models by using the correlation coefficient and
classification performance.
Chapter 5 covers a model based clustering methodology using the LM-HMM and
DC-HMM models. Real neural data as well as simulated neural data will be used to
test the clustering methodology. Chapter 6 presents the conclusion and includes a
discussion on using the neural state transitions as features for another unsupervised
clustering methodology. Future work will be postulated as well as preliminary results
from this neurophysiologic clustering perspective.
CHAPTER 2
GENERATIVE MODELS
2.1 Motivation
As discussed in Chapter 1, many BMI researchers use linear and non-linear
models to find a mapping between desired kinematic data (exhibited by a behaving
animal) and neural data. Unfortunately, the reality of this type of experiment is that the
mapping of a single individual will not correspond to the mapping of another individual
[3]. Additionally, the patients that are involved in BMI research are paralyzed, which
prevents kinematic data from being recorded during the training of a model. In order
to provide paralyzed patients a direct ability to interact with the external world, the BMI
solution will need to extract as much information as possible from the neural input space.
Unfortunately, the neural input space presents many problems. First, only a
few neurons are being sampled from millions. Second, information about the physical
connectivity among these sampled neurons is not available with current technology
[6, 8]. Although some histological studies can be done to stain for the type of neurons
acquired, they do not provide information about their inter-communication (and animals
are euthanized for these studies) [6]. Third, the sample of neurons acquired in one
experiment will not be the same neurons nor represent the same motor functions in
another experiment with different patients [3]. Additionally, some neurons in the sample
may not even contribute to the task being modeled.
Given this absence of information, a probabilistic approach is the best way
to model what is observable from the brain. Modeling the unknown hidden neural
information is accomplished with observable and hidden random processes that are
interacting with each other. Specifically, we make the assumption that each neuron’s
output is an observable random process that is affected by hidden information. Since
the experiment does not provide detailed biological information about the interactions
between the sampled neurons, hidden variables are used to model these hidden
interactions [3, 6]. We further assume that the compositional representation of the
interacting processes occurs through space and time (i.e. between different neurons
at different times). Graphical models are the best way to model and observe this
interaction between variables in space and time [36]. Another benefit of a state-space
generative model over traditional filters is that neurons that fire less during certain
movements can be modeled simply as another state rather than receiving a low filter weight.
Neuroscientists often treat the multiple channels of neural data acquired from BMI
experiments as multivariate observations from a single process [10]. This perspective
requires fully coupled statistics across all of the channels at all times irrespective of
partial independence among the multiple processes. Our group’s own work has shown
that modeling this data as a single multivariate process is not the most appropriate [16].
The models described within this dissertation also differ from other work in the BMI field
since they do not take the traditional regression approach of mapping the neural data directly
to the patient's hand kinematics with a conventional linear/non-linear model [10]. Instead,
generative models are used to divide the input space into regions that represent cell
assemblies, which are referred to throughout the dissertation as neural state structures.
These structures have been loosely touched upon in other work [19, 37]. The basic
idea is that kinematic information or neural data is decomposed into ”motion primitives”
similar to phonemes in speech processing.
In speech processing, graphical models (specifically HMMs) are the leading
technology because they are able to capture very well the piecewise non-stationarity
of speech [32, 38]. Since speech production is ultimately a motor function, graphical
models are potentially useful for motor BMIs (also non-stationary) [32, 39]. By using
smaller simpler components, more complicated arm kinematics are constructed
through the combination of these simple structures. We hypothesize that there are
decomposable structures in the neural input space that are analogous to the primitives in the
kinematic space.
The ultimate goal is to decipher these underlying structures so that the model may
one day be decoupled from the desired kinematics (for unsupervised modeling). The
goal of Chapter 4 is to determine these underlying structures through a supervised
mode in order to later exploit them in an unsupervised mode (Chapter 5).
2.2 Related Work
Graphical models have been used as probabilistic models for brain activity. In the
early 1990s, Radons et al. used HMMs to code the information contained in the neural
activity of a monkey's visual cortex during different visual stimuli [29]. Later, Gat et
al. used an HMM to discover the underlying hidden states during a go/no-go monkey
experiment. The HMM provided insight into the underlying cortical network
activity of behavioral processes [30]. Specifically, they could identify the behavioral
mode of the animal and directly identify the corresponding collective network activity
[30]. Additionally, by segmenting the data into discrete states, Radons et al. demonstrated
that there may be a dependency in the short-time correlations between neural cells.
In recent years there has been a renewed interest in using generative models for
BMIs. Other researchers use a hybrid generative model to decode trajectories [21]. In
their model, they incorporate neural states and hand states similar to other filter work
but solely in a probabilistic framework. For the continuous hand state they use the
mixture-of-trajectories model.
Although not explicitly for BMIs, graphical models have been used to model the
brain through belief propagation [40]. In particular this work has focused on the visual
cortex and detection of motion using an HMM.
2.3 Background on Graphical Models
Graphical models combine probability theory with graph theory. In the field of
machine learning, graphical models play an increasing role since they handle uncertainty
while reducing complexity. They achieve this by using simple models to build complex
systems. Probability theory ensures that the resulting systems are consistent, and
graphical models also help describe the data. Specifically, graphical models provide an
intuitive way to understand the interaction between multiple variables, as well as the
structure, so that efficient algorithms can be tailor-made. Many scientific fields with
multivariate probabilistic systems implement special cases of graphical models, such as
mixture models, factor analysis, Hidden Markov Models, and Kalman filters.
In the framework of graphical models, nodes represent random variables and
arcs (or their absence) represent assumptions of conditional independence. In turn,
this provides a compact representation of joint probability distributions. As an
example, if $N$ binary random variables represent the joint $P(X_1, \ldots, X_N)$, then
$O(2^N)$ parameters are needed, whereas a graphical model may need far fewer, depending
on the a priori assumptions. Consequently, this type of decomposition helps with
inference and learning.
There are two main kinds of graphical models, undirected and directed. Undirected
graphical models, also known as Markov random fields (MRFs), make no prior
assumptions about causal relationships. Directed graphical models, also known
as Bayesian networks, Belief networks, causal models, generative models, etc,
make assumptions about causal relationships between variables. For example, let Z
represent the set of variables (both hidden and observed) included in the probabilistic
model. A graphical model (or Bayesian network) representation provides insight
into the probability distributions over Z encoded in a graph structure [41]. With this
type of representation, edges of the graph represent direct dependencies between
variables. As stated earlier, the absence of an edge allows the assumption of conditional
independence between variables. Ultimately, these conditional independencies allow a
more complicated multivariate distribution to be decomposed (or factorized) into simple
and tractable distributions [41].
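As an illustration of how factorization shrinks the parameter count, suppose (purely for this example) that the graph is a first-order Markov chain over the $N$ binary variables, so each variable depends only on its predecessor:

```latex
P(X_1, \ldots, X_N) \;=\; P(X_1) \prod_{i=2}^{N} P(X_i \mid X_{i-1})
```

This factorization needs only $1 + 2(N-1)$ free parameters (one for $P(X_1)$ and two for each conditional table) instead of the $2^N - 1$ required by the full joint; for $N = 20$ that is 39 parameters versus 1,048,575.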
Since there are a variety of graphical model representations that decompose
the joint probability of the hidden and observed variables in Z, choosing the best
approximation can be overwhelming. For this dissertation, directed graphical models are
used since they help explain the hidden relationships between the random processes
occurring with the sampled neurons of the brain. One particular directed graphical
model that is discussed throughout this dissertation is the Hidden Markov Model (HMM).
2.3.1 Hidden Markov Models
The HMMs discussed in this dissertation are discrete-output HMMs since they
are computationally simple and less sensitive to initial parameter settings during training
[32]. Consider the graphical model in Figure 2-1E, which represents a hidden Markov
model (HMM). This type of structure decomposes the joint distribution. For a sequence
of length T, we simply ”unroll” the model for T time steps. The Markov property states
that the future is independent of the past given the present. This Markov chain is
parameterized by the triplet $\lambda = \{A, B, \pi\}$, where $A$ is the probabilistic $N \times N$
state transition matrix, $B$ is the $L \times N$ output probability matrix (with $L$ discrete output
symbols), and $\pi$ is the $N$-length initial state probability distribution vector [32, 42]. If
these parameters are treated as random variables (as in the Bayesian approach),
parameter estimation becomes equivalent to inference. If the parameters are treated as
unknown quantities, parameter estimation requires a separate learning procedure.
In order to maximize the probability of the observation sequence O, the model
parameters (A,B, π) must be estimated. Maximizing the probability is a difficult task;
first, there is no known way to analytically solve for the parameters that will maximize
the probability of the observation sequence [32]. Second, even with a finite amount
of observation sequences it is unlikely to find the global optimum for the parameters
[32]. To circumvent this issue, the Baum-Welch method can iteratively choose
$\lambda = \{A, B, \pi\}$ to locally maximize $P(O \mid \lambda)$ [42].
Specifically, the Baum-Welch method first uses the current estimate of the HMM
$\lambda = \{A, B, \pi\}$ and an observation sequence $O = \{O_1, \ldots, O_T\}$ to produce a new estimate
of the HMM given by $\bar{\lambda} = \{\bar{A}, \bar{B}, \bar{\pi}\}$, where the elements of the transition matrix $\bar{A}$ are
\[ \bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \zeta_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}, \quad i, j \in \{1, \ldots, N\}. \tag{2–1} \]
Similarly, the elements of the output probability matrix $\bar{B}$ are
\[ \bar{b}_j(k) = \frac{\sum_{t:\, O_t = v_k} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}, \quad j \in \{1, \ldots, N\},\ k \in \{1, \ldots, L\}, \tag{2–2} \]
and finally the $\bar{\pi}$ vector,
\[ \bar{\pi}_i = \gamma_1(i), \quad i \in \{1, \ldots, N\}, \tag{2–3} \]
where
\[ \zeta_t(i,j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)} \tag{2–4} \]
and
\[ \gamma_t(i) = \sum_{j=1}^{N} \zeta_t(i,j). \tag{2–5} \]
Please note, β is the backward variable, which is similar to the forward variable
α except that now the values are propagated back from the end of the observation
sequence, rather than forward from the beginning of O [42].
Specifically, the $\alpha$ quantity is recursively calculated by setting
\[ \alpha_1(j) = \pi_j\, b_j(o_1) \tag{2–6} \]
\[ \alpha_{t+1}(i) = \Big[ \sum_{j=1}^{N} \alpha_t(j)\, a_{ji} \Big] b_i(o_{t+1}). \tag{2–7} \]
The well-known backward procedure is similar:
\[ \beta_t(j) = P(O_{t+1} = o_{t+1}, \ldots, O_T = o_T \mid S_t = j, \lambda) \tag{2–8} \]
Figure 2-1. Various HMM structures
This computes the probability of the ending partial sequence $o_{t+1}, \ldots, o_T$ given a start at
state $j$ at time $t$. Recursively, $\beta_t(j)$ is defined as
\[ \beta_T(j) = 1 \tag{2–9} \]
\[ \beta_t(j) = \sum_{k=1}^{N} a_{jk}\, b_k(o_{t+1})\, \beta_{t+1}(k). \tag{2–10} \]
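To make the re-estimation concrete, the following is a minimal NumPy sketch of the forward and backward recursions (Eqs. 2–6 through 2–10) and one Baum-Welch update (Eqs. 2–1 through 2–5). It handles a single observation sequence and omits the scaling usually needed for long sequences; the function names are illustrative rather than taken from any implementation in this work.

```python
import numpy as np

def forward_backward(O, A, B, pi):
    """Forward (Eqs. 2-6, 2-7) and backward (Eqs. 2-9, 2-10) recursions.
    O: list of symbol indices; A: N x N transitions; B: L x N emissions
    (B[k, j] = b_j(v_k)); pi: length-N initial distribution."""
    T, N = len(O), len(A)
    alpha = np.zeros((T, N))
    beta = np.ones((T, N))                          # beta_T(j) = 1 (Eq. 2-9)
    alpha[0] = pi * B[O[0]]                         # Eq. 2-6
    for t in range(T - 1):
        alpha[t + 1] = (alpha[t] @ A) * B[O[t + 1]]      # Eq. 2-7
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[O[t + 1]] * beta[t + 1])        # Eq. 2-10
    return alpha, beta

def baum_welch_step(O, A, B, pi):
    """One Baum-Welch re-estimation (Eqs. 2-1 to 2-5); returns the new
    parameters and P(O | lambda) under the old ones."""
    T, L = len(O), len(B)
    alpha, beta = forward_backward(O, A, B, pi)
    PO = alpha[-1].sum()                            # P(O | lambda)
    # zeta_t(i,j) (Eq. 2-4): alpha_t(i) a_ij b_j(O_{t+1}) beta_{t+1}(j) / P(O)
    zeta = np.array([alpha[t][:, None] * A * B[O[t + 1]] * beta[t + 1]
                     for t in range(T - 1)]) / PO
    gamma = alpha * beta / PO                       # gamma_t(i) for all t
    A_new = zeta.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]   # Eq. 2-1
    obs = np.asarray(O)
    B_new = np.array([gamma[obs == k].sum(axis=0)
                      for k in range(L)]) / gamma.sum(axis=0)    # Eq. 2-2
    pi_new = gamma[0]                               # Eq. 2-3
    return A_new, B_new, pi_new, PO
```

Because this is an EM step, re-evaluating $P(O \mid \lambda)$ with the new parameters never decreases the likelihood.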
2.3.2 Moving Beyond Simple HMMs
As discussed earlier, there are a variety of HMM architectures that have been
proposed to address a specific class of problems and to overcome certain limitations
in the traditional HMM. This section will just outline a few models (shown in Figure 2-1)
that are related to finding dependencies within the hidden state space. The standard
fully-coupled HMMs (Figure 2-1A) generally refer to a group of HMM models in which
the state of one model at time $t$ depends on the states of all models (including itself) at
time $t-1$. For $C$ HMMs coupled together, the state transition probability is described as
$P(S_t^{(c)} \mid S_{t-1}^{(1)}, S_{t-1}^{(2)}, \ldots, S_{t-1}^{(C)})$ instead of $P(S_t^{(c)} \mid S_{t-1}^{(c)})$ as in a single standard HMM. In
other words, the state transition probability is described by a $(C+1)$-dimensional matrix,
and the number of free parameters for this transition probability matrix is $N^{C+1}$, which is
exponential in the number of models coupled together (assuming the number of hidden
states $N$ is the same for all models) [43]. Parameter learning is very difficult with this type
of structure. Some researchers have created variations of the fully-coupled HMMs in
order to decrease the model size and ease the complexity of inference. The coupled HMM
[44] proposed by Matthew Brand models the joint conditional dependency as the product
of all marginal conditional probabilities, i.e.,
\[ P(S_t^{(c)} \mid S_{t-1}^{(1)}, S_{t-1}^{(2)}, \ldots, S_{t-1}^{(C)}) = \prod_{c'=1}^{C} P(S_t^{(c)} \mid S_{t-1}^{(c')}) \tag{2–11} \]
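The parameter savings from Eq. 2–11 can be counted directly. The sketch below tallies raw transition-table entries (ignoring sum-to-one constraints), comparing the fully coupled transition tables against the pairwise factorization; the function name is illustrative.

```python
def transition_entries(N, C, factored):
    """Transition-table entries for C coupled chains with N states each.
    Fully coupled: each chain conditions on all C previous states, giving a
    table of N^(C+1) entries per chain. Factored (Eq. 2-11): each chain keeps
    C pairwise N x N tables instead."""
    return C * (C * N * N) if factored else C * N ** (C + 1)

# e.g. 10 chains of 3 states: 1,771,470 entries fully coupled vs. 900 factored
```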
This simplification reduces the transition probability parameter space. It has been
used for recognizing complex human actions/behaviors [44, 45]. However, Brand did not
give any assumption or condition under which this equation holds [43]. Additionally,
variation of the fully coupled HMMs approximates the EM algorithm with particle filtering.
This particular model was used to model freeway traffic [46]. Figure 2-1B shows a specific
coupled HMM called the event-coupled HMM [47]. This model is used for a class
of loosely coupled time series where only the onset of events is coupled in time. The
factorial HMM [48] (Figure 2-1C) represents the opposite of the CHMM by using multiple
hidden state chains to represent a single chain of observables. This model does not
exploit the interaction or dependency between multiple models. In Figure 2-1D,
IO-HMM is used for modeling the input-output sequence pair. Although the IO-HMM is
similar to the coupled HMM the input used in IO-HMM and hidden state from previous
33
time slice are different, certain independent assumption of inputs does not apply to the
hidden states and the inference algorithm used in [49] is only for one HMM model and
again not for general multiple coupled HMMs [43]. Chapter 3 will discuss an alternative
and more reasonable formulation of the CHMM which also reduces the parameter space
but includes parameters that capture the coupling strength between HMMs.
CHAPTER 3
BRAIN MACHINE INTERFACE MODELING: THEORETICAL
3.1 Motivation
In the initial experimentation with HMMs, we treated the neural data as multivariate
features of a single process. First, the multi-dimensional neural data was converted into
a set of discrete symbols for the discrete-output HMMs [16] using the Linde-Buzo-Gray
(LBG) VQ algorithm [50]. Then a single HMM was trained with these symbols corresponding
to neural data of a particular class (movement vs. rest for the monkey food-grasping
task). Unfortunately, this switching classifier achieved at most 87% classification
accuracy [16]. Although these results lent support to using HMMs on BMI data, since
trajectory reconstruction was improved over a single linear model, the results were only
fair and motivated further investigation.
In trying to improve upon this performance, the relevance of particular neurons to
a respective task (i.e., movement or rest) was explored in order to group different
neurons into corresponding subsets. To quantify the differentiation, we examined how
well an individual neuron classifies movement vs. rest when trained and tested on an
individual HMM chain. Since each neural channel is binned into a discrete number of
spikes per 100ms, the neural data was directly used as input.
During the evaluation of these particular HMM chains, the conditional probabilities
for rest and movement are respectively computed as $P(O^{(i)} \mid \lambda_r^{(i)})$ and $P(O^{(i)} \mid \lambda_m^{(i)})$ for
the $i$-th neural channel, where $O^{(i)}$ is the respective observation sequence of binned
firing counts and $\lambda_m^{(i)}$, $\lambda_r^{(i)}$ represent the given HMM chain parameters for the classes of
movement and rest, respectively. To give a qualitative understanding of these weak
classifiers, Figure 3-1 presents the probabilistic ratios from the 14 single-channel HMM
chains (shown between the top and bottom movement segmentations) that produced the
best classifications individually. Specifically, the figure illustrates the simple ratio
\[ \frac{P(O^{(i)} \mid \lambda_m^{(i)})}{P(O^{(i)} \mid \lambda_r^{(i)})} \tag{3–1} \]
for each neural channel in a gray scale gradient format. The darker bands represent
ratios larger than one and correspond to a higher probability for the movement class.
Lighter bands represent ratios smaller than one and correspond to a higher probability
for the rest class. The conditional probabilities nearly equal to one another show
up as gray bands, indicating that classification for the movement or rest classes is
inconclusive. As further support, Table 3-1 quantitatively shows that the individual neural
channels roughly classify the two classes of data better than chance. Essentially, if
the likelihood ratio was larger than one, the sample was labeled movement, while a ratio
less than one was labeled rest. The models were trained on a set of 8,000 samples and
classification was computed on a separate test set of 3,000 samples.
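As a sketch of how these labels and the per-class accuracies in Table 3-1 can be computed, the following assumes the per-sample log-likelihoods under the movement and rest chains have already been evaluated; the array and function names are hypothetical, not from the dissertation's code.

```python
import numpy as np

def ratio_labels(logp_m, logp_r):
    """Label by the per-channel likelihood ratio (Eq. 3-1): a ratio above one
    (log-ratio above zero) is labeled movement (1), otherwise rest (0)."""
    return (np.asarray(logp_m) - np.asarray(logp_r) > 0).astype(int)

def per_class_accuracy(pred, truth):
    """Rest and movement accuracies, as reported per neuron in Table 3-1."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    return ((pred[truth == 0] == 0).mean(), (pred[truth == 1] == 1).mean())
```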
Overall, Figure 3-1 illustrates that the single-channel HMMs roughly classify
movement and rest segments from the neural data (better than random). Specifically,
the figure shows more white bands ($P(O \mid \lambda_r) \gg P(O \mid \lambda_m)$) during rest segments and
darker bands ($P(O \mid \lambda_m) \gg P(O \mid \lambda_r)$) during movement segments. Figure 3-2 is a
zoomed-in picture of Figure 3-1. This figure reveals that some of the neural channel
classifiers tune to certain parts of the trajectory like rest-food, food-mouth, and
mouth-rest. These sub-segmentations lend further support to our hypothesis that
primitives within the data exist.
Having observed the ability of some neurons to roughly classify the two classes
while other neurons were poor, the issue now becomes how to combine this information
properly while maintaining computational simplicity and removing the need for VQ. The next
Table 3-1. Classification performance of example single-channel HMM chains
Neuron # Rest Moving
23 83.4% 75.0%
62 80.0% 75.3%
8 72.0% 64.7%
29 63.9% 82.0%
72 62.6% 82.6%
Figure 3-1. Probabilistic ratios of 14 neurons
Figure 3-2. Zoomed in version of the probabilistic ratios
sections detail the structure and training of different models that merge the classification
performance of these individual classifiers to produce an overall classification decision.
3.2 Independently Coupled HMMs
Structure and training for the ICHMM.
As discussed in Chapter 2, the CHMM models multiple channels of data without
the use of multivariate pdfs on the output variables [44]. Unfortunately, the complexity
($O(TN^{2D})$ or $O(T(DN)^2)$) and the number of parameters necessary for the CHMM (and its
variants) grow exponentially with the number of chains $D$ (i.e., neural channels), where $N$ is
the number of states and $T$ the observation length. For these particular datasets, there is not enough data
to adequately train this type of classifier (under most training procedures). Therefore,
a model must be devised that supports the underlying biological system and can
successfully use the neural data without being intractable.
The neural science community offers a solution that overcomes these shortcomings
while still supporting the underlying biological system. The literature in this area explains
that different neurons in the brain may modulate independently from other neurons [3]
during the control of movement. Specifically, during movement, different muscles may
activate for synchronized directions and velocities, yet, are controlled by independent
neural assemblies or clusters in the motor cortex [3, 14, 51].
Conversely, within the neural clusters themselves, temporal dependencies (and
co-activations) have been shown to exist [3]. Therefore, this classifier assumes
that enough neurons are sampled from different neural clusters to avoid overlap
or dependencies. This assumption is further justified by looking at the correlation
coefficients (CC) between all the neural channels in our data set.
The best CCs (0.59, 0.44, 0.42, 0.36) occurred between only four out of the 10,700
neural pairs, while the rest of the neural pairs were an order of magnitude smaller ($|CC| < 0.01$).
Additionally, despite these weak underlying dependencies, there is a long history of
making such independence assumptions in order to create models that are tractable or
computationally efficient. The Factorial Hidden Markov Model is one example amongst
many [44, 48]. Although an independence assumption is made between the neurons
to simplify the IC-HMM, other models will exploit these weak dependencies as well as
more complicated (spatial-temporal) dependencies in later sections.
By making an independence assumption between neurons, each neural channel
HMM is treated independently. Therefore, the joint probability
\[ P(O_T^{(1)}, O_T^{(2)}, \ldots, O_T^{(D)} \mid \lambda_{full}) \tag{3–2} \]
Figure 3-3. IC-HMM graphical model
becomes the product of the marginals
\[ \prod_{i=1}^{D} P(O_T^{(i)} \mid \lambda^{(i)}) \tag{3–3} \]
of the observation sequences (each of length $T$) for each $i$-th HMM chain $\lambda^{(i)}$. Since the
marginal probabilities are independently coupled, yet try to model multiple hidden
processes, this naive classifier (Figure 3-3) is named the Independently Coupled Hidden
Markov Model (ICHMM) in order to keep the nomenclature simple.
By using an ICHMM instead of a CHMM, the overall complexity reduces from
$O(TN^{2D})$ (or $O(TD^2N^2)$) to $O(DTN^2)$, given that each HMM chain has a complexity
of $O(TN^2)$. Since a single HMM chain is trained on a single neural channel, the
number of parameters is greatly reduced, which in turn reduces the amount of
training data required. Specifically, the individual HMM chains in the ICHMM contain
around 70 parameters for a training set of 10,000 samples, as opposed to the almost 18,000
parameters necessary for a comparable CHMM (due to the dependent states).
The detailed ICHMM structure is as follows:
1. Using a single neural channel $d$, the conditional probabilities $P(O_T^{(d)} \mid \lambda_r^{(d)})$
and $P(O_T^{(d)} \mid \lambda_m^{(d)})$ are evaluated, where
\[ O_T^{(d)} = \{O_{t-T+1}^{(d)}, O_{t-T+2}^{(d)}, \ldots, O_{t-1}^{(d)}, O_t^{(d)}\}, \quad T > 1 \tag{3–4} \]
and $\lambda_r^{(d)}$ and $\lambda_m^{(d)}$ denote HMM chains that represent the two states of the monkey's arm
(moving vs. rest). All of the HMM chains are previously trained on their respective
neural channels using the Baum-Welch algorithm [32, 42] described in Chapter 2.
Based on empirical testing, three hidden states and an observation sequence length $T$
of 10 are chosen for the model. With the dataset described, an observation length of 10
corresponds to one second of data (given the 100ms bins).
2. Normally, the monkey's arm is decided to be at rest if $P(O_T \mid \lambda_r) > P(O_T \mid \lambda_m)$ and
moving if $P(O_T \mid \lambda_m) > P(O_T \mid \lambda_r)$, but in order to combine the predictive powers of all the
neural channels, Equation (3–3) is used to produce the decision boundary
\[ \prod_{i=1}^{D} P(O_T^{(i)} \mid \lambda_m^{(i)}) > \prod_{i=1}^{D} P(O_T^{(i)} \mid \lambda_r^{(i)}) \tag{3–5} \]
or, more aptly,
\[ l(O) = \frac{\prod_{i=1}^{D} P(O_T^{(i)} \mid \lambda_m^{(i)})}{\prod_{i=1}^{D} P(O_T^{(i)} \mid \lambda_r^{(i)})} > \zeta \tag{3–6} \]
where l(O) is the likelihood ratio, a basic quantity in hypothesis testing [38, 40].
Essentially, ratios greater than the threshold ζ are classified as movement and those
less than $\zeta$ as rest. Thresholding the likelihood ratio is common in neural science
and other areas of research [38, 40]. Often, the log-likelihood ratio is used instead
of the likelihood ratio for the decision rule, so that a relative scaling between the
ratios can be found (as well as suppressing any irrational ratios) [38]:
\[ \log(l(O)) = \log \prod_{i=1}^{D} \frac{P(O_T^{(i)} \mid \lambda_m^{(i)})}{P(O_T^{(i)} \mid \lambda_r^{(i)})} = \sum_{i=1}^{D} \log \frac{P(O_T^{(i)} \mid \lambda_m^{(i)})}{P(O_T^{(i)} \mid \lambda_r^{(i)})} \tag{3–7} \]
By applying the log to the product of the likelihood ratios, the sum of the log-likelihood
ratios is compared against the threshold $\log \zeta$. In simple terms, this decision rule asks
how large the probability for one class is compared to the other, and whether this holds
over a majority of the single classifiers.
Note that by varying the threshold log(ζ), classification performance is tuned to fit any
particular requirements for increasing the importance of one class over another. For this
experiment equal importance is assumed for the classes (no bias for one or another).
Moreover, optimization of the classifier is now no longer a function of the individual HMM
evaluation probabilities, but rather a function of overall classification performance.
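The decision rule of Eq. 3–7 reduces to a few lines once the per-channel log-likelihoods are in hand. A minimal sketch, with illustrative names, assuming the $D$ evaluations have already been computed:

```python
import numpy as np

def ichmm_decide(logp_m, logp_r, log_zeta=0.0):
    """Eq. 3-7: sum the per-channel log-likelihood ratios over the D chains
    and compare the sum against the threshold log(zeta).
    logp_m[i], logp_r[i]: log P(O_T^(i) | lambda_m^(i)) and the rest analog."""
    llr = float(np.sum(np.asarray(logp_m) - np.asarray(logp_r)))
    return "movement" if llr > log_zeta else "rest"
```

Summing log-ratios also avoids the numerical underflow of multiplying many small probabilities, a practical reason for preferring the log form.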
The next section outlines methods that move away from the naive assumption of
independence between the neural channels. Specifically, implicit relationships between
neurons will first be explored and then explicit relationships.
3.3 Boosted Mixtures of HMMs
3.3.1 Boosting
Boosting is a technique that creates different training distributions from an initial
input distribution so that a set of weak classifiers is generated [52, 53]. The generated
classifiers then form an ensemble vote for the current data example. These hierarchical
combinations of classifiers are capable of achieving lower error rates than the individual
base classifiers [52, 53]. Therefore, boosting will be used to move beyond the IC-HMM
and exploit the complementary information provided by the independent HMM chains.
Adaboost is the most widely used algorithm to evolve from boosting methods [54].
This algorithm sequentially generates weak classifiers based on weighted training
examples. Essentially, the initial distribution of training examples is re-sampled each
round (based on the distribution of the weights Wi) in order to train the next classifier up
to R rounds [54]. The training examples that fail to be classified on a particular round
receive an increased weighting so that the subsequent classifiers are more likely to be
trained on these hard examples. With the initial values of the weights being W_i = 1/N, for
i = 1, ..., N samples, the update for each weight is

W_i \leftarrow \frac{W_i \exp[\alpha_r \cdot 1(y \neq f_r(x))]}{Z_r} \qquad (3-8)
where α_r is the external weight for the r-th expert trained in the r-th round; these
weights act much like priors for the respective experts. Additionally, Z_r is a
normalization factor that makes the weights W_i a distribution. Concurrently with the
weights W_i for the training examples, each α_r is updated for the final ensemble vote
(a linear combination of the α_r weights and the hypothesis of each expert):
\alpha_r = \log \frac{1 - err_r}{err_r} \qquad (3-9)

where

err_r = \sum_{i=1}^{N} W_i \, 1(y_i \neq f_r(x_i)) \qquad (3-10)
and the final ensemble output becomes

H(x) = \mathrm{sign}\left[ \sum_{r=1}^{R} \alpha_r f_r(x) \right] \qquad (3-11)
The success of boosting has been attributed to the distribution of the "margins" of
the training examples [52]. For further details on Adaboost, boosting, and their
relationship to the margin or to support vector machines, see [54].
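The Adaboost loop of Equations 3-8 through 3-11 can be sketched in a few lines. The threshold stumps and one-dimensional data below stand in for the weak classifiers and are assumptions for illustration only, not the HMM experts used in this work:

```python
import numpy as np

def adaboost(x, y, rounds=10):
    """Minimal Adaboost (Eqs. 3-8 to 3-11) with threshold stumps on 1-D data."""
    n = len(x)
    w = np.full(n, 1.0 / n)                      # initial weights W_i = 1/N
    alphas, stumps = [], []
    for _ in range(rounds):
        best = None
        for thr in x:                            # exhaustive stump search
            for pol in (1, -1):
                pred = pol * np.where(x > thr, 1, -1)
                err = np.sum(w[pred != y])       # weighted error, Eq. 3-10
                if best is None or err < best[0]:
                    best = (err, thr, pol)
        err, thr, pol = best
        err = float(np.clip(err, 1e-10, 1 - 1e-10))
        alpha = np.log((1 - err) / err)          # Eq. 3-9
        pred = pol * np.where(x > thr, 1, -1)
        w = w * np.exp(alpha * (pred != y))      # Eq. 3-8 numerator
        w /= w.sum()                             # Z_r normalization
        alphas.append(alpha)
        stumps.append((thr, pol))
    return alphas, stumps

def predict(x, alphas, stumps):
    """Ensemble vote, Eq. 3-11: H(x) = sign(sum_r alpha_r f_r(x))."""
    votes = sum(a * p * np.where(x > t, 1, -1) for a, (t, p) in zip(alphas, stumps))
    return np.sign(votes)
```

Misclassified samples receive exponentially larger weights, so later rounds concentrate on the hard examples, exactly as described above.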
With respect to improving the IC-HMM, Adaboost offers a promising way to implicitly
find dependencies among the channels by boosting classifiers during training. In the
next section, Adaboost is applied in a different manner to the multidimensional
neural data.
3.3.2 Modeling Framework
BMI data imposes specific constraints that require modifications to the standard
procedures: First, the neural data is high dimensional (100-200 channels) and the
importance of each channel is unknown. Second, the classes have markedly different
prior probabilities. Third, generating a decision from thousands of experts is
computationally infeasible for hundreds of neural channels. Our approach uses
Adaboost with a competitive component as a way to select multichannel experts that
contribute the most information in the training set. Although the algorithm starts out with
parallel training for the independent experts, gradually a winning expert is chosen for
each hierarchical level to combine later into the ensemble. Since each level is formed in
an unsupervised way, this algorithm is a hybrid between supervised and unsupervised
learning, following a process similar to the Mixture of Experts framework [55].
The first major departure from Adaboost comes from how the ensemble is
generated. Instead of forming one expert at a time, the C independent HMM chains
are trained in parallel using the Baum-Welch formulation, where C is the number of
neural channels [42]. This divides the joint likelihood into marginals so that independent
processes are working in simpler subspaces of the input [25]. Then a competitive
phase is initiated. Specifically, a ranking is performed with the experts and a winner
is chosen based on the classification performance for the current distribution of input
examples. The winner that minimizes the error with respect to the distribution of samples
is chosen. The minimal error is calculated using a Euclidean distance (also a departure
from Adaboost) in order to avoid biasing class assignments (since classes may not
have equal priors)[25]. Next, the remaining experts are trained within their respective
subspace but relative to the errors of the previous winner. Finally, the Wi are used to
select the next distribution of examples for the remaining experts. Similar to Adaboost,
the remaining experts are trained on the hard examples from different subspaces. In
turn, a hierarchical structure is formed as the winning experts affect the training on the
local subspaces for the subsequent experts. During this process, the model is implicitly
modeling the dependencies among the channels.
As explained earlier, Adaboost uses the α_c's as external weights on the classifiers, as
opposed to the W_i's, which weight the training examples. The computation of the α_c's
is the second major departure from Adaboost, since a mixture-of-experts formulation is
used for the external weights or mixture coefficients. The Boosted Mixture of Experts
(BME) framework provides the inspiration for finding the mixture coefficients for the
local classifiers [56].
With BME, improved performance is gained through the use of a confidence measure
for the individual experts [56]. Although many different confidence measures exist, the
majority use a scalar function of the expert’s output which is then used as a static gating
function or mixture coefficient [55, 56]. The algorithm uses a simple measure for each
expert based on the L2-norm of the class errors (instead of the one outlined in
Equation 3-9):

\alpha_c = 1 - \sqrt{err_M^2 + err_R^2} \qquad (3-12)
where errM and errR are the respective errors of the two classes in our problem, Move
and Rest (which could generalize to more classes). The variable βc is substituted for the
normal Adaboost formulation of αc (Equation 3–9) to update the Wi’s in Equation 3–8.
Since there is a condition during the boosting phase to discard experts with
less than 50% classification accuracy, negative alphas will not occur [54]. Notice that as the errors
between the two classes are smaller, the weights for the experts become larger. The
proposed Adaboost training algorithm is presented below.
Given: (x_1, y_1), ..., (x_N, y_N) where x_i ∈ X, y_i ∈ Y = {-1, +1}.
Initialize W_i = 1/N, i = 1, ..., N samples.
For l = 1, ..., L rounds:
• Train all of the HMMs using samples from distribution W_i (with replacement).
• Find the c-th expert that minimizes the error with respect to the distribution W_i:
  err_c = E_W[1(y \neq f_c(x))]
• Choose
  \alpha_c = 1 - \sqrt{err_M^2 + err_R^2}
  \beta_c = \log \frac{1 - err_c}{err_c}
• Update:
  W_i \leftarrow \frac{W_i \exp[\beta_c \cdot 1(y \neq f_c(x))]}{Z_l}
• The ensemble output:
  H(x) = \mathrm{sign}\left[ \sum_{c=1}^{C} \alpha_c f_c(x) \right]

where Z_l is a normalization factor so that W_i remains a distribution, \sum_i W_i = 1.
The criterion for stopping is based on two conditions. The first stopping condition
occurs if the chosen experts achieve less than 50% classification accuracy. The second
stopping condition occurs if the cross-validation set shows an increase in error or a
plateau in performance over a significant number of rounds.
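The competitive selection and weighting steps above can be sketched as follows. The per-channel HMM experts are replaced by trivial sign classifiers on the columns of X, and the weighted error is used directly rather than resampling with replacement; both are simplifying assumptions, so this is only a skeleton of the algorithm, not the BM-HMM itself:

```python
import numpy as np

def bm_train(X, y, rounds):
    """Sketch of the competitive boosting loop: each round, the remaining
    channel expert with minimal weighted error wins, receives a mixture
    weight alpha_c (Eq. 3-12), and its errors (via beta_c) reweight the
    training distribution (Eq. 3-8)."""
    n, C = X.shape
    w = np.full(n, 1.0 / n)
    remaining = list(range(C))
    ensemble = []                                 # (alpha_c, channel) pairs
    for _ in range(min(rounds, C)):
        best = None
        for c in remaining:
            pred = np.where(X[:, c] > 0, 1, -1)   # stand-in "expert" for channel c
            err = np.sum(w[pred != y])            # error w.r.t. distribution W_i
            err_m = np.mean(pred[y == 1] != 1)    # per-class error rates
            err_r = np.mean(pred[y == -1] != -1)
            if best is None or err < best[0]:
                best = (err, c, err_m, err_r, pred)
        err, c, err_m, err_r, pred = best
        if err >= 0.5:                            # first stopping condition
            break
        alpha = 1.0 - np.sqrt(err_m ** 2 + err_r ** 2)   # Eq. 3-12
        beta = np.log((1 - err) / max(err, 1e-10))       # beta_c (Eq. 3-9 form)
        w = w * np.exp(beta * (pred != y))               # Eq. 3-8 update
        w /= w.sum()
        ensemble.append((alpha, c))
        remaining.remove(c)
    return ensemble

def bm_predict(X, ensemble):
    votes = sum(a * np.where(X[:, c] > 0, 1, -1) for a, c in ensemble)
    return np.sign(votes)
```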
With respect to other algorithms in the machine learning community, there are a
few ways to interpret the BM-HMM. BM-HMMs can be thought of as a modification to
boosting, or even a simpler version of the Mixture of Trees algorithm if the HMM chains
are interpreted as binary stumps [63]. Additionally, the temporal Markovian dynamics
coupled with the hierarchical structure and mixture modeling can be thought of as a
simple approximation to tree structured HMMs [41]. Other work has focused on solving
this problem of boosting multiple parallel classifiers [64]. Some authors have proposed
boosting solutions that reduce the dimensionality of the input data [64, 69, 70]. From
their perspective, the multidimensional inputs are treated as simple features of a single
random process to be modeled [64]. Our model is different since the input space is
treated as multiple random processes that are interacting with each other in some
unknown way. By decomposing the input space into multiple random processes, the
local contributions of the individual processes are exploited in a competitive fashion
rather than using the global effect of a single process. This type of algorithm also has
components similar to the Mixture of Experts (MOE) algorithm.
Since a single HMM chain is trained on a single neural channel, the number of
parameters is very small and can support the amount of training data. As discussed with
the IC-HMM, the individual HMM chains in the BM-HMM contain around 70 parameters.
In the next section, the complexity of the BM-HMM is slightly increased in order to
explicitly model dependencies between the neurons.
3.4 Linked Mixtures of HMMs
3.4.1 Modeling Framework
To move beyond the IC-HMM and the implicit dependencies of the BM-HMM, another
layer of hidden or latent variables is established to link and express the spatial
dependencies between the lower-level HMM structures (Figure 3-4). This creates
a clique-tree structure T (since there are cycles), where the hierarchical links exist
between neural channels.
The likelihood of the dynamic neural firings from all of the neurons for this
structure (Figure 3-4) is

P(O \mid \Theta) = \sum_{Q} \sum_{M} P(O, Q, M \mid \Theta) \qquad (3-13)
Figure 3-4. LM-HMM graphical model
\log P(O \mid S, M, \Theta) = \log P(M^1) + \sum_{i=2}^{N} \log P(M^i \mid M^{i-1}, \Theta) + \sum_{i=1}^{N} \left( \sum_{t=1}^{T} \log P(O_t^i \mid S_t^i, \Theta^i) + \sum_{t=1}^{T} \log P(S_t^i \mid S_{t-1}^i, M^i, \Theta^i) \right) \qquad (3-14)
where the dependency between the tree cliques is represented by a hidden
variable M in the second layer,

P(M^i \mid M^{i-1}, \Theta^i) \qquad (3-15)

and the hidden state sequence S also has a dependency on the hidden variable M in
the second layer:

P(S_t^i \mid S_{t-1}^i, M^i, \Theta^i) \qquad (3-16)
This hierarchical structure models the data across multiple neural channels while
also exploiting dependencies. Figure 3-4 shows how the lower observable variables O^i
are conditionally independent from the second-layer hidden variable M^i, as well as from
the sub-graphs of the other neural channels T^j (where i ≠ j). The hidden variable M in
the second layer of Equation 3-14 is interpretable as a mixture variable (when the
hierarchical links are excluded).
The LM-HMM implements a middle ground between making an independence
assumption and a full dependence assumption. Since a layer of hidden variables is
added, the computational cost increases. The next section will detail an approximation
that eases computational costs but still maintains the richness of modeling the
interrelationships between neurons.
3.4.2 Training with Expectation Maximization
Although using EM with graphical models provides insight into probability distributions
over the observed and hidden variables, some probabilities of interest are intractable
to compute. Instead of using brute force methods to evaluate such probabilities,
conditional independencies represented in the graphical model can be exploited. Often,
approximations like Gibbs sampling, variational methods, and mean field approximations
are applied in order to make the problem tractable or computationally efficient [41, 79].
For this model, a mean field approximation is used to allow interactions associated
with tractable substructures to be taken into account [79]. The basic idea is to associate
with the intractable distribution a simplified distribution that retains certain terms of the
original distribution while neglecting others, replacing them with parameters u_i that are
often referred to as "variational parameters". Graphically, the method can be viewed as
deleting edges from the original graph until a forest of tractable structures is obtained.
Edges that remain in the simplified graph correspond to terms that are retained in the
original distribution, and edges that are deleted correspond to variational parameters
[79, 81].
Approximations are used to find the expectation of Equation 3-14. In particular,
P(S_t^i \mid S_{t-1}^i, M^i, \Theta^i) is first approximated by treating M^i as independent
from S, making the conditional probability equal to the familiar
P(S_t^i \mid S_{t-1}^i, \Theta^i). Two important features are seen in this type of
approximation. First, the simple lower-level HMM chains are decoupled from the
higher-level M^i variables. Second, each M^i is now regarded as a linked mixture
variable for the HMM chains through P(M^i \mid M^{i-1}, \Theta^i), which is addressed
later [55].
Because the lower-level HMMs have been decoupled, the Baum-Welch formulation
can now be used to compute some of the calculations in the E-step, leaving estimation
of the variational parameter for later. As a result, the forward pass is calculated as
E step:
\alpha_j(t) = P(O_1 = o_1, ..., O_t = o_t, S_t = j \mid \Theta) \qquad (3-17)
This quantity is calculated recursively by setting:
\alpha_j(1) = \pi_j b_j(o_1) \qquad (3-18)

\alpha_k(t+1) = \left[ \sum_{j=1}^{N} \alpha_j(t) a_{jk} \right] b_k(o_{t+1}) \qquad (3-19)
The well-known backward procedure is similar:

\beta_j(t) = P(O_{t+1} = o_{t+1}, ..., O_T = o_T \mid S_t = j, \Theta) \qquad (3-20)

which computes the probability of the ending partial sequence o_{t+1}, ..., o_T given a
start at state j at time t. Recursively, β_j(t) is defined as
\beta_j(T) = 1 \qquad (3-21)

\beta_j(t) = \sum_{k=1}^{N} a_{jk} b_k(o_{t+1}) \beta_k(t+1) \qquad (3-22)
Additionally, a_{jk} and b_j(o_t) are the transition and emission matrices of the model,
which are updated in the M-step. Continuing in the E-step, the posteriors are rearranged
in terms of the forward and backward variables. Let

\gamma_j(t) = P(S_t = j \mid O, \Theta) \qquad (3-23)
which is the posterior distribution. Rearranging these quantities produces

P(S_t = j \mid O, \Theta) = \frac{P(O, S_t = j \mid \Theta)}{P(O \mid \Theta)} = \frac{P(O, S_t = j \mid \Theta)}{\sum_{k=1}^{N} P(O, S_t = k \mid \Theta)} \qquad (3-24)
and now, with the conditional independencies, the posterior is defined in terms of the
α's and β's:

\gamma_j(t) = \frac{\alpha_j(t) \beta_j(t)}{\sum_{k=1}^{N} \alpha_k(t) \beta_k(t)} \qquad (3-25)
We also define
\xi_{jk}(t) = P(S_t = j, S_{t+1} = k \mid O, \Theta) \qquad (3-26)
When expanded,

\xi_{jk}(t) = \frac{P(S_t = j, S_{t+1} = k, O \mid \Theta)}{P(O \mid \Theta)} = \frac{\alpha_j(t) a_{jk} b_k(o_{t+1}) \beta_k(t+1)}{\sum_{j=1}^{N} \sum_{k=1}^{N} \alpha_j(t) a_{jk} b_k(o_{t+1}) \beta_k(t+1)} \qquad (3-27)
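The E-step quantities of Equations 3-17 through 3-27 can be computed for one decoupled chain as follows. This is a standard discrete-emission forward-backward pass; numerical scaling, which a practical implementation would need for long sequences, is omitted from the sketch:

```python
import numpy as np

def forward_backward(pi, A, B, obs):
    """Forward-backward pass for one chain.
    pi: (N,) priors; A: (N, N) transitions a_jk; B: (N, M) emissions b_j(o);
    obs: list of symbol indices. Returns alpha, beta, gamma, xi."""
    N, T = len(pi), len(obs)
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                        # Eq. 3-18
    for t in range(1, T):                               # Eq. 3-19
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[T - 1] = 1.0                                   # Eq. 3-21
    for t in range(T - 2, -1, -1):                      # Eq. 3-22
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)           # Eq. 3-25
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):                              # Eq. 3-27
        num = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :]
        xi[t] = num / num.sum()
    return alpha, beta, gamma, xi
```

A useful sanity check is that the product α_j(t)β_j(t), summed over states, equals P(O | Θ) at every t.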
The M-step departs from the Baum-Welch formulation and introduces the variational
parameter [79]. Specifically, the M-step involves the update of the parameters π_j,
a_{jk}, and b_j(L) (the u_i are saved for later):
M step:

\pi_j = \frac{\sum_{i=1}^{I} u_i \gamma_1^i(j)}{\sum_{i=1}^{I} u_i} \qquad (3-28)

a_{jk} = \frac{\sum_{i=1}^{I} u_i \sum_{t=1}^{T-1} \xi_t^i(j, k)}{\sum_{i=1}^{I} u_i \sum_{t=1}^{T-1} \gamma_t^i(j)} \qquad (3-29)

b_j(L) = \frac{\sum_{i=1}^{I} u_i \sum_{t=1}^{T} \delta_{o_t, v_L} \gamma_t^i(j)}{\sum_{i=1}^{I} u_i \sum_{t=1}^{T} \gamma_t^i(j)} \qquad (3-30)
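A sketch of the weighted M-step of Equations 3-28 through 3-30, pooling the posteriors of I sequences with the variational weights u_i. The argument layout (lists of per-sequence γ and ξ arrays) is an assumption of this sketch:

```python
import numpy as np

def weighted_m_step(gammas, xis, obs_seqs, u, N, M):
    """Re-estimate pi, A, B from per-sequence posteriors, each weighted
    by its variational parameter u_i (Eqs. 3-28 to 3-30).
    gammas[i]: (T, N); xis[i]: (T-1, N, N); obs_seqs[i]: symbol indices."""
    u = np.asarray(u, dtype=float)
    pi = sum(ui * g[0] for ui, g in zip(u, gammas)) / u.sum()        # Eq. 3-28
    A_num = sum(ui * x.sum(axis=0) for ui, x in zip(u, xis))
    A_den = sum(ui * g[:-1].sum(axis=0) for ui, g in zip(u, gammas))
    A = A_num / A_den[:, None]                                        # Eq. 3-29
    B = np.zeros((N, M))
    for ui, g, obs in zip(u, gammas, obs_seqs):
        for t, o in enumerate(obs):
            B[:, o] += ui * g[t]                                      # Eq. 3-30 numerator
    B /= sum(ui * g.sum(axis=0) for ui, g in zip(u, gammas))[:, None]
    return pi, A, B
```

With all u_i equal this collapses to ordinary multi-sequence Baum-Welch re-estimation; the variational weights simply bias the statistics toward the sequences each chain explains well.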
There are two issues left to solve. First, how can the variational parameter be
estimated and maximized given the dependencies? Second, if it is not known
experimentally which neurons are affecting other neurons (if at all), how can the
dependencies between neurons be defined in the model?
3.4.3 Updating Variational Parameter
While still working within the EM framework, the variational parameters u_i are
treated as mixture variables generated by the i-th HMM, each having a prior probability
p_i. The set of parameters is estimated to maximize the likelihood function [34, 80, 81]

\prod_{z=1}^{n} \sum_{i=1}^{I} p_i P(O^{zi} \mid S^i, \Theta^i) \qquad (3-31)
Given the set of sequences and the current estimates of the parameters, the E-step
consists of computing the conditional expectation of the hidden variable M:

u_{zi} = E[M^i \mid M^{i-1}, O^{zi}, \Theta^i] = \Pr[M^i = 1 \mid M^{i-1} = 1, O^{zi}, \Theta^i] \qquad (3-32)
The problem with this conditional expectation is the dependency on M^{i-1}. Since
M^{i-1} is independent from O^i and \Theta^i, this decomposes into

u_{zi} = E[M^i \mid O^{zi}, \Theta^i] \, E[M^i \mid M^{i-1}] \qquad (3-33)
The first term, a well-known expectation for Mixture of Experts, is calculated using
Bayes rule and the prior probability that M = 1:

E[M^i \mid O^i, \Theta^i] = \Pr[M^i = 1 \mid O^i, \Theta^i] = \frac{p_i P(O^i \mid S^i, \Theta^i)}{\sum_{i=1}^{I} p_i P(O^i \mid S^i, \Theta^i)} \qquad (3-34)
Since the integration for the second term is much harder to compute, an integration
approximation is used that also maintains the dependencies. Importance sampling
is a well-known method capable of approximating the integration with a lower
variance than Monte Carlo integration [82]. The integration is approximated as

E[M^i \mid M^{i-1}] = \frac{1}{n} \sum_{z=1}^{n} \frac{P(O^{zi} \mid S^i, \Theta^i)}{P(O^{z(i-1)} \mid S^{i-1}, \Theta^{i-1})} \qquad (3-35)
where the n samples have been drawn from the proposal distribution
P(O^{z(i-1)} \mid S^{i-1}, \Theta^{i-1}). For the estimation of u_i the two terms are
combined:

u_{zi} = \frac{p_i P(O^{zi} \mid S^i, \Theta^i)}{\sum_{i=1}^{I} p_i P(O^{zi} \mid S^i, \Theta^i)} \sum_{z=1}^{n} \frac{P(O^{zi} \mid S^i, \Theta^i)}{n P(O^{z(i-1)} \mid S^{i-1}, \Theta^{i-1})} \qquad (3-36)

The M-step is then computed as

p_i = \frac{\sum_{z=1}^{n} u_{zi}}{\sum_{i=1}^{I} \sum_{z=1}^{n} u_{zi}} = \frac{\sum_{z=1}^{n} u_{zi}}{n} \qquad (3-37)
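Putting Equations 3-34 through 3-37 together, one update of the variational parameters might look like the following sketch. The array layout, the handling of the first chain (which has no predecessor), and the normalization used in the prior update are all assumptions of this sketch:

```python
import numpy as np

def update_u(loglik, p):
    """One variational update. loglik[z, i] holds log P(O^{zi} | S^i, Theta^i)
    for sequence z under chain i; p holds the current priors p_i.
    The first factor is the mixture posterior (Eq. 3-34); the second
    approximates E[M^i | M^{i-1}] by importance sampling against chain i-1
    (Eq. 3-35); the prior update follows Eq. 3-37 (normalized by the total
    weight so the priors remain a distribution)."""
    lik = np.exp(loglik)                              # (n, I)
    post = p * lik
    post /= post.sum(axis=1, keepdims=True)           # Eq. 3-34, per sequence z
    n, I = lik.shape
    dep = np.ones(I)
    dep[1:] = (lik[:, 1:] / lik[:, :-1]).mean(axis=0) # Eq. 3-35; chain 0 has no parent
    u = post * dep                                    # Eq. 3-36 (up to normalization)
    p_new = u.sum(axis=0) / u.sum()                   # Eq. 3-37
    return u, p_new
```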
Borrowing from the competitive nature of the BM-HMM, winners in the LM-HMM are
chosen based on the same criterion of minimizing the Euclidean distance of the
class errors.
3.5 Dependently Coupled HMMs
Up to this point, a variety of HMM models have been discussed that each determine
a coupling relationship between multiple channels. The model discussed in this section
will not only couple multiple HMM models, it will also directly characterize the coupling
relationships. Specifically, a new formulation to the Coupled Hidden Markov Model
(CHMM) (as described by [44]) will be used, which will be referred to as Dependently
Coupled Hidden Markov Model (DC-HMM) in order to maintain nomenclature. With
this formulation the joint conditional probability formulation is modeled as a linear
combination of marginal conditional probabilities with the weights represented
by coupling coefficients. Although some DC-HMM formulations have been shown to be
computationally expensive and nearly intractable [44, 46], this new formulation alleviates
some of those obstacles with a simple approximation. Although computational
complexity is increased beyond the LM-HMMs, a beneficial insight is gained since the
underlying structure within the neural data is revealed through the model’s parameters.
Consequently there may be an opportunity to exploit the underlying structure for
modeling purposes or gain an understanding into the neurophysiologic interactions.
The next section will cover the related work on DC-HMMs and contrast the benefits
of Zhong's [43] formulation. From there, the modeling framework is detailed, as well
as the forward procedure. This forward procedure is important for the discussion on
clustering analysis and neurophysiologic understanding in Chapter 6. Finally, the
learning algorithm for this DC-HMM formulation is presented with a discussion of the
coupling coefficients and how they characterize the coupling between channels.
3.5.1 Modeling Framework
The fully coupled architecture is powerful in modeling interactions among multiple
sequences. The joint transition probability is modeled as
P (S(c)t |S(1)
t−1, S(2)t−1, , , S
(C)t−1) =
C∑c=1
(Θc,cP (S(c)t |S(c)
t−1)) (3–38)
where Θc,c is the coupling weight from model c to model c, i.e. how much S(c)t−1 affects the
distribution of S(c)t . Which is controlled by P (S
(c)t |S(c)
t−1). Essentially, the joint dependency
is modeled as a linear combination of all marginal dependencies [43].
This formulation reduces the number of parameters compared to the standard
coupled HMM [43]. Of course, this model still contains more parameters than multiple
standard HMMs and is computationally more expensive than the LM-HMM, but
computing the coupling relationships between neural channels may yield beneficial
insight into the microstructure of the neural data. The additional parameters come from
the C^2 transition probability matrices, compared to only C matrices in C standard
HMMs, plus the coupling matrix \Theta. Since the number of output symbols required for
the neural data is much larger than the number of hidden states, the increase in the
number of transition matrices does not increase the model complexity dramatically.
Compared to the standard formulation of fully coupled HMMs, this formulation is also
easier to implement.
The DC-HMM is characterized by the quadruplet λ = (π, A, B, Θ), where Θ
contains the interaction parameters between channels (as shown in Figure 3-5).
Assuming there are C coupled HMMs, the parameter space consists of the following
components (with stochastic constraints):
1. prior probability \pi = (\pi_j^{(c)}), 1 \le c \le C, 1 \le j \le N^{(c)}:
\sum_{j=1}^{N^{(c)}} \pi_j^{(c)} = 1 \qquad (3-39)

2. transition probability A = (a_{ij}^{(c',c)}), 1 \le c', c \le C, 1 \le i \le N^{(c')}, 1 \le j \le N^{(c)}:
\sum_{j=1}^{N^{(c)}} a_{ij}^{(c',c)} = 1 \qquad (3-40)

3. observation probability B = (b_j^{(c)}(k)), 1 \le c \le C, 1 \le j \le N^{(c)}, 1 \le k \le M:
\sum_{k=1}^{M} b_j^{(c)}(k) = 1 \qquad (3-41)

4. coupling coefficient \Theta = (\Theta_{c',c}), 1 \le c', c \le C:
\sum_{c'=1}^{C} \Theta_{c',c} = 1 \qquad (3-42)
The forward-backward computation is illustrated in Figure 3-5. At each time slice
there are N^C β's (assuming N^{(c)} = N), an exponential number with respect
to C. The computational complexity would be O(TN^C), making it impractical to
compute the forward-backward variables for any realistic number of channels. To reduce
the computational complexity, a modified forward variable is calculated for each HMM
model separately. The modified forward variable reduces the complexity to O(TCN^2)
and is calculated inductively as
a) Initialization:

\alpha_1^{(c)}(j) = \pi_j^{(c)} b_j^{(c)}(o_1^{(c)}), \quad 1 \le j \le N^{(c)} \qquad (3-43)
Figure 3-5. DC-HMM trellis structure
b) Induction:

\alpha_t^{(c)}(j) = b_j^{(c)}(o_t^{(c)}) \sum_{c'=1}^{C} \Theta_{c',c} \sum_{i=1}^{N^{(c')}} \alpha_{t-1}^{(c')}(i) \, a_{ij}^{(c',c)}, \quad 2 \le t \le T \qquad (3-44)
c) Termination:

P(O \mid \lambda) = \prod_{c=1}^{C} P^{(c)} = \prod_{c=1}^{C} \left( \sum_{j=1}^{N^{(c)}} \alpha_T^{(c)}(j) \right) \qquad (3-45)

Experimental results show that the P(O \mid \lambda) calculated this way is close to the
true value [43].
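The modified forward recursion of Equations 3-43 through 3-45 can be sketched as follows. The data layout (lists of per-chain arrays) is an assumption; with C = 1 and Θ = [[1]] the recursion reduces to the standard forward procedure:

```python
import numpy as np

def dchmm_forward(pis, As, Bs, Theta, obs):
    """Modified forward variable for C coupled chains.
    pis[c]: (N,) priors; Bs[c]: (N, M) emissions; As[cp][c]: (N, N) transition
    matrix a^{(c',c)}; Theta[cp, c]: coupling weight from chain c' to chain c;
    obs[c]: symbol sequence for chain c. Returns the approximate P(O | lambda).
    Cost grows as T*C^2*N^2 in this direct nested-loop form."""
    C, T = len(pis), len(obs[0])
    alpha = [np.zeros((T, len(pis[c]))) for c in range(C)]
    for c in range(C):
        alpha[c][0] = pis[c] * Bs[c][:, obs[c][0]]                 # Eq. 3-43
    for t in range(1, T):
        for c in range(C):
            mix = sum(Theta[cp, c] * (alpha[cp][t - 1] @ As[cp][c])
                      for cp in range(C))                          # Eq. 3-44
            alpha[c][t] = Bs[c][:, obs[c][t]] * mix
    return np.prod([alpha[c][-1].sum() for c in range(C)])         # Eq. 3-45
```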
3.5.2 Training
This subsection details the iterative optimization procedure for learning the
DC-HMM parameters using Zhong’s formulation. The proposed way to solve the
optimization problem is by using Lagrange multipliers. This constrained optimization
technique leads to an iterative re-estimation solution for all of the parameters. To simplify
the writing, only the transition matrix A will be discussed. Similar calculations are
applied to the remaining parameters π, B and Θ. Let L be the Lagrangian of P with
respect to the constraints associated with A
L = P + \sum_{i,c',c} \lambda_i^{(c',c)} \left( \sum_{j=1}^{N^{(c)}} a_{ij}^{(c',c)} - 1 \right) \qquad (3-46)

where the \lambda_i^{(c',c)} are the undetermined Lagrange multipliers. P is locally
maximized when
a_{ij}^{(c',c)} = \frac{a_{ij}^{(c',c)} \, \partial P / \partial a_{ij}^{(c',c)}}{\sum_{k=1}^{N^{(c)}} a_{ik}^{(c',c)} \, \partial P / \partial a_{ik}^{(c',c)}} \qquad (3-47)
While the likelihood function P(O \mid \lambda) is more complicated in this DC-HMM
formulation than in the standard HMM case, P is still a homogeneous polynomial with
respect to each type of parameter (namely \pi, A, B, and \Theta), and each type of
parameter is still subject to stochastic constraints [43].
Overall, the re-estimation formula is guaranteed to converge to a local maximum
[43, 66]. The only difference from the standard HMM case is in calculating the first
derivative of the likelihood function. With standard HMMs, these derivatives reduce to
a form in which only the forward and backward variables are needed [43, 66]. In the
above formulation, the derivatives are calculated using back propagation through time.
Fortunately, computational complexity is not significantly increased since similar forward
procedures are applicable to the calculation of the derivatives. The detailed computation
of these first derivatives is presented in Appendix B.
CHAPTER 4
BRAIN MACHINE INTERFACE MODELING: RESULTS
Although one of the goals for the dissertation is to find unsupervised neural
structures that correspond to underlying kinematic primitives, performance metrics
must first be decided in order to establish baseline results for the models. In the
following chapters, these metrics will serve as a way to compare the performance of
the unsupervised results. This type of comparison is necessary since there are no
known ground truths in real data.
Two metrics can be established by looking at the main BMI applications. The first
metric is classification performance. From a goal-oriented BMI perspective, classification
performance is an important metric in distinguishing between models that represent
goals or simple sub-goals. Although classification is limited with real neural data since
the classes are imposed by the user (using subjective kinematic features), this metric
will be useful for the simulated data since the classes are known. The second metric,
Correlation Coefficient (CC), is derived from current BMI research into trajectory
reconstruction. Most BMI researchers use the correlation coefficient between the
predicted trajectory of their models and the animal’s true arm trajectory in order to
assess the performance. Although the models discussed in this dissertation represent
discrete motion primitives or neural state structures, by treating these multiple models
as a front-end switch to respective Wiener filters, the continuous trajectory can be
reconstructed. Therefore the correlation coefficient becomes applicable as a metric.
Essentially, if the correlation coefficient increases, i.e. trajectory reconstruction is
improved, then the partitioning of the input space has merit.
This chapter will first present classification results on class labels obtained from
kinematic features. In order to assess the importance of these class labels and their
ability to partition the input space, continuous trajectories will be reconstructed in a
fashion similar to a Mixture of Experts framework (Figure 4-1): first the input data is
classified, then the neural data for the winning class trains a respective Wiener filter
(as discussed in Chapter 1). If the class labeling is appropriate, each Wiener filter will
develop a good model for a piece of the trajectory. Subsequently, performance should
improve beyond a single filter trained with the full trajectory. If the classification is bad,
the Wiener filters will mix the segments and performance should not be distinguishable
from a single Wiener filter since the individual filters will each generalize over the full
input space. Essentially all the individual filters will be equivalent, nullifying the effect of
switching between them.
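A minimal sketch of this switched multiple-model reconstruction, with precomputed class labels standing in for the HMM front end and ordinary least-squares fits standing in for the Wiener filters of Appendix A (both stand-ins are assumptions of this sketch):

```python
import numpy as np

def fit_class_filters(X, y, labels):
    """Least-squares fit of one linear (Wiener-style) filter per class label."""
    return {k: np.linalg.lstsq(X[labels == k], y[labels == k], rcond=None)[0]
            for k in np.unique(labels)}

def switched_prediction(X, labels, filters):
    """Route each sample to the filter trained for its class (Figure 4-1)."""
    yhat = np.empty(len(X))
    for k, w in filters.items():
        m = labels == k
        yhat[m] = X[m] @ w
    return yhat

# Toy check: two classes generated by different linear maps of the input.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
labels = np.array([0] * 10 + [1] * 10)
y = np.where(labels == 0, X @ np.array([1., 2., 3.]), X @ np.array([-3., 0., 1.]))
filters = fit_class_filters(X, y, labels)
cc = np.corrcoef(y, switched_prediction(X, labels, filters))[0, 1]
```

When the labeling matches genuinely different input-output regimes, the correlation coefficient of the switched prediction exceeds that of a single filter fit over the full trajectory; when the labels are uninformative, the per-class filters converge to the same solution and the advantage vanishes, exactly as argued above.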
Figure 4-1. Multiple model methodology
Table 4-1. Classification results (BM-HMM selected channels), with monkey data
Model    #channels  % correct
IC-HMM   104        92.4%
IC-HMM   9          87.1%
BM-HMM   9          92.0%
Linear   104        88.3%
Linear   9          86.9%

Table 4-2. Classification results (LM-HMM selected channels), with monkey data
Model    #channels  % correct
IC-HMM   104        92.4%
IC-HMM   9          89.5%
LM-HMM   9          92.1%
Linear   104        88.3%
Linear   9          87.8%
4.1 Monkey Food Grasping Task
4.1.1 Boosted and Linked Mixture of HMMs
The BM-HMM and LM-HMM are both initialized with three hidden states and an
observation sequence length T = 10, which corresponds to one second of data (given
the 100 ms bins). These choices are based on previous efforts to optimize performance
[16]. For the linear classifier, a Wiener filter with a 10-tap delay (corresponding to one
second of data) is used, followed by a threshold for classification. Ten taps are used to
be consistent with the HMM work. For the monkey food grasping task, 8000 samples are
used to train the models while 2000 samples are used as a cross validation set to select
the channels. A separate test set of 5000 samples is used for the classification results
above. These parameters and thresholds (for the linear model) were chosen empirically
from multiple Monte Carlo runs, based on previous work. In earlier work, "Leave-K-Out"
methodologies were employed to ensure that the results on these data sets generalize
[25].
To provide a fair comparison between the methods, the same neural channels that
were chosen by the BM-HMM are also used with the linear model and the IC-HMM.
Similarly, for the LM-HMM, a different set of neurons was selected (although some
overlap with the BM-HMM set). The selection of neurons is determined by the algorithm,
while the number of neural channels is imposed in order to have an equal number of
channels for comparison.
Tables 4-1 and 4-2 show a comparison of the BM-HMM and LM-HMM against the
full IC-HMM and the linear classifier. The BM-HMM and
LM-HMM perform on par with the IC-HMM, but with the added benefit of dimensionality
reduction. Essentially, the same performance is obtained with a fraction of the number of
neural channels. This dimensionality reduction is very important when considering the
hundreds to thousands of channels that will be acquired in future BMI experiments [10].
In order to understand how the two methods differ in exploiting the channels,
Figures 4-2 and 4-3 show the correlation coefficients between the subset of channels
chosen by the BM-HMM and LM-HMM. Interestingly, more of the channels in the
LM-HMM have a large positive and large negative correlation (with respect to the move
class). In contrast, the BM-HMM has some positive correlation amongst its subset, but
few channels exhibit negative correlation.
Figure 4-2. Correlation coefficients between channels (monkey moving)
Tables 4-3 and 4-4 show the effect of randomly selecting the experts. Notice how
the results in these tables are poor in comparison to Tables 4-1 and 4-2; in particular,
they show a significant decrease in performance when modeling the monkey data.
Interestingly, α_c converges to similar values (like an averaging) for both models.
This averaging effect could be due to the lack of dependency among the randomly
selected neurons (i.e., they are more independent). Since many of the channels have
low firing rates
Figure 4-3. Correlation Coefficients between channels (monkey at rest)
or may not even correspond to the particular motions, some of these unimportant
neurons may also reduce performance. The averaging effect is further confirmed by
the results from the IC-HMM since similar performance is achieved with respect to
both models in Tables 4-3 and 4-4. Since the IC-HMM is an averaging model used for
independent neurons, similar performance on the random subsets would indicate that
dependencies are not being exploited by the models. Figure 4-4 shows the correlation
coefficient between the randomly selected channels. Since the colors are all light green,
the correlation between the channels is low to zero, further alluding to independence
between these particular channels. The results further suggests that the LM-HMM is
better at exploiting stronger correlated channels (i.e. dependent) than the BM-HMM, and
much stronger than randomly selecting channels (which may exhibit more independence
between channels).
Additionally, Figure 4-5 shows an expert-adding experiment in which the best-ranked
experts are added one by one to the ensemble vote. As the BM-HMM chains are added
in Figure 4-5, an interesting result emerges: the error rate quickly decreases below the
IC-HMM error when applied to the monkey neural data. The LM-HMM shows a faster
drop in error, but does not achieve the same result as the BM-HMM. Overall,
Figure 4-4. Correlation coefficient between channels (randomly selected for monkey)
the boosted mixtures and linked mixtures exploit more useful and complementary
information for the final ensemble than the simple IC-HMM.
Figure 4-6 presents the peri-event time histograms for all the neural channels in
parallel. These histograms present the firing rate of the different neurons averaged
across a single event (i.e., the start of a movement trial) [71]. In these figures, the
firing rate is normalized so that light colors illustrate decreased firing rates and
darker colors indicate increased firing rates. Overlaid and stretched on each image
Table 4-3. Classification results (random BM-HMM selected channels), with monkey data
Model    #channels  % correct
IC-HMM   104        92.4%
IC-HMM   9          69.1%
BM-HMM   9          69.4%
Linear   104        88.3%
Linear   9          68.1%

Table 4-4. Classification results (random LM-HMM selected channels), with monkey data
Model    #channels  % correct
IC-HMM   104        92.4%
IC-HMM   9          63.5%
LM-HMM   9          65.4%
Linear   104        88.3%
Linear   9          63.2%
Figure 4-5. Monkey expert adding experiment
is the average of the movement trials. Interestingly, some channels have no pattern
associated with them whereas others have a consistent pattern. Also notice how the
BM-HMM selects neurons that display both an increase in firing activity as well as
neurons that decrease their firing during the onset of movement. The LM-HMM has also
been observed to use neurons that reduce their firing rate during movement. Overall,
linear or non-linear filters may not be able to take advantage of these types of neurons
whereas these graphical models incorporate the information as simply different states.
Overall, the results demonstrate three interesting points. First, nine BM-HMM and
LM-HMM chains outperform the linear classifier that uses the full input space on the
monkey data. Second, the subset of experts chosen by the BM-HMM performs
well with the linear model. This result is expected since the BM-HMM chains select
neural channels with important complementary information. Third, when comparing
the BM-HMM and LM-HMM to the linear classifier and the IC-HMM using the same
subset of neural channels, the results show that the hierarchical training of the BM-HMM
Figure 4-6. Parallel peri-event histogram for monkey neural data
and LM-HMM provides a significant increase in performance. This increase is related
to the dependencies that are exploited during each round of training, whereas
the other models simply try to uniformly combine all the neural information into a
single hypothesis. Tables 4-3 and 4-4 support this hypothesis since the LM-HMM and
BM-HMM default to an independent approximation. Finally, other BMI researchers apply
sensitivity analysis to understand the importance of a neural channel with respect to the
kinematics performed by the subject. In contrast, the BM-HMM and LM-HMM channel
selection tries to improve classification results by exploiting dependencies between
channels as well as kinematics. Interestingly, some of the channels selected in both data
sets overlap with neurons selected during sensitivity analysis [20, 67].
4.1.2 Dependently Coupled HMMs
Using the same classes as described for the BM-HMM and LM-HMM, a DC-HMM
(as described earlier) is trained for each class with 5000 data points. Based on empirical
testing with multiple Monte Carlo simulations, the observation length is set to ten
(1 sec of data) and the number of hidden states to three. The chosen neurons are the
same as those selected by the LM-HMM. In order to show the benefit of partitioning
the input space with the DC-HMM, Wiener filters (one per class) are used again in a
second phase to reconstruct the trajectory. These filters are trained as described in
Appendix A, with 10 tap delays. Table 4-5 shows the correlation coefficient results after
using the DC-HMM to partition the input space for trajectory reconstruction by respective
Wiener filters. The table shows that the DC-HMM with 6 neurons does not perform as
well as the LM-HMM but is far better than using random labels or a single Wiener filter.
The results in the table also show that adding more neurons improves the correlation
coefficient (but not beyond the LM-HMM).
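The tap-delay Wiener filters used throughout these experiments can be sketched as a least-squares fit on lagged firing rates. This is a minimal sketch under stated assumptions; the toy data and helper names are illustrative, not the dissertation's code.

```python
import numpy as np

def lagged_design(neural, taps=10):
    """Stack `taps` delayed copies of each channel plus a bias column."""
    n, c = neural.shape
    X = np.ones((n - taps + 1, taps * c + 1))
    for i in range(taps):
        X[:, i * c:(i + 1) * c] = neural[taps - 1 - i: n - i]
    return X

def train_wiener(neural, kinematics, taps=10):
    """Least-squares Wiener filter mapping lagged rates to a kinematic."""
    X = lagged_design(neural, taps)
    w, *_ = np.linalg.lstsq(X, kinematics[taps - 1:], rcond=None)
    return w

# toy data: kinematics are a causal linear function of the firing rates
rng = np.random.default_rng(1)
neural = rng.poisson(3.0, size=(500, 4)).astype(float)
s = neural @ rng.normal(size=4)
kin = s.copy()
kin[1:] += 0.5 * s[:-1]                 # depends on current and previous bin
w = train_wiener(neural, kin)
pred = lagged_design(neural) @ w
cc = np.corrcoef(pred, kin[9:])[0, 1]
```

With ten 100 ms taps, each prediction sees one second of neural history, matching the embedding described above.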
Table 4-5. Correlation coefficient using DC-HMM on 3D monkey data
Experiment CC
DC-HMM (6 Neurons) .81±.18
DC-HMM (29 Neurons) .82±.15
LM-HMM .84±.13
Single Wiener .76±.19
NLMS .75±.20
TDNN .77±.17
Table 4-6. NMSE on 3D monkey data
Experiment NMSE
DC-HMM (6 Neurons) .26
DC-HMM (29 Neurons) .23
LM-HMM .22
Single Wiener .36
Figure 4-7 shows the DC-HMM reconstruction of the kinematic output of the food
grasping task, whereas Figure 4-8 shows the true trajectory. Qualitatively the trajectories
are similar, confirming the correlation coefficients shown above. Additionally, the
normalized mean squared error (NMSE) was computed for the different models, showing
the graphical models outperforming the single Wiener filter. A t-test was applied to
the output of the DC-HMM-based bi-model system and the single Wiener filter with a
significance level of 0.05. The null hypothesis is rejected, with a p-value of
0.02.
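The NMSE and significance test above can be sketched as follows. The two predictions are synthetic stand-ins for the DC-HMM system and the single Wiener filter, so the numbers are illustrative only; here the paired t statistic is computed by hand on the per-sample squared errors.

```python
import numpy as np

def nmse(true, pred):
    """Mean squared error normalized by the variance of the target."""
    return np.mean((true - pred) ** 2) / np.var(true)

rng = np.random.default_rng(2)
true = np.sin(np.linspace(0, 20, 300))
pred_a = true + rng.normal(0, 0.2, 300)   # stand-in: multiple-model output
pred_b = true + rng.normal(0, 0.5, 300)   # stand-in: single Wiener filter

# paired t statistic on the per-sample squared errors
diff = (true - pred_a) ** 2 - (true - pred_b) ** 2
t_stat = diff.mean() / (diff.std(ddof=1) / np.sqrt(len(diff)))
# |t| beyond roughly 1.97 rejects the null at the 0.05 level for 299 dof
```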
Figure 4-7. Supervised monkey food grasping task reconstruction (position)
Figure 4-9 shows the Viterbi hidden state path for the different neurons across
the hidden states. Essentially, the three dimensions (X, Y, and Z) of the monkey's
arm are overlaid over the state paths. On the Y axis there are 24 states (6 neurons with
four states each) at each time bin. The dark lines indicate the largest α (Equation 3–9) at
each time bin. The left figure is the Viterbi path for the DC-HMM trained on movement
data while the right figure is for the data when the monkey is at rest. The figures depict
repeating patterns that correspond to the kinematics. Also notice that states from
different neurons dominate across the data set for each of the classes. Although
not pertinent to the results in this chapter, these Viterbi paths inspire further exploration
in Chapter 6.
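For reference, a Viterbi path such as the ones plotted above can be computed as in this sketch for a generic single-chain, discrete-observation HMM. The two-state toy parameters are assumptions for illustration and are unrelated to the fitted DC-HMMs.

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely hidden-state path for a discrete-observation HMM."""
    S, T = len(pi), len(obs)
    logd = np.log(pi) + np.log(B[:, obs[0]])
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = logd[:, None] + np.log(A)      # (from_state, to_state)
        back[t] = scores.argmax(axis=0)
        logd = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = np.zeros(T, dtype=int)
    path[-1] = logd.argmax()
    for t in range(T - 1, 0, -1):               # backtrack the best path
        path[t - 1] = back[t, path[t]]
    return path

# two-state toy HMM: state 0 prefers symbol 0, state 1 prefers symbol 1
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.1, 0.9]])
B = np.array([[0.9, 0.1], [0.1, 0.9]])
path = viterbi(np.array([0, 0, 0, 1, 1, 1]), pi, A, B)
```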
Figure 4-8. 3D monkey food grasping true trajectory
Figure 4-9. Hidden state space transitions between neural channels (for move and rest)
Figure 4-10 shows the coupling coefficients for each class (move/rest) for the six
neurons in the subset. The coupling coefficient is from the previous neuron's state to the
next. Brighter colors indicate a larger coupling coefficient. The left figure
presents the DC-HMM trained with movement data while the right is for the model of
rest. The diagonal that forms in the figure is the coupling coefficient between the same
neurons (indicating a preference for staying in the same state or neuron). Although some
neurons illustrate weak dependency, overall there is stronger evidence for independence
between neurons. This result may help to explain why even the linear filters are able
to work well with these neurons (since the DC-HMM has less to exploit). Also notice that a few
neurons dominate the models, which reinforces what is shown in Figure 4-9.
Figure 4-10. Coupling coefficient between neural channels (3D monkey experiment)
4.2 Rat Single Lever Task
For the following experiments, the HMMs had three hidden states and an observation
sequence length T = 10, which corresponds to one second of data (given the 100ms
bins). For the linear classifier, a Wiener filter with a 10-tap delay (that corresponds to
one second of data) is used. For the rat lever task, 5000 samples are used to train the
models while 2000 samples are used as a cross-validation set to select the channels. A
separate test set of 3000 samples is used for the classification results below. These
parameters and thresholds (for the linear model) were chosen empirically from multiple
Monte Carlo runs and based on previous work. As with the monkey grasping task,
"Leave-K-Out" methodologies were employed to ensure generalization of the results on
these data sets.
The same neural channels that were chosen by the BM-HMM for the Rat lever
task are also used with the linear model and the IC-HMM in order to provide a fair
comparison between the methods. Similarly, for the LM-HMM, a different set of neurons
(although some overlap the BM-HMM) were selected based on the algorithm but the
number of final channels were kept the same.
Table 4-7. Classification results with rat data (BM-HMM selected channels)
Model    #channels   % correct
IC-HMM   16          62.5%
IC-HMM   6           58.3%
BM-HMM   6           64.0%
Linear   16          61.8%
Linear   6           56.9%
Table 4-8. Classification results with rat data (LM-HMM selected channels)
Model    #channels   % correct
IC-HMM   16          62.5%
IC-HMM   6           56.5%
LM-HMM   6           62.3%
Linear   16          61.8%
Linear   6           55.2%
Tables 4-7 and 4-8 show a comparison of the classification results from the
BM-HMM and LM-HMM versus the full IC-HMM and a simple linear classifier created
by a regression model followed by a threshold [25]. Comparing these tables with Tables 4-1
and 4-2, the classification performance on monkey data is better than on the rat data.
The more important point is that again the BM-HMM and LM-HMM perform on par with
the IC-HMM, but with the added benefit of dimensionality reduction. Effectively, the
same performance is obtained with a fraction of the number of neural channels.
Figure 4-11, the parallel peri-event histogram, shows that the models also capture
rat neurons that dramatically reduce their firing rate during movement.
Figure 4-11. Parallel peri-event histogram for rat neural data
Tables 4-9 and 4-10 show the effect of randomly selecting the experts. Notice that the
results in these tables are poor in comparison to Tables 4-7 and 4-8. The performance
reduction is less pronounced for the data collected from the rat than from the monkey
food grasping task. The discrepancy is likely due to the available number of channels
that are randomly selected. Since the monkey data is collected from a greater number of
channels, there is a higher likelihood of selecting less important neurons for training.
Figure 4-12 presents the results from an expert adding experiment in which the
best-ranked experts are added one by one to the ensemble vote. The figure shows
that as the BM-HMM chains are added, an interesting result emerges: the error rate
quickly decreases below the IC-HMM error when applied to the rat neural data. The
LM-HMM has an even faster drop in error, but does not achieve the same result as the
BM-HMM. Similar to the results for the monkey food grasping task, the boosted mixtures
Table 4-9. Classification results with rat data (random BM-HMM selected channels)
Model    #channels   % correct
IC-HMM   16          62.5%
IC-HMM   6           56.5%
BM-HMM   6           56.9%
Linear   16          61.8%
Linear   6           55.1%
Table 4-10. Classification results with rat data (random LM-HMM selected channels)
Model    #channels   % correct
IC-HMM   16          62.5%
IC-HMM   6           55.4%
LM-HMM   6           54.3%
Linear   16          61.8%
Linear   6           54.8%
and linked mixtures are exploiting more useful and complementary information for the
final ensemble than the simple IC-HMM for the rat lever task.
Figure 4-12. Rat expert adding experiment
4.3 Monkey Cursor Control Task
4.3.1 Population Vectors
The last two sections demonstrated that partitioning the input is beneficial with
simple go/no-go experiments for both animals (rat and monkey). The focus
now turns to motion primitives more complicated than movement and rest. To
accomplish this goal, it is first important to understand what information the neurons
provide about the different arm kinematics. Specifically, motor cortex neurons are known
to initiate motor commands that then dictate the limb kinematics. Therefore this section
will focus on movement direction since it is the most natural way to navigate to a goal [9].
The population vector method [14] was one of the first methods that tried to address the
complicated relationship between movement direction and motor cortical activity. This
method links high neural modulation with a preferred movement direction.
In order to implement the population vector, tuning curve statistics are computed
for each of the neurons. These tuning curves provide a reference of activity for different
neurons. In turn, the neural activity relates to a kinematic vector, such as hand position,
hand velocity, or hand acceleration, often using a direction or angle between 0 and 360
degrees. A discrete number of bins are chosen to coarsely classify all the movement
directions. For each direction, the average neural firing rate is obtained by using a
non-overlapping window of 100ms. The preferred direction is computed using circular
statistics as
\text{circular mean} = \arg\Big( \sum_N r_N e^{i\Theta_N} \Big) \qquad (4–1)
where rN is the neuron’s average firing rate for angle ΘN , and N covers the full
angular range. Figure 4-13 shows an example polar plot of four simulated neurons and
the average tuning information with standard deviation across 100 Monte Carlo trials
evaluated for 16 min duration. The computed circular mean, estimated as the firing
rate weighted direction, is shown as a solid red line on the polar plot. The figure clearly
indicates that the different neurons fired more frequently toward the preferred direction.
Additionally, in order to obtain a statistical evaluation between Monte Carlo runs, the
traditional tuning depth was not normalized to (0, 1) for each realization, as is normally
done with real data. The tuning depth is calculated as

\text{Tuning Depth} = \frac{\max(r_N) - \min(r_N)}{\operatorname{std}(r_N)} \qquad (4–2)
Figure 4-13. Neural tuning depth of four simulated neurons
After preferred directions are acquired, the vectors of angles from predicting
neurons are added to create the trajectory in the population vector method. Although the
results are far from great [14], this type of analysis may lend some insight into a better
segmentation of the input space (relative to angular directions).
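Equations 4-1 and 4-2 can be implemented directly; the cosine-tuned toy neuron below is an assumption for illustration, not one of the recorded units.

```python
import numpy as np

def preferred_direction(rates, angles_deg):
    """Equation 4-1: circular mean, the firing-rate-weighted direction."""
    theta = np.deg2rad(angles_deg)
    return np.angle(np.sum(rates * np.exp(1j * theta)))

def tuning_depth(rates):
    """Equation 4-2: (max - min) of the rate profile over its std."""
    return (rates.max() - rates.min()) / rates.std()

# hypothetical neuron tuned to 90 degrees, sampled over 30 bins of 12 degrees
angles = np.arange(0, 360, 12)
rates = 5 + 4 * np.cos(np.deg2rad(angles - 90))
pref = np.rad2deg(preferred_direction(rates, angles)) % 360
depth = tuning_depth(rates)
```

The `np.angle` of the complex sum recovers the preferred direction plotted as the solid red line in Figure 4-13.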
Figure 4-14. Histogram of 30 angular velocity bins
Since angular values exist on the real line, it is necessary to quantize or bin
the angles. Figure 4-14 shows the histogram for the 30 different angular bins and
the number of examples per bin (i.e., the polar plot is stretched horizontally). Each
bin represents a range of angular velocities (12◦). The figure also presents two
distinguishable peaks in the histogram; this is due to the monkey's hand making
predominantly diagonal movements (Figure 4-15).
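The 12-degree binning can be sketched as below; the sinusoidal velocities are a toy stand-in for the recorded kinematics, chosen so that diagonal motion produces the two-peak histogram described above.

```python
import numpy as np

def bin_angles(vx, vy, n_bins=30):
    """Quantize 2D velocity direction into equal angular bins
    (30 bins of 12 degrees each, as in Figure 4-14)."""
    angles = np.degrees(np.arctan2(vy, vx)) % 360.0
    labels = (angles // (360.0 / n_bins)).astype(int)
    return labels, np.bincount(labels, minlength=n_bins)

# diagonal back-and-forth movement, which should produce two peaks
t = np.linspace(0, 20, 2000)
vx = np.cos(t)
vy = np.cos(t)            # motion along the 45/225-degree diagonal
labels, counts = bin_angles(vx, vy)
```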
4.3.1.1 A-priori class labeling based on population vectors
With a method that links neural modulation to particular angles, it would be interesting
to see if the binned angles themselves are appropriate as class labels. Since some
of the neurons are modulated across many angular bins, a method must be found to
compare the tuning curves. Using the analysis described in the last section, Figure 4-16
provides a comparison between the tuning curves of the neurons drawn in parallel.
Essentially each neuron's tuning depth is drawn horizontally (x-axis) rather than on a
polar plot (angular bins run from left 0◦ to right 360◦). The depths are then plotted in
Figure 4-15. 2D angular velocities
parallel along the y-axis (185 neurons). The darker pixels indicate a higher firing rate for
a particular angle. Since all the tuning depths are normalized, the figure clearly shows
that some neurons are tuned to different angles relative to other neurons. Figure 4-16B
is a plot of the maximum depth at a particular angular bin. Interestingly, Figure 4-16B
shows that a small subset of neurons have a very high (normalized) firing rate during
particular angles. Also, some neurons modulate across multiple bins (i.e., a wide
range of angles).
4.3.1.2 Simple naive classifiers
In this section, two simple winner-take-all experiments are conducted in order to
test if the angular bins could serve as class labels. For the first experiment, the mean
and variance of the firing rate are computed for each neuron (and each angular bin)
under an assumed Gaussian distribution. At each time instance, the largest probability
is found for a particular angular bin, consequently assigning the particular time bin
the respective angle. In experiment two, the mean and variance are also computed on
Figure 4-16. A. Parallel tuning curves B. Winning neurons for particular angles
the firing rate for each neuron (with a Gaussian assumption). Departing from experiment
one, the probability of each angular bin is treated as a marginal for each neuron (at
each time instance). The largest joint probability for each angle is then used to make
the classification decision. For both experiments, classification performance is the only
metric considered.
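The two winner-take-all experiments can be sketched together as below; the second experiment's joint probability is the sum of per-neuron Gaussian log-likelihoods. The two-neuron, two-class toy data is an assumption for illustration (real bins number 10 to 30).

```python
import numpy as np

def fit_gaussians(rates, labels, n_classes):
    """Per-neuron, per-class mean and variance of the firing rate."""
    mu = np.zeros((n_classes, rates.shape[1]))
    var = np.ones((n_classes, rates.shape[1]))
    for k in range(n_classes):
        sel = rates[labels == k]
        mu[k] = sel.mean(axis=0)
        var[k] = sel.var(axis=0) + 1e-6     # guard against zero variance
    return mu, var

def classify(rates, mu, var):
    """Winner-take-all on the joint (naive) Gaussian log-likelihood."""
    ll = -0.5 * (np.log(2 * np.pi * var[:, None, :])
                 + (rates[None] - mu[:, None, :]) ** 2 / var[:, None, :])
    return ll.sum(axis=2).argmax(axis=0)   # best class per time bin

rng = np.random.default_rng(3)
labels = rng.integers(0, 2, 400)
# two neurons whose mean rate depends on the (hypothetical) angular class
rates = rng.normal(loc=np.column_stack([2 + 3 * labels, 5 - 3 * labels]),
                   scale=0.5)
mu, var = fit_gaussians(rates, labels, 2)
pred = classify(rates, mu, var)
acc = (pred == labels).mean()
```

On this cleanly separated toy data the classifier succeeds; the point of the section is that on the real recordings it does not.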
Unfortunately the results from both experiments are poor. In experiment one, the
angular bins could only be correctly classified 10% of the time. Similarly, experiment
two produces poor results. Additionally, classification does not significantly improve
when more coarsely quantized bins are used. Figure 4-17 shows the histogram for 10 bins
(36◦). Figures 4-18A and 4-18B demonstrate similar tuning but at a coarser quantization
level. Notice in Figure 4-18B that fewer neurons cover all of the angles (which
is to be expected with larger quantized bins). Overall, these results show that simple
modulations are not enough to classify the neural input and that more complicated
modeling structures are necessary.
4.3.2 Results for the Cursor Control Monkey Experiment
Although segmenting the neural data into angular velocities did not work in the
simple winner-take-all experiments above, the next step is to test the hypothesis
that the kinematic reconstruction improves if the input is segmented into the binned
Figure 4-17. Histogram of 10 angular bins
Figure 4-18. A. Parallel tuning curves B. Winning neurons for particular Angles
angular velocities. Although segmenting the neural input is known to improve trajectory
reconstruction with multiple models on food reaching tasks, this methodology has
not been tested with continuous 2D trajectories. By using angular velocities for
segmentation, the LM-HMMs and DC-HMMs should capture the neurons that are
modulating for particular angles. With these models isolated for a particular angular
velocity, performance should improve. Although this methodology is supervised, a
baseline is established for the unsupervised learning and will serve to reinforce the
hypothesis that partitioning the input space is beneficial. On a side note, the number of
samples per class and the predominant kinematic features (per class) are imposed on
the generative models when training for these supervised experiments. Even though
angular velocity serves as the predominant kinematic feature in these experiments, in
the later unsupervised experiments the imposed features may not be the predominant
features that the generative models capture.
Table 4-11. Correlation coefficient using different BMI models
Experiment CC(X) CC(Y)
NMCLM(FIR) .67±.03 .48±.07
NMCLM(Gamma) .67±.02 .47±.07
Single Wiener .66±.02 .48±.10
NLMS .68±.03 .50±.08
Gamma .70±.02 .53±.09
Subspace .70±.03 .58±.10
Weight Decay .71±.03 .57±.08
Kalman .71±.03 .58±.10
TDNN .65±.03 .51±.08
Before describing the experiments in detail, Table 4-11 shows the correlation
coefficients for the different BMI models that have been used on this particular cursor
control dataset. Since the following experiments use the same data set, it is appropriate
to compare these correlation coefficients to those produced by our methods. Of
particular interest is the Non-Linear Mixture of Competitive Linear Models (NMCLM),
since its methodology for constructing the trajectory is similar to the multiple-model
approach discussed above. Specifically, the NMCLM divides the input space and applies
a switching mechanism to select a winning Wiener filter to construct the trajectory in a
piecewise fashion [19].
For the following experiments, several binned angular velocities are used as the
class labels. The neurons that modulate for particular angles will be isolated and
help in training Wiener filters that learn only a homogeneous portion of the input/output
space (corresponding to an angular bin). Based on empirical and previous results,
ten taps are used for both the single filter and the multiple filters (one filter per class).
This amount of data corresponds to one second of time (ten 100ms
time bins), which is comparable to the time embedding used by
the methods in Table 4-11. The correlation coefficient of the reconstructed trajectory
provides a metric and baseline comparison for the unsupervised results in Chapter 5.
For the experiments below, a training set of 5000 samples is used along with a
test set of 3000 samples. Training with the classes is a two-step process. First the
LM-HMM and DC-HMM are trained on the segmented data and their parameters
are frozen. Then the linear models are trained with neural data that has been classified
as a particular class by the LM-HMM or DC-HMM, after which the weights of the
linear models are also frozen. During testing, the LM-HMM and DC-HMM are evaluated on the test
set and switch the input to the corresponding linear filter, which then computes the output
trajectory. As mentioned, the data can be quantized into many angular bins;
through empirical testing, four classes (angular bins) were selected.
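The two-step train/test procedure above can be sketched as a gated multiple-model pipeline. The gate labels here are given directly (a stand-in for the LM-HMM/DC-HMM classification), and the data and weights are toy assumptions.

```python
import numpy as np

def piecewise_predict(X, gate_labels, filters):
    """Route each input sample to the linear filter for its class and
    assemble the output trajectory piecewise.

    X: (n_samples, n_features) lagged neural data.
    gate_labels: class index per sample (stand-in for the HMM gate).
    filters: dict class -> weight vector.
    """
    y = np.zeros(len(X))
    for k, w in filters.items():
        sel = gate_labels == k
        y[sel] = X[sel] @ w
    return y

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))
labels = rng.integers(0, 2, 200)
w0, w1 = rng.normal(size=5), rng.normal(size=5)
target = np.where(labels == 0, X @ w0, X @ w1)   # per-class ground truth

# per-class least-squares training, then piecewise reconstruction
filters = {}
for k in (0, 1):
    sel = labels == k
    filters[k], *_ = np.linalg.lstsq(X[sel], target[sel], rcond=None)
pred = piecewise_predict(X, labels, filters)
```

Because each filter only ever sees its own homogeneous portion of the input/output space, the piecewise reconstruction can fit behavior a single global filter cannot.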
With respect to the other models used on this cursor control task, Table 4-12
demonstrates that again the DC-HMM is not quite as good as the LM-HMM, but both
are still better than the single Wiener filter and TDNN (and dramatically better than the
NMCLM).
Figure 4-19 shows the reconstruction (red) versus the true trajectories (blue) of
the monkey cursor control task. Overall, there is some benefit in using the DC-HMM.
Perhaps the subdued results are due to an independent set of neurons acquired in the
data set. With such data, the DC-HMM would not be of as much benefit since it could not
exploit as much information.
Figure 4-20 shows the Viterbi state transitions for the different neurons chosen
for the DC-HMM. Along the Y-axis are the different neural states (24 total: four states
for each of six neurons). Each subplot represents one of the four different classes. Notice
how different neurons are preferred over others with respect to each class. Some
neurons predominantly transition to themselves rather than to other neurons. These
predominant transitions were also observed in the last section with the food grasping
task. Unfortunately, discerning a repeating pattern is difficult with this dataset since the
cursor task is not repetitive like the food grasping task.
Figure 4-21 further confirms that most of the neurons are coupled to themselves
more so than to the other neurons. The figure shows the four different classes and the
coupling coefficients between neurons.
Table 4-12. Correlation coefficient using DC-HMM on 2D monkey data
Experiment CC(X) CC(Y)
DC-HMM .74±.08 .67±.13
LM-HMM .79±.04 .75±.11
Single Wiener .66±.02 .48±.10
TDNN .65±.03 .51±.08
NMCLM(FIR) .67±.03 .50±.07
NMCLM(Gamma) .67±.02 .47±.07
Figure 4-19. True trajectory and reconstructed trajectory (DC-HMM)
Figure 4-20. Hidden state transitions per class (cursor control monkey experiment)
Figure 4-21. Coupling coefficient between neurons per class (cursor control monkey experiment)
CHAPTER 5
GENERATIVE CLUSTERING
5.1 Generative Clustering
Chapter 4 demonstrated that by using simple angular bins or movement/rest as
class labels to partition the neural input, the trajectory reconstruction is remarkably
improved. Consequently, the improvements provided evidence that the neural activity in
the motor cortex transitions through multiple states during movement [13, 35] and that
the switching behavior is exploitable for BMIs.
Unfortunately, to partition the neural input space, a-priori class labels are needed
for separation. However, under real conditions with paraplegics, there are no kinematic
clues to separate the neural input into class labels (or clusters). Currently, most of the
behaving animals engaged in BMI experiments are not paralyzed, allowing the kinematic
information to be used for training the models. This Achilles heel plagues most BMI
algorithms since they require kinematic training data to find a mapping to the neural
data.
Since kinematic clues are not available from paraplegics, neural data must be
exclusively used to find a separation. Finding neural assemblies or structures may
offer a solution. The hypothesis argued throughout this dissertation is that there are
multiple neural structures corresponding to motion primitives. Initial supervised results
support this hypothesis [16]. Therefore the goal is to find a model that can learn these
temporal-spatial structures or clusters and segment the neural data without kinematic
clues or features (i.e., unsupervised). In this chapter, the LM-HMM and DC-HMM
models are combined with a clustering methodology in order to cluster neural data.
These models are chosen for their ability to operate solely in the input space and their
ability to characterize the temporal-spatial space at a reduced computational cost.
The methodology described in the next section will explain how the models learn the
parameters and structure of the neural data in order to provide a final set of class or
cluster labels for segmentation.
Clustering framework.
This section establishes a model-based method for clustering the spatial-temporal
neural signals using the LM-HMM or DC-HMM. In effect, the clustering method tries
to discover a natural grouping of each exemplar S (i.e., a window of multidimensional
neural data) into K clusters. A discriminant (distance) metric similar to K-means is used,
except that the vector centroids are now probabilistic models (LM-HMMs or DC-HMMs)
representing dynamic temporal data [74].
The bipartite graph view (Figure 5-1) assumes a set of N data objects D (e.g.,
exemplars, i.e., windows of data), represented by S_1, S_2, ..., S_N, and K probabilistic
generative models (e.g., LM-HMMs or DC-HMMs), λ_1, λ_2, ..., λ_K, each corresponding
to a cluster of exemplars [75]. The bipartite graph is formed by connections between the data
and model spaces. The model space usually contains members from a specific family of
probabilistic models. A model λ_y can be viewed as the generalized 'centroid' of cluster y,
though it typically provides a much richer description of the cluster than a centroid in the
data space. A connection between an object S and a model λ_y indicates that the object
S is associated with cluster y, with the connection weight (closeness) between
them given by the log-likelihood log p(S|λ_y).
A straightforward design of a model-based clustering algorithm is to iteratively
retrain models and re-partition data objects. Essentially, clustering is achieved by
applying the EM algorithm to iteratively compute the (hidden) cluster identities of data
exemplars in the E-step and estimate the model parameters in the M-step. Although
the model parameters start out as poor estimates, eventually the parameters converge
to their true values as the iterations progress. The log-likelihoods are a natural way to
provide distances between models as opposed to clustering in the parameter space
(which is unknown). Basically, during each round, each training exemplar is re-labeled
Figure 5-1. Bipartite graph of exemplars (x) and models
by the winning model, with the final outcome being a set of labels that relate to
a particular cluster or neural state structure for which spatial dependencies have
also been learned. The dependency structure is learned during the inner loop of the
LM-HMM or DC-HMM training.
Setting the parameters is daunting since the experimenter must choose the number
of states, the length of the exemplar (window size), and the distance metric (e.g.,
log-likelihood). To alleviate some of these model initialization problems, parameter
settings found during earlier work are used for these experiments [16].
Specifically, an a-priori assumption is made that the neural channels share the same
window size and the same number of hidden states. The clustering framework is outlined
below:
Let the data set D consist of N sequences for J neural channels, D = {S_1^1, ..., S_N^J},
where S_n^j = (O_1^j, ..., O_T^j) is a sequence of observables of length T, and let
Λ = (λ_1, ..., λ_K) be a set of models. The multiple sequences in a window of time (size T)
are referred to as an exemplar. The goal is to locally maximize the log-likelihood function

\log P(D \mid \Lambda) = \sum_{S_n^j \in D} \log P(S_n^j \mid \lambda_{y(S_n^j)}) \qquad (5–1)
1. Randomly assign K labels (with K < N), one for each windowed exemplar S_n,
1 ≤ n ≤ N. The LM-HMM parameters are initialized randomly.
2. Train each assigned model with its respective exemplars using the LM-HMM
or DC-HMM procedure discussed earlier. During this step the model learns the
dependency structure for the current cluster of exemplars.
3. For each model, evaluate the log-likelihood of each of the N exemplars given model λ_i,
i.e., calculate L_in = log L(S_n|λ_i) for 1 ≤ n ≤ N and 1 ≤ i ≤ K. The cluster identity of
the exemplar is y(S_n) = argmax_i log L(S_n|λ_i). Then re-label all the exemplars based
on cluster identity to maximize Equation 5–1.
4. Repeat steps 2 and 3 until convergence occurs or until the percentage of exemplars
changing labels falls below a threshold. More advanced metrics for deciding when to
stop clustering could be used (e.g., KL divergence).
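The four steps above can be sketched compactly. A diagonal Gaussian stands in for the LM-HMM/DC-HMM "centroid" of each cluster (an assumption made only to keep the sketch short); the loop structure of random labels, retraining, log-likelihood relabeling, and convergence is the same.

```python
import numpy as np

def model_based_cluster(exemplars, K, n_iter=20, seed=0):
    """K-means-style clustering where each 'centroid' is a generative
    model and closeness is the log-likelihood log p(S | lambda_y).

    exemplars: (N, d) array, one flattened window of data per row.
    """
    rng = np.random.default_rng(seed)
    N, d = exemplars.shape
    labels = rng.integers(0, K, N)               # step 1: random labels
    for _ in range(n_iter):
        mu, var = np.zeros((K, d)), np.ones((K, d))
        for k in range(K):                       # step 2: retrain models
            sel = exemplars[labels == k]
            if len(sel):
                mu[k] = sel.mean(axis=0)
                var[k] = sel.var(axis=0) + 1e-6
        # step 3: log-likelihood of every exemplar under every model
        ll = -0.5 * (np.log(2 * np.pi * var[:, None, :])
                     + (exemplars[None] - mu[:, None, :]) ** 2
                     / var[:, None, :]).sum(axis=2)
        new_labels = ll.argmax(axis=0)           # re-label by winning model
        if np.array_equal(new_labels, labels):   # step 4: convergence
            break
        labels = new_labels
    return labels

# two well-separated synthetic clusters of flattened exemplars
rng = np.random.default_rng(5)
a = rng.normal(0.0, 0.5, size=(100, 4))
b = rng.normal(3.0, 0.5, size=(100, 4))
labels = model_based_cluster(np.vstack([a, b]), K=2)
```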
5.2 Simulations
5.2.1 Simulated Data Generation
Since there are no known ground truths to label real BMI neural data, simulations
on plausible artificial data will help support the results found by the clustering framework
on real data.
The first series of simulations will consist of independent neurons generated from
a realistic neural model. Although many neural models have been proposed [76, 77],
the Linear-Nonlinear-Poisson (LNP) model is selected for the following simulations since
different tuning properties are selectable, allowing it to generate more realistic
neural data. The LNP model consists of three stages. The first is a linear transformation
that is then fed into a static non-linearity to provide the conditional firing rate for a
Poisson spike generating model at the third stage [76, 77].
Two simulated independent neural data sets are generated in the following
experiments. One data set contains four neurons tuned to two classes (Figure 5-2)
and a second data set contains eight neurons tuned to four classes (one pair of neurons
per class). To create these data sets, a velocity time series is first generated with a 100
Hz sampling frequency and 16 min duration (1,000,000 samples in total). Specifically,
a simple 2.5 kHz cosine and sine function is used to emulate the kinematics (X-Y
velocities) for the simulation experiments. Then the entire velocity time series (for both
data sets) is passed through an LNP model with the assumed nonlinear tuning function
in Equation 5–2.
\lambda_t = \exp(\mu + \beta\, \vec{v}_t \cdot \vec{D}_{\text{prefer}}) \qquad (5–2)
where λ_t is the instantaneous firing probability and µ is the background firing rate (set to
.00001). The variable β represents the modulation factor for a preferred direction, which
is set monotonically from 1 to 4 for the four neurons in the two-class simulation and to a
value of 3 for the eight neurons in the four-class simulation. The unit vector \vec{D}_{\text{prefer}} is the
preferred angular direction of the kinematics, which is set to π/4 and 5π/4 for the two-class
simulation and π/4, 3π/4, 5π/4, 7π/4 for the four-class simulation. The spike train is generated
by an inhomogeneous Poisson spike generator using a Bernoulli random variable with
probability λ(t)∆t within each 1ms time window. Once the spike trains are generated,
they are binned into 100ms bins while the velocity ~vt data is down-sampled accordingly.
For each data set, an additional 100 channels of uniformly distributed fake spike
trains (also 16 min each) are combined with the two data sets to create artificial
neural data sets with a total of 104 and 108 neurons, respectively. Essentially, adding the
extra fake channels allows less than an 8% chance for the true channels to be randomly
selected by the generative models.
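An LNP neuron per Equation 5-2 can be sketched as below. The velocities, duration, and preferred direction are illustrative assumptions rather than the exact simulation settings above.

```python
import numpy as np

def lnp_spikes(vel, d_pref, beta=3.0, mu=1e-5, dt=1e-3, seed=0):
    """Linear-Nonlinear-Poisson neuron per Equation 5-2: the firing
    rate is lambda_t = exp(mu + beta * v_t . D_prefer), and a spike
    is a Bernoulli draw with probability lambda_t * dt per 1 ms bin.

    vel: (n, 2) velocity samples; d_pref: preferred direction (radians).
    """
    rng = np.random.default_rng(seed)
    d = np.array([np.cos(d_pref), np.sin(d_pref)])
    lam = np.exp(mu + beta * (vel @ d))          # instantaneous rate
    p = np.clip(lam * dt, 0.0, 1.0)
    return (rng.random(len(vel)) < p).astype(int)

# sinusoidal X-Y velocities standing in for the simulated kinematics
t = np.arange(0, 60.0, 1e-3)                     # 60 s at 1 kHz
vel = np.column_stack([np.cos(2 * np.pi * t), np.sin(2 * np.pi * t)])
spikes = lnp_spikes(vel, d_pref=np.pi / 4)       # neuron tuned to pi/4
```

Binning such spike trains into 100 ms windows (and down-sampling the velocities accordingly) reproduces the data format used in the clustering experiments.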
The second series of simulations consists of dependent neurons in which the
temporal state transitions and dependencies are pre-defined. To produce samples, a
graphical model similar to the DC-HMM is initialized for each desired class/cluster.
The parameters are selected so that overlap between the models of each class/cluster
is reduced. A Gibbs sampler is then employed on each respective model to generate
exemplars of data. Although the Gibbs sampler is not the only sampler available, it is
very easy to implement, since sampling is achieved with the conditional distributions
between the nodes rather than by integrating over the joint (since the conditional
distributions are pre-defined) [79]. After all the samples are generated for each
respective class, the exemplars from each class are artificially placed in an alternating
pattern. In the following dependent neural simulations, four neurons were created, with
each class producing 5000 samples. The data sets also contain 100 channels of fake
neurons in order to assess the robustness of the models (as explained in prior sections).
Figure 5-2. Neural tuning depth of four simulated neurons
5.2.2 Independent Neural Simulation Results
Clustering with the LM-HMM.
Figure 5-3 demonstrates the clustering results using the LM-HMM on the two-class
simulated data set consisting of independent neurons. For this particular experiment,
the model parameters are set (during training) for two classes (k = 2), which is equal to
the true number of classes in the simulated data. Additionally, the class labels alternate
since they represent the alternating kinematics (shown at the bottom of the figure). As
seen from the figure, the model is able to correctly cluster the data in a relatively small
number of iterations (three to four). For the first iteration, each exemplar in the full data
set is randomly assigned to one of the clusters (indicated by green and blue colors). For
the remaining iterations, a pattern starts to emerge that looks similar to the alternating
kinematics. Although the kinematics (cosine and sine wave) are shown below the class
labels, the clustering results were acquired solely from the input space.
Figure 5-3. LM-HMM cluster iterations (two classes, k=2)
Figure 5-4 shows the class tuning preference when the model is initialized with
random data. The term 'tuning preference' refers to the angular preference of a
particular class (or cluster) label. The quantity is calculated the same way as neural
tuning, except that the data are collected from the samples labeled with a particular
class (i.e., circular statistics are computed on the kinematics assigned to class 3 rather
than on a neuron's firing). The figure clearly shows that, before modeling, the random
class labeling has not introduced a preference for any particular angle.
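The tuning-preference calculation can be sketched with circular statistics; the uniform angles and random labels below are illustrative stand-ins for the kinematic data, not the simulation itself.

```python
import numpy as np

def tuning_preference(angles, labels, k):
    """Preferred angle and tuning depth per class: the same circular
    statistics used for neural tuning, computed on the kinematics that
    fall under each cluster label instead of under each neuron."""
    prefs = {}
    for c in range(k):
        z = np.exp(1j * angles[labels == c]).mean()  # mean resultant vector
        prefs[c] = (np.angle(z), np.abs(z))          # (preferred angle, depth)
    return prefs

# Before clustering: random labels over uniform angles give depth near zero
rng = np.random.default_rng(1)
angles = rng.uniform(-np.pi, np.pi, 5000)
random_labels = rng.integers(0, 2, 5000)
prefs = tuning_preference(angles, random_labels, k=2)
```

A resultant length near zero for every class is exactly the "no preferred angle" condition shown for the random initialization.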
Figure 5-4. Tuning preference for two classes (initialized)
Figure 5-5 shows the angular preference of the classes after clustering. Overlaid
in blue are the original angular tunings of some of the neurons. Clearly the model is able
to successfully find the separation in neural firings, since each class is represented
by a different angular tuning (similar to respective neurons). Although angular velocity
shows itself to be a useful kinematic feature to separate the clusters, the models are not
solely restricted to this type of feature. Only in this experiment did the most prevalent
feature appear to be velocity. In later experiments with simulated and real neural data,
this kinematic feature will not be the most obvious.
For the simulation in Figure 5-6, the LM-HMM is used to cluster a four-class
simulated data set (k = 4). Again, the correct number of clusters k = 4 is set during
training to match the true number of classes in the input data (another oscillating
Figure 5-5. Tuned classes after clustering (two classes)
pattern). The figure shows a few issues with shrinkage and expansion of the class
labels; in other words, some labels extend beyond, or fall short of, the actual segment
boundaries. These effects are due to the temporal-spatial nature of the data (this is not
static classification) and the model's propensity to stay in a particular state. Overall the
final result demonstrates that the clustering model with the LM-HMM is able to discover
the underlying clusters present in the simulated independent neural data.
Figure 5-7 shows the clustering on the initial random labels while Figure 5-8 shows
the tuned preference of the four classes after clustering. Remarkably, the model is able
to determine the separation of the four classes using only the neural input. Next, the
clustering model is tested for robustness when the number of clusters is unknown or
increased noise is added to the neural data.
As with all clustering algorithms, choosing the correct number of underlying clusters
is difficult. Choosing the number of clusters for BMI data is even more difficult since
there are no known or established ground truths (with respect to motion primitives).
Figure 5-9 illustrates the case when the clustering model is initialized with four classes
(k = 4) despite the simulation containing only two underlying classes (or clusters) in the
input space. Again, the results are generated within a relatively small number of iterations.
Notice from the figure that the extra two class labels are absorbed into the two classes
Figure 5-6. LM-HMM cluster iterations (four classes, k=4)
shown in the previous Figure 5-3 (also shown below the four class labels). Interestingly,
a repeated pattern of consistent switching occurs with the class labels (as indicated by
the pattern of color blocks). Specifically, Figure 5-9 shows that class 1 precedes class 3
and class 2; when combined, these correspond to class 1 in Figure 5-3, while class 4 in
Figure 5-9 corresponds to class 2 in Figure 5-3. Remarkably, even the neural data from
such a simple simulation is complicated, yet the clustering method finds the consistent
pattern of switching (perhaps indicating that the simple classes are further divisible).
To further test robustness, random spikes are added to the unbinned spike trains
of the earlier tuned neurons (Figure 5-2). Specifically, uniformly random spikes are
generated with a fixed probability of spiking in every 1 ms slot. Figure 5-10 shows the
classification performance as the probability of firing is increased from a 1% to a 16%
chance of spiking per 1 ms. Interestingly, performance does not decrease significantly.
The robustness is due to the tuned neurons still maintaining their underlying temporal
structure. Figure 5-11 shows the tuning polar plots of the four original neurons with the
Figure 5-7. Tuning preference for four classes (initialized)
added random spikes. Although the figure shows that tuning broadens across many
angular bins, the random spikes have no temporal structure; therefore, they do not
significantly displace the temporal structure of the tuned neurons (as indicated by only
a small change in performance). Note that increasing the probability of random spikes
to 16% every 1 ms pushes the spiking beyond the realistic firing rate of real neurons.
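The noise injection can be sketched as follows; the baseline train below is a hypothetical stand-in for the tuned neurons' 1 ms spike trains.

```python
import numpy as np

rng = np.random.default_rng(2)

def add_random_spikes(spike_train, p_spike):
    """OR uniformly random spikes into a binary 1 ms spike train,
    each 1 ms slot spiking independently with probability p_spike."""
    noise = rng.random(spike_train.shape) < p_spike
    return (spike_train.astype(bool) | noise).astype(int)

# Hypothetical sparse 10 s train (10000 one-ms slots)
train = (rng.random(10000) < 0.01).astype(int)
noisy = add_random_spikes(train, p_spike=0.16)  # the 16% extreme case
```

Because the noise is OR-ed in, the original tuned spikes (and hence their temporal structure) are preserved, which is consistent with the small performance change reported above.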
As explained, the different artificial neurons are modulated so that their tuning depth
monotonically increases (i.e., β set from 1 to 4). The LM-HMM clustering successfully
selects the neurons in the correct order (respective to tuning depth) from the 100
random neural channels. The result is the same when the tuned neurons are corrupted
with random spike noise.
Figure 5-8. Tuned classes after clustering (four classes)
Given that the clustering model still achieves good performance under noise, it
is important to ensure that the simulation is not too simplistic. Therefore, two surrogate
data sets are generated from the simulation data. The first surrogate is generated by
randomizing the spatial relationships between the neurons. Specifically, at each time bin
the bin counts for each neuron are randomly switched with another channel. This process
is repeated through the length of the data set, so that the spatial arrangement in bin N
differs from that in bin N − 1.
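A minimal sketch of this spatial surrogate, assuming the data are stored as a (time bins × neurons) count matrix:

```python
import numpy as np

rng = np.random.default_rng(3)

def spatial_surrogate(counts):
    """Destroy spatial structure: inside every time bin, randomly
    reassign the bin counts to different channels."""
    out = np.empty_like(counts)
    for t in range(counts.shape[0]):
        out[t] = counts[t, rng.permutation(counts.shape[1])]
    return out

counts = rng.integers(0, 5, size=(1000, 4))  # (time bins, neurons)
surr = spatial_surrogate(counts)
```

Each bin keeps the same multiset of counts (only their channel assignment changes), so the overall firing statistics survive while the spatial relationships do not.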
Figure 5-12 shows the clustering results on such a surrogate dataset. The
clustering model correctly fails to cluster the data since the spatial information is ruined.
Figure 5-9. LM-HMM cluster iterations (two classes, k=4)
Figure 5-10. Classification degradation with increased random firings
Figure 5-11. Neural tuning depth with high random firing rate
Figure 5-13 shows the tuned preference of the clusters after clustering. This figure
correctly shows that there is no tuned preference since the model failed (as hoped).
The second surrogate is generated by randomizing the temporal relationships
between the neurons. Specifically, at each time bin the bin counts for all of the neurons
are randomly switched with another bin in time (keeping the same channel). This
process is repeated through the length of the data set where the temporal relationship
is destroyed but the spatial relationship is kept intact. Figure 5-14 shows the clustering
results on such a surrogate dataset. The model correctly fails to cluster the data since
the temporal structure is ruined. Figure 5-15 shows the tuned preference of the
Figure 5-12. Surrogate data set destroying spatial information
Figure 5-13. Tuned preference after clustering (spatial surrogate)
clusters after clustering. This figure correctly shows that there is no tuned preference
since the model failed (as hoped).
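A minimal sketch of the temporal surrogate, again assuming a (time bins × neurons) count matrix: whole time bins are permuted so each bin's channel vector moves together.

```python
import numpy as np

rng = np.random.default_rng(4)

def temporal_surrogate(counts):
    """Destroy temporal structure: permute whole time bins so each
    bin's full vector of channel counts moves together, keeping the
    within-bin (spatial) relationships intact."""
    return counts[rng.permutation(counts.shape[0])]

counts = rng.integers(0, 5, size=(1000, 4))  # (time bins, neurons)
surr = temporal_surrogate(counts)
```

Per-channel totals (and all within-bin relationships) are preserved; only the ordering in time is scrambled, which is the property this surrogate is designed to test.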
Figure 5-14. Surrogate data set destroying temporal information
Figure 5-15. Tuned preference after clustering (temporal surrogate)
Clustering with the DC-HMM.
Figure 5-16 demonstrates the clustering results using the DC-HMM on the two-class
simulated data set. For this particular experiment the number of clusters k = 2. The
figure shows that the model is able to correctly cluster the data in a relatively small
number of iterations (three to four). The figure also shows that the class labels,
randomly assigned to the two classes on the first iteration (indicated by green and
brown colors), start to converge to a pattern similar to the kinematics from which the
input was derived (the cosine and sine wave at the bottom of the figure). Although the kinematics are
shown below the class labels, the clustering results are acquired solely from the input
space.
Figure 5-16. DC-HMM clustering results (class=2, K=2)
Figure 5-17 illustrates the hidden state transitions for the DC-HMM on the simulated
data set. Interestingly a repeating pattern matches the corresponding kinematics.
Additionally the figure shows that pairs of neurons are actively involved in the hidden
state space. This pairing is expected since the simulated neurons are in pairs for the
two classes with the extra three neurons being random noise. The result is further
confirmed with Figure 5-18 where only four neurons have active couplings. Furthermore,
the correct coupling is observed between neurons.
Figure 5-17. DC-HMM clustering hidden state transitions (class=2, K=2)
Figure 5-18. DC-HMM clustering coupling coefficient (class=2, K=2)
Figure 5-19 presents the likelihoods generated at each iteration of the clustering
round; the two likelihood plots correspond to the two classes. The figure clearly shows a
converging likelihood for this particular data set. Overall, the DC-HMM captures the
underlying classes/clusters in the simulated neural data.
Figure 5-19. DC-HMM clustering log-likelihood reduction during each round (class=2, K=2)
For the simulation in Figure 5-20, the correct number of clusters k = 4 matches the
underlying number of classes in the input data. The model correctly recovers the
underlying clusters present in the neural data. Unfortunately, the DC-HMM clustering
produces a few more errors than the LM-HMM clustering of the same data.
Figure 5-21 shows the hidden state transitions for the DC-HMM on the simulated
data set with four classes. Interestingly, repeating patterns are observed that correspond
to the kinematics, though not as obviously as in the two-class version. Additionally, the
figure shows that pairs of neurons are actively involved in the hidden state space. This
pairing is expected since the simulated neurons are paired within the classes (excluding
the extra noise neurons). This result is further confirmed with Figure 5-22 where the
correct couplings are observed between neurons.
Figure 5-20. DC-HMM clustering simulated neurons (class=4, K=4)
Figure 5-21. DC-HMM clustering hidden state space transitions between neurons
Figure 5-22. DC-HMM clustering coupling coefficient between neurons (per Class)
Self-Organizing Maps.
To provide a fair comparison of the clustering methods described above, the
simulated data sets are clustered with one of the most common clustering techniques
available, Self-Organizing Maps (SOM). A SOM, sometimes called a Kohonen map
[73], is a type of unsupervised neural network whose goal is to learn a representation
of the input space. SOMs also differ from other neural networks in that they use a
neighborhood function to preserve the topological properties of the input space. For
more details on the SOM please see
Appendix C.
For the experiments described below, various numbers of processing elements
(PEs) were tested empirically. The best compromise between computational complexity
and performance was 25 PEs. The initial step size for the ordering phase was 0.9,
while the converging (mapping) phase started with a step size of 0.02. Although the
static version of the SOM was tested (and failed on this type of data), time embedding
was added to the SOM (Appendix C) in order to provide the best comparison.
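A minimal sketch of a time-embedded 1-D SOM in this spirit (illustrative sizes; for brevity it uses a single decaying-step-size schedule rather than the separate 0.9 ordering and 0.02 mapping phases described above):

```python
import numpy as np

rng = np.random.default_rng(5)

def time_embed(x, depth):
    """Stack `depth` consecutive time bins into one vector so the
    otherwise-static SOM sees short temporal context."""
    return np.stack([x[t - depth + 1:t + 1].ravel()
                     for t in range(depth - 1, len(x))])

def train_som(data, n_pe=25, epochs=20, lr0=0.9):
    """1-D SOM: winner-take-all with a Gaussian neighborhood that
    shrinks, and a step size that decays, over the epochs."""
    w = rng.standard_normal((n_pe, data.shape[1]))
    idx = np.arange(n_pe)
    for e in range(epochs):
        lr = lr0 * (1.0 - e / epochs)                    # decaying step size
        sigma = max(n_pe / 2.0 * (1.0 - e / epochs), 0.5)
        for x in data:
            win = np.argmin(((w - x) ** 2).sum(axis=1))  # best-matching PE
            h = np.exp(-((idx - win) ** 2) / (2.0 * sigma ** 2))
            w += lr * h[:, None] * (x - w)               # pull toward input
    return w

data = time_embed(rng.standard_normal((200, 4)), depth=5)  # toy binned rates
weights = train_som(data)
```

The neighborhood function is what preserves the topology of the input space; after training, each input can be assigned to its winning PE to produce cluster labels.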
Figure 5-23 shows the SOM results on the two class simulation with independent
neurons. The figure shows a successful clustering of the oscillating classes (as
discussed earlier). Classification performance was 96.4%. Overall, this clustering
model is able to cluster the simplistic simulation.
In order to test the robustness of the SOM, noise was added to the simulation.
Figure 5-24 shows the results when the probability of random firing is increased to 16%
per time bin (which is beyond real neural firing rates). Clearly the figure and classification results
(92.3%) demonstrate that the SOM is successful and robust enough to cluster the
simulation with noisy independent neurons.
Next, spatial and temporal surrogates are used to further test the robustness of
the SOM. Figure 5-25 presents the clustering results with the spatial surrogate data.
The classification result of 85.58% is misleading, since all of the spatial information
has been destroyed. The SOM is incorrectly capturing the large peaks in the firing
rates of some of the neurons (Figure 5-26). By construction, the spatial surrogate
randomly reassigns the firing rates of the different neurons to other neurons. This
is interesting because the SOM is failing by artificially capturing structure that does
not exist, simply selecting the bursting activity of some of the neurons. This incorrect
result may help to explain why some of the previous BMI results with non-linear filters
worked better on the ballistic food grasping task, since those types of neurons would be
more modulated (i.e., bursting) at points coinciding with the movement. In contrast, in
the cursor control task the neurons are always firing and modulating, thereby decreasing
the performance of the linear/non-linear models (while the graphical models do not
suffer as much).
Finally, temporal surrogate data is clustered with the SOM. In this instance the SOM
fails (as expected) at clustering this dataset (Figure 5-27). The clustering results were
Figure 5-23. SOM clustering on independent neural data (2 classes)
Figure 5-24. SOM clustering on independent neural data with noise (2 classes)
Figure 5-25. SOM clustering on independent neural data spatial surrogate (2 classes)
Figure 5-26. Neural selection by SOM on spatial surrogate data (2 classes)
just above random (53.04%), which is expected for a data set whose temporal structure
has been destroyed.
Figure 5-27. SOM clustering on independent neural data temporal surrogate (2 classes)
5.2.3 Dependent Neural Simulation Results
Figure 5-28 shows the spike output (100ms bins) from the dependent neurons. In
the figure, four dependently generated neurons and 100 fake neurons are shown with
darker colors indicating a higher firing rate. The figure does not reveal discernible
patterns, even though an alternating pattern underlies the data (as with the original
independent neurons). Figure 5-29 provides further evidence that simplistic patterns do
not exist, since the neurons are not specifically tuned to any particular angle.
Figure 5-28. Output from four simulated dependent neurons with 100 noise channels (Class=2)
Figure 5-30 shows the clustering results using the LM-HMM with the clustering
methodology. Despite no oscillating pattern being visible in the neural data, the
model is able to cluster correctly (as seen in the recovered pattern). The figure also
demonstrates that only a small number of clustering iterations are needed to discern
the pattern. Out of the 104 neurons, the model correctly identified the four that
were pertinent to the clusters.
Figure 5-29. Neural tuning for dependent neuron simulation
Interestingly, Figure 5-31 shows that the DC-HMM is also able to correctly cluster
the neural data with dependencies.
The most interesting results are shown in Figure 5-32. The figure shows the
clustering results from the time-embedded SOM. Clearly the SOM is not successful in
clustering the oscillating classes. The classification results are slightly above random
(53.98%). The SOM is not capturing the pattern since individual neurons are not
bursting or modulating with significant increases in firing rate. Although a state-based
model generated the data, this simulation provides more evidence that the graphical
models have the ability to capture the communication between neurons through time.
Figure 5-30. LM-HMM clustering simulated dependent neurons (class=2, K=2)
Figure 5-31. DC-HMM clustering simulated dependent neurons (class=2, K=2)
Figure 5-32. SOM clustering on dependent neural data (2 classes)
5.3 Experimental Animal Data
5.3.1 Rat Experiments
For the first experiment, 5000 data points of the single lever press experiment are
used for clustering. For the initial clustering round, all of the data points are randomly
labeled for the two classes (lever press and non-lever press). These experiments call for
many parameters to be initialized. These include
1. Observation length (window size)
2. Number of states
3. Number of clustering rounds
The observation length was varied from 5 to 15, corresponding to 0.5 to
1.5 seconds. The number of hidden states was varied from 3 to 5. Finally, the number
of rounds was varied from 4 to 10. After exhausting the different combinations, an
observation length (time window) of 10 was selected along with 3 hidden states and 6
rounds. These parameters were kept the same for all neural channels.
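The exhaustive combination search over these three parameters could be sketched as follows (illustrative only; the per-combination scoring of the clustering results is omitted):

```python
from itertools import product

# The three free parameters and the ranges searched in the text
window_sizes = range(5, 16)   # observation length: 0.5 s to 1.5 s (100 ms bins)
n_states = range(3, 6)        # number of hidden states
n_rounds = range(4, 11)       # number of clustering rounds

grid = list(product(window_sizes, n_states, n_rounds))

# Each combination would be trained and scored (scoring omitted here);
# the setting selected in the text was window 10, 3 states, 6 rounds:
chosen = (10, 3, 6)
```

In practice, each tuple in `grid` corresponds to one full clustering run, with the same parameters shared across all neural channels.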
Figure 5-33 illustrates that the model provides reasonable clustering of the two
classes in the single lever press. With respect to classification performance, the
model is able to correctly classify each class around 66% of the time. Remarkably,
despite 66% being a low number, the unsupervised clustering model achieves a
classification performance that rivals the supervised classification results. Of course,
to compare classifications the class labels must first be assigned, since they are unknown
after clustering. The labels are assigned based on kinematic features, like a lever press,
and the different priors (lever presses have far fewer samples than non-lever presses).
Under normal clustering (without classification comparisons), the class labels are
unknown. Most likely the experimental setup will need to focus on simple tasks by
which the patient expresses their desired goals (move arm left, etc.). This patient-based
direction would allow the different class labels to be appropriately assigned.
Figure 5-33. Rat clustering experiment, one lever, two classes
Figure 5-34 presents a slightly expanded view of Figure 5-33. Notice how the
clustering fails in certain locations. These failures may be due to the experimental setup,
since the rat moves around the cage without its movements being recorded.
In the next experiment, 10000 data points of the two-lever rat experiment are used
for clustering. For the initial clustering round, all of the data points are randomly labeled
for the two classes. The same parameters selected in the previous experiment are used
for this experiment. Unlike the previous experiment, this experiment includes the time
location for cue-signals and rewards. Figure 5-35 shows where the cue signals and
rewards are located as well as the lever presses (red is left, green is right). For example,
on the fifth cue-signal the rat was supposed to press left but instead pressed right (as
indicated by colors on the plot).
Figure 5-35 shows consistent and repeating clustering patterns. Unfortunately, the
results are not as good as the experiment with a single lever press. The difference may
Figure 5-34. Rat clustering experiment zoomed, one lever, two classes
Figure 5-35. Rat clustering experiment, two lever, two classes
be attributed to the type of experiment, since more primitives are being employed in this
double-lever experiment (as well as the lack of data discussed earlier).
Figure 5-36. Rat clustering experiment, two lever, three classes
Figure 5-36 shows the results when clustering the data for the two-lever press into
three classes. Interestingly, there are some consistent results, but visually they are
difficult to interpret. The consistencies involve the transitions from one class to another
(like red to green or green to blue). These transitions coincide with the occurrence of a
kinematic event (i.e., a lever press). Additionally, these transitions are similar to what
was observed in the simulations.
Figure 5-37 again shows some interesting behavior with the clustering results when
using four classes. There are consistencies when looking at the transitions from one
class to another before and after the lever presses. Unfortunately, there are not even
three or four distinct kinematic events applicable for classification, so only qualitative
interpretation is appropriate for this dataset. This lack of quantifiable metrics is why the
monkey data and simulations are more pertinent for testing the discussed methodologies.
Figure 5-37. Rat clustering experiment, two lever, four classes
Overall, the clustering results for the rat data are mixed. In future work, perhaps
isolating the time data around the lever presses when the rat is more focused on the
task could improve modeling. Although not presented here, preliminary results indicate
that this is the case.
5.3.2 Monkey Experiments
LM-HMM.
Two important questions must be answered with the following experiments.
First, what type of clustering results are obtained, i.e., are there repeating patterns
corresponding to the kinematics? Second, how does the trajectory reconstruction
from the unsupervised clustering compare against the trajectory reconstruction from
supervised BMI algorithms? Classification performance is not considered in these
experiments since there are no known classes to test against. Although angular bins
might serve the purpose for single neurons, it is known that some neurons do not simply
modulate for angular bins. Therefore, the correlation coefficient will serve as the metric
and allow a consistent comparison between the results in this work and previous work.
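As a sketch, the correlation coefficient metric for one kinematic coordinate can be computed as below; the sinusoidal trajectory is purely illustrative, not the monkey data.

```python
import numpy as np

def cc(actual, predicted):
    """Pearson correlation coefficient between one coordinate of the
    real trajectory and its reconstruction."""
    return np.corrcoef(actual, predicted)[0, 1]

# Purely illustrative trajectory: a sinusoid plus mild noise
t = np.linspace(0.0, 10.0, 500)
actual = np.sin(t)
predicted = actual + 0.1 * np.random.default_rng(6).standard_normal(500)
score = cc(actual, predicted)
```

In the experiments below, this metric is computed separately for each coordinate (e.g., CC(X) and CC(Y)) over the test trajectory.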
Multiple Monte Carlo simulations were computed to eliminate spurious effects from
initial random conditions (class labels, parameters, etc). The parameters that require
initialization include:
1. Observation length (window size)
2. Number of states
3. Number of clustering rounds
4. Number of classes
To determine the parameters, the observation length (window size) was varied
from 5 to 15 time bins (corresponding to 0.5 to 1.5 seconds), the number of hidden
states from 3 to 5, and the number of clustering iterations from 4 to 10. After exhausting
the different combinations, the parameters were set to an observation length of 5 time
bins, 3 hidden states, and 6 clustering iterations (since less than 5% of the labels
changed). These parameters are the same for each neural channel.
The model was initialized with four classes after an empirical search (using different
parameter sets). Since ground truths are unknown, trajectory reconstruction serves
as the basis for how many clusters to select. Specifically, the number of clusters is
adjusted based on reconstruction performance. Qualitatively, Figure 5-38 shows the
labeling results from the LM-HMM clustering. The y-axis represents the number of
iterations, from the initial random labels (top) to the final clustering iteration (bottom).
Each color in the clustering results corresponds to a different class (four in all). The
kinematics (x and y velocities) are overlaid at the bottom of the image for this cursor
control experiment. Figure 5-38 shows repeating patterns for similar kinematic profiles.
These repetitive class transitions were also observed in the simulated data. Figure 5-39
shows that the trajectory reconstruction matches the original trajectory very closely,
thereby indirectly (qualitatively) validating the segmentation produced by the clustering
method.
Figure 5-38. LM-HMM cluster iterations (Ivy 2D dataset, k=4)
For a quantitative understanding, the correlation coefficient (CC) is a way to show
if the clustering results have merit. Interestingly, the CC results for this unsupervised
clustering are slightly better than the supervised non-linear TDNN and linear Wiener
filter, as shown in Table 5-1. As expected, random labeling of the classes produces
Figure 5-39. Reconstruction using unsupervised LM-HMM clusters (blue) vs. real trajectory (red)
poor results compared to actual clustering. Additionally the random labeling results
are similar to other supervised BMI models. As discussed earlier, the similar results
are due to the random clusters providing generalization of the full space for each filter
(thereby becoming equivalent to a single Wiener filter). Remarkably, Table 5-1 shows
that the correlation coefficient produced by the unsupervised LM-HMM clustering is only
slightly less than the correlation coefficient produced with the supervised version of the
LM-HMM. This result is understandable since the supervised version of the LM-HMM
consistently isolates the neural data based on kinematic features that were imposed by
the user in labeling the classes.
Table 5-1. Correlation coefficient using LM-HMM on 2D monkey data
Experiment CC(X) CC(Y)
LM-HMM (unsupervised) .77±.07 .66±.13
SOM (unsupervised) .65±.04 .59±.12
LM-HMM .79±.04 .75±.11
Single Wiener .66±.02 .48±.10
NMCLM(FIR) .67±.03 .50±.07
NMCLM(Gamma) .67±.02 .47±.07
NLMS .68±.03 .50±.08
Gamma .70±.02 .53±.09
Subspace .70±.03 .58±.08
Weight Decay .71±.03 .57±.08
Kalman .71±.03 .58±.10
TDNN .65±.03 .51±.08
DC-HMM.
In the following experiments, the DC-HMM is used to cluster the food grasping
task and the cursor control task. For each experiment, Wiener filters are applied to
the clustered labels for each class in order to see how the trajectory reconstruction
compares to the supervised reconstruction. An observation length of five and three
hidden states are used to keep the computational load low (these settings empirically
showed adequate results). Additionally, k was set to four classes based on empirical results.
Table 5-2 shows that the correlation coefficient for the unsupervised DC-HMM
clustering reconstruction is better than that obtained using the SOM, random labeling,
or a single Wiener filter. Unfortunately, the results are not as good as the supervised
version of the DC-HMM or LM-HMM. Supervised results are expected to be better since
the kinematic features can serve as classes that demarcate partitionable regions of the
input space, allowing the model to specialize (as opposed to globally learning the class labels).
Table 5-2. Correlation coefficient using DC-HMM on 3D monkey data
Experiment CC
DC-HMM (unsupervised) .72±.19
SOM (unsupervised) .70±.18
Single Wiener .68±.19
NLMS .68±.20
Figure 5-40 demonstrates that the clustering pattern corresponds to the
kinematics. Unfortunately, there are areas of error, or at least what could be perceived
as error, since there are no known ground truths to test against. Nevertheless, the CC
is slightly degraded relative to a supervised version. The degraded results are
attributable to a lack of information between channels. Figure 5-41 shows that there are
again more independent neurons than dependent. Perhaps the LM-HMM exploits the
independent neurons better since it builds a consensus among the channels rather than
modeling an explicit joint distribution.
Figure 5-40. DC-HMM clustering on monkey food grasping task (2 classes)
Figure 5-41. Coupling coefficient from DC-HMM clustering on monkey food grasping task (2 classes)
For the cursor control monkey data, Table 5-3 illustrates that the DC-HMM produces
CC results similar to those of the LM-HMM. Again, both of these clustering results are
better than the supervised versions of the non-linear TDNN and the single linear Wiener
filter. One reason for the improvement on the cursor control experiment over the food
grasping task is that the movements tend to be very velocity based (quick) as opposed
to position based (holding in the air). Another interesting difference is that the cursor
control task has more dependent neurons, as shown in Figure 5-42. This would help to
explain why the Wiener filters produce better results on the food grasping task than on
cursor control (i.e., independent neurons simply modulate with the task). Additionally,
the simulations with dependent neurons showed that the state-space models were
better than the SOM at clustering this type of data.
Table 5-3. Correlation coefficient using DC-HMM on cursor control data
Experiment CC(X) CC(Y)
DC-HMM clustering .71±.09 .65±.13
LM-HMM clustering .77±.07 .66±.13
SOM clustering .65±.04 .59±.12
Single Wiener .66±.02 .48±.10
NLMS .68±.03 .50±.08
TDNN .65±.03 .51±.08
Figure 5-43 shows the effect of averaging the firing rates across a time window
equal to the observation length (five time bins). Each class produces a different firing
pattern across the six selected channels. Interestingly, some of the channels fire more
in the beginning portion of the window, while others fire more towards the end.
Additionally, Figure 5-44 shows the corresponding average kinematics of the four
classes. As mentioned earlier, the monkey makes mostly diagonal movements, and this
is clearly observed in the figure (as hoped).
Figure 5-42. Coupling coefficient from DC-HMM clustering on monkey cursor control task (4 classes)
Figure 5-43. Average firing rate per class (4 classes, 6 neurons)
Figure 5-44. Average velocity per class (4 classes, 6 neurons)
5.3.3 Discussion
The clustering model discussed in this chapter demonstrated the ability to discover
useful clusters while operating solely in the neural input space. The results were first
justified with realistic neural simulations that also included noisy and fake neurons.
Despite the added noise, the clustering method is able to successfully determine the
underlying separation. The division of neural input space was based on the hypothesis
that animals transition between neural state structures during goal seeking analogous to
the motion primitives exhibited during the kinematics [35]. Then the clustering method
was compared to conventional BMI signal processing algorithms on real neural data.
Although trajectory reconstruction is used to show the validity of the clusters, the model
could also serve as the front end for a co-adaptive algorithm or for goal-oriented tasks
(simple classifications that paraplegic patients could select, i.e., move forward).
Despite these encouraging results, improvements in performance are achievable
for the hierarchical clustering. For example, the generative models in the hierarchical
clustering framework may not be taking full advantage of the dynamic spatial relationships.
Although the hierarchical training methodology does create dependencies between the
HMM experts, perhaps there are better ways to exploit the dependencies or aggregate
the local information. As shown by the coupling coefficients and Viterbi state paths
from the DC-HMM results, there may be important dependencies: different neural
processes interact with one another in an asynchronous fashion, and that underlying
structure could provide insight into the intrinsic communication occurring between
neurons.
As a final point, an interesting effect was observed in the experiments (with both
simulated and real neural data). Looking closely at some of the results, consistent
transitions occur from certain classes to others. For example, there may be a consistent
transition from class 1 to class 3 and from class 2 to class 1. Investigating this
phenomenon further would be worthwhile. Perhaps there is a switching behavior
between stationary points in the input space that could be exploited. This exploration
could also give rise to neurophysiologic understanding of the underlying communication
between neurons. Chapter 6 explores this possibility by looking within the model's state
transitions.
CHAPTER 6
CONCLUSION AND FUTURE WORK
Brain machine interfaces have the potential to restore movement to patients
experiencing paralysis. Although great progress has been made towards BMIs, there
is still much work to be done. This dissertation addressed some of the problems
associated with the signal processing side of BMIs.
Since there is a lack of understanding of how neurons communicate within the
brain, probabilistic modeling was argued to be the best approach for BMI experiments.
Specifically, generative models were proposed with hidden variables to help model the
multiple interacting processes (both hidden and observable). This led to a paradigm
shift similar to divide-and-conquer, but with generative models. The generative
models also demonstrated that they can operate solely in the neural input space. This
is appropriate since the desired kinematic data is not available from paralyzed patients.
Most BMI algorithms ignore this very important point.
Three major limitations were also addressed with generative models in this
dissertation. These include training the parameters, defining the neural state structure,
and segmenting the data (or clustering). Chapter 3 dealt with training the parameters
and with making a-priori assumptions about the neural state structures. A simple
independence assumption was made first; the chapter then explored implicit and
explicit dependencies between neural channels. Hierarchical modeling frameworks
were presented, and spatial relationships were demonstrated to be important since
they can improve or maintain results with less data. Additionally, competitive training
methodologies were introduced that reduced the input space by having neurons
compete for inclusion in the final models. Finally, a fully coupled structure
was discussed to see if adding more dependencies to the model improved performance.
This fully coupled structure exposed the limits of increasing model complexity under
data limitations. With respect to modeling, the fully coupled structure provides insights
into the underlying biological relationships between neurons.
A-priori knowledge of the graphical structure and class labels were the final
problems with BMI data that the dissertation addressed. Chapter five presented a
clustering approach in which the data space was partitioned based on a likelihood
criterion. This treated the model as a centroid for each cluster and the parameters
were found accordingly. In the next section discussing future work, we will show the
inverse methodology and cluster in the model space using an optimal state-path finding
algorithm to define chain-like models across time and neural channels. Although the
results are similar to those of the likelihood-based method, this approach provides a way
to understand structure and dependencies between channels in time. Understandably,
some results are similar since both methods are approximations to unknown joint
probabilities between many variables (for which we do not have enough data to support
a full network).
Some of the advantages of the generative models over the conventional BMI
signal processing algorithms were also confirmed. This included modeling neurons
that decrease their firing during movement and the ability to separate the neural input
space. The partitioning of the neural input space was based on the hypothesis that
animals transition between neural state structures during goal seeking. These neural
structures are analogous to the motion primitives exhibited in the kinematics [35].
The results were justified by the improvement in trajectory reconstruction. Specifically,
the correlation coefficient on the trajectory reconstruction served as a metric to compare
against other BMI methods. Additionally, simulations were used to show the model’s
ability to cluster unknown data with underlying dependencies. This is necessary since
there are no ground truths in real neural data.
Overall, the work described in this dissertation demonstrated improved approaches
to modeling BMIs. This work also addressed the most important problem for patients
without limbs (i.e., the unavailability of desired kinematic data for training). We did this
by presenting clustering results on simulated and real data that outperform supervised
results. Ultimately, the modeling methodologies within this dissertation can be used as a
front end for either forward modeling (e.g., a linear filter) or for a BMI goal-oriented
system.
6.1 Future Work
Chapter 5 described a clustering methodology that partitioned the data space with
the LM-HMM and DC-HMM models. These probabilistic models represent dynamic
temporal data and act as centroids, while the computed likelihood of each model
determines the appropriate cluster label (similar to K-means). In effect, this clustering
methodology produces a natural grouping of the spatial-temporal exemplars into K
clusters. Unfortunately, using only the data space limits the model's ability to
discriminate clusters. Perhaps the richness of the model space could serve in clustering
the data and allow insight into the neurophysiological interactions between neurons.
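As an illustrative sketch (not the implementation used in this work), the likelihood-based assignment step can be written in Python for simple discrete-observation HMM centroids. The toy one-state models, the `refit` hook, and all parameter values below are hypothetical:

```python
import numpy as np

def log_likelihood(pi, A, B, obs):
    """log P(O | lambda) via the scaled forward recursion."""
    alpha = pi * B[:, obs[0]]
    ll = np.log(alpha.sum()); alpha = alpha / alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        ll += np.log(alpha.sum()); alpha = alpha / alpha.sum()
    return ll

def likelihood_cluster(models, sequences, rounds=5, refit=None):
    """K-means-style clustering: each HMM acts as a centroid and each
    sequence joins the model under which it is most likely."""
    labels = None
    for _ in range(rounds):
        labels = [int(np.argmax([log_likelihood(*m, s) for m in models]))
                  for s in sequences]
        if refit is not None:  # e.g. one EM pass per cluster (omitted here)
            models = [refit(models[k],
                            [s for s, l in zip(sequences, labels) if l == k])
                      for k in range(len(models))]
    return labels, models

# Hypothetical one-state models over two symbols: model 0 favors symbol 0,
# model 1 favors symbol 1.
pi = np.array([1.0]); A = np.array([[1.0]])
B0 = np.array([[0.8, 0.2]]); B1 = np.array([[0.2, 0.8]])
models = [(pi, A, B0), (pi, A, B1)]
seqs = [[0, 0, 0], [1, 1, 1], [0, 1, 0]]
labels, _ = likelihood_cluster(models, seqs, rounds=1)
```

Here the three toy sequences are assigned to whichever centroid model explains them best, mirroring the assign-then-refit loop described above.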
Therefore, this chapter explores a clustering methodology that is based on the
model space or the actual model structure (i.e. neural structure) to discriminate the
neural data for particular motion primitives. Specifically, the model structure directly
relates to the dependencies of the hidden states representing neural dependencies (at
least within the data rather than biologically). In the previous chapters, the goal was
to find the global (inter-model) dependencies between channels while now the local
(intra-model) dependencies are sought through time and across the neural channels.
Figure 6-1 shows the bipartite graph in which the models are now sequestering the data
points.
Clustering the model structures or parameters requires a structure learning
algorithm. These types of algorithms employ searching and scoring functions in
order to build the structure of the model. Since the number of model structures is
large (exponential), a search method is needed in order to decide which structures
to score.

Figure 6-1. Bipartite graph of exemplars (x) and models

Even a graphical model with a small number of nodes contains too many
networks to score exhaustively. A greedy search could be done by starting with an initial
network (with or without connectivity) and iteratively adding or deleting an edge, measuring
the accuracy of the resulting network at each stage, until a local maximum is found.
Alternatively, a method such as simulated annealing could guide the search toward the
global maximum. Iterating through all of the structures is computationally expensive. The
problem is further complicated when multiple structures must be determined for multiple
clusters or classes. Unfortunately, when the classes are undefined (as in the experiments
in this dissertation), finding these structures is intractable. Therefore, a simple iterative
method must be employed to approximate the structures for each respective class.
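To make the search-and-score idea concrete, here is a minimal greedy hill-climbing sketch in Python. The scoring function is a hypothetical stand-in for a real network score such as BIC, and the "true" graph exists only for this toy example:

```python
import itertools

def greedy_structure_search(n_nodes, score, max_iter=100):
    """Greedy hill-climbing over directed edges: start with the empty graph,
    then repeatedly flip (add or delete) the single edge that most improves
    the score, stopping at a local maximum."""
    edges = set()
    best = score(edges)
    for _ in range(max_iter):
        candidates = []
        for e in itertools.permutations(range(n_nodes), 2):
            trial = edges ^ {e}                    # flip one edge
            candidates.append((score(trial), trial))
        top_score, top_edges = max(candidates, key=lambda c: c[0])
        if top_score <= best:
            return edges, best                     # local maximum reached
        edges, best = top_edges, top_score
    return edges, best

# Toy score (hypothetical): reward edges of a hidden "true" graph, penalize extras.
true_edges = {(0, 1), (1, 2)}
score = lambda g: len(g & true_edges) - len(g - true_edges)
g, s = greedy_structure_search(3, score)
```

With this toy score, the climb recovers the planted two-edge graph; with a real likelihood-based score, the same loop would face the exponential landscape described above, which is why the dissertation resorts to a simpler iterative approximation.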
Rather than use a conventional search method, a single generative model will be
trained over the full data set. Then a state-path finding algorithm similar to Viterbi will
find the most likely paths for each data exemplar. These paths represent the plausible
dependency structure between channels through time. Once these structures are
found, the corresponding neural data will be grouped together. Obviously, the number of
structures will be large; therefore, these structures need to be merged into fewer cluster
centers, otherwise the computational complexity could become significantly worse.
The next section will explore this method for finding plausible structures. First,
an optimal state path finding algorithm similar to Viterbi will be discussed. Then a
simple histogram methodology will be outlined for clustering the structures. Finally, a
simple way to trim the number of structures will be discussed. After the methodology is
explained, experiments with the simulation data as well as the real neural data will be
discussed.
6.1.1 Towards Clustering Model Structures
There are several ways to find the optimal state sequence associated with given
observation sequences. The difficulty lies with the definition of the optimal state
sequence, i.e., there are several optimality criteria. We define

ζ_t^{(c)}(i) = P(q_t^{(c)} = Q_i^{(c)} | O, λ)       (6–1)

i.e., the probability of being in state Q_i^{(c)} at time t, given the observation sequence
O and the model λ in chain c. Equation 6–1 can be expressed simply in terms of the
forward variables, i.e.,

ζ_t^{(c)}(i) = α_t^{(c)}(i) / P(O|λ)
             = [ b_j^{(c)}(O_t) Σ_{c'=1}^{C} Θ_{c'c} Σ_{i=1}^{N^{(c')}} α_{t-1}^{(c')}(i) a_{ij}^{(c',c)} ] / [ ∏_{c=1}^{C} Σ_{j=1}^{N^{(c)}} α_T^{(c)}(j) ]       (6–2)
since α_t^{(c)}(i) accounts for the partial observation sequence O_1^{(c)} O_2^{(c)} ... O_t^{(c)} and
the contributions from each channel's previous states. The normalization factor
∏_{c=1}^{C} Σ_{j=1}^{N^{(c)}} α_T^{(c)}(j) makes ζ_t^{(c)}(i) a probability measure so that

Σ_{c=1}^{C} Σ_{i=1}^{N} ζ_t^{(c)}(i) = 1.       (6–3)
Using ζ_t^{(c)}(i), the most likely state q_t at time t can be solved for individually, as

q_t = argmax_i [ζ_t^{(c)}(i)],   1 ≤ t ≤ T       (6–4)
By selecting the most likely state on a particular neural channel for each time step,
the size of the model structure is greatly reduced. Essentially, a chain structure G_k is
constructed through time and across the channels. Since each model G_k has the same
number of parameters, using BIC to score the models is inappropriate. Therefore a
different approach is necessary to group the models G_k. Finding a different approach
is difficult since there are (CN)^T structures even in the simplified formulation discussed
above.
One clue comes from looking at the empirical results. Empirically, there are far fewer
realizations of the structures, k < (CN)^T. By keeping only those structures observed
more than twice (merging the singleton observations into a single cluster), significantly
fewer models are empirically observed. This led to a simple histogram method
to cluster the models G_k. In order to further reduce the number of models, the two
most frequently observed models are chosen (with the rest of the samples relabeled as a
third class).
Let the data set D consist of N sequences for J neural channels, D = {S_1^1, ..., S_N^J}, where
S_n^j = (O_1^j, ..., O_T^j) is a sequence of observables of length T.
Specifically, the full data set is used and all exemplars S train a single model.
Initially, the parameters are randomized and trained until they converge to an initial
estimate. Then the most likely state path is found for each training exemplar, and a
histogram is built from the resulting structures. The algorithm is as follows:
1. Train the DC-HMM λ on the full data set D as described in Chapter 3.

2. Compute the most probable path q_t = argmax_i [ζ_t^{(c)}(i)], 1 ≤ t ≤ T, on the state
trellis as described above for each sequence S.

3. Build a histogram over the resulting model structures.

4. Re-label the data set D based on the K class assignments and train new models.

5. Repeat from step 2 until a threshold is met or convergence.
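The steps above can be sketched in Python for a single discrete-observation chain. The forward recursion and the per-step argmax follow the style of Eq. 6–4; the multi-chain DC-HMM machinery, the retraining in steps 4–5, and the toy parameters are omitted or hypothetical:

```python
import numpy as np
from collections import Counter

def forward(pi, A, B, obs):
    """Forward variables alpha_t(i) for one discrete-observation chain,
    normalized per step to avoid underflow."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = (alpha[t-1] @ A) * B[:, obs[t]]
        alpha[t] /= alpha[t].sum()
    return alpha

def state_path(pi, A, B, obs):
    """Most likely state per time step (per-step argmax of the forward
    variable, in the spirit of Eq. 6-4; not a full Viterbi decode)."""
    return tuple(forward(pi, A, B, obs).argmax(axis=1))

def cluster_by_structure(pi, A, B, sequences, top_k=2):
    """Steps 2-4: histogram the state paths and keep the top_k structures;
    everything else is merged into a remainder class (label top_k)."""
    paths = [state_path(pi, A, B, s) for s in sequences]
    counts = Counter(paths)
    top = [p for p, _ in counts.most_common(top_k)]
    return [top.index(p) if p in top else top_k for p in paths]

# Hypothetical single-chain model: 2 states, 3 symbols.
pi = np.array([0.6, 0.4])
A  = np.array([[0.9, 0.1], [0.2, 0.8]])
B  = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
seqs = [[0, 0, 1], [2, 2, 1], [0, 0, 1], [0, 1, 0]]
labels = cluster_by_structure(pi, A, B, seqs, top_k=2)
```

Each sequence is reduced to its chain of most-likely states, and sequences sharing a frequent chain are grouped, which is the histogram-based merging described in step 3.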
6.1.2 Preliminary Results
Independent neural simulation data.
For this particular experiment, the number of clusters is k = 2. Empirically, the number
of hidden states was varied from 2 to 5, with the final selection being 3 hidden states for
the model. Based on prior results, a sequence length of 5 was chosen along with 2 to 6
rounds of clustering.
Figure 6-2 demonstrates the clustering results using the DC-HMM on the two-class
simulated data set. The figure clearly shows that the model is able to correctly cluster
the data. The classification result (based on prior labeling) is 82%, which is comparable
to the likelihood-based clustering in Chapter 5. Although the kinematics are shown below
the class labels, the clustering results are acquired solely from the input space.
Figure 6-3 shows the Viterbi paths for the DC-HMM on the simulated data set.
Interestingly, there are repeating patterns that correspond to the kinematics. Additionally,
the figure shows that pairs of neurons are actively involved in the state transitions in the
hidden state space. This is expected since the simulated neurons are paired for the two
classes.
Food grasping task.
Two results are to be observed from the following experiments. First, what type
of clustering result is obtained, i.e., are there repeating patterns corresponding to
the kinematics? Second, how does the trajectory reconstruction compare against the
supervised version? Classification is not computed since there are no known ground
truths for the input space (everything like angular bins is imposed by the user) and
unknown kinematic features may be represented in the neural data.

Figure 6-2. Hidden state transitions DC-HMM (simulation data, 2 classes)
To guard against spurious results, multiple Monte Carlo simulations were executed
to confirm the empirical findings. Additionally, there are many parameters to be initialized.
These include:
1. Observation sequence length (window size)
2. Number of states
3. Number of clustering rounds
4. Number of classes
The observation length was varied from 5 to 15 time bins (100 ms each), which
corresponds to 0.5 to 1.5 seconds. The number of hidden states was varied from 3 to 5.
Finally, the number of clustering iterations was varied from 4 to 10. After exhausting the
different combinations, the parameters were set as follows: an observation length of 5,
3 hidden states, and 6 clustering rounds (since less than 5% of the labels changed).

Figure 6-3. Hidden state transitions DC-HMM (simulation data, 2 classes)
Table 6-1 shows that the correlation coefficient on the unsupervised DC-HMM
model-based clustering reconstruction is better than using random labeling or a single
Wiener filter. Unfortunately, the result is only as good as the likelihood-based clustering.
One reason for this is that both clustering techniques rely on approximations. Figure 6-4
shows the histogram for the state models. Clearly, there are a reasonable number of
models found empirically, with one particular model observed most often (corresponding
to non-movement).

Table 6-1. Correlation coefficient using DC-HMM on 3D monkey data
Experiment                       CC
DC-HMM (model clustering)        .70±.21
DC-HMM (likelihood clustering)   .72±.19
Single Wiener                    .68±.19
NLMS                             .68±.20

Figure 6-5 shows some of the actual Viterbi paths taken by the model.
The y-axis represents the different models while the x-axis is the time bins (of length
five). Each color represents a particular state observed by a corresponding neuron
(similar colors represent the same neuron in the same state).
Figure 6-4. Histogram of state models for the DC-HMM (food grasping task)
Although the underlying approximations lead the final results to be similar,
the computed α's in Figure 6-6 show that the underlying structure and dependencies
can be discerned. This information may prove useful in a biological setting or in future
front-end analysis for different algorithms, and needs further exploration.
Cursor control.
For the cursor control monkey data, Table 6-2 shows that the model-based
clustering produces slightly worse results than the likelihood-based clustering. Both of
these clustering results are again better than the supervised versions of the non-linear
TDNN and the single linear Wiener filter.
Figure 6-5. State models for the DC-HMM (food grasping task)
Figure 6-6. Alphas computed per state per channel DC-HMM (food grasping task)
Another example is in Figure 6-7, which shows the α's computed across channels
and states, with the kinematic variable (the velocity difference between the two dimensions)
also plotted on the figure. Obviously, there are difficulties in characterizing the
underlying neural structure with respect to a kinematic structure. This is due to the
complexity of arm kinematics, which is executed with millions of muscle fibers.
Figure 6-7. Alphas across state-space of the DC-HMM (cursor control task)
Table 6-2. Correlation coefficient using DC-HMM on 2D monkey data
Experiment CC(X) CC(Y)
DC-HMM model clustering .65±.09 .61±.15
DC-HMM likelihood clustering .71±.09 .65±.13
LM-HMM clustering .77±.07 .66±.13
Single Wiener .66±.02 .48±.10
NLMS .68±.03 .50±.08
TDNN .65±.03 .51±.08
Although clustering the model parameters yielded results similar to clustering
in the data space, there is an added benefit to observing dependent structures. This
may prove beneficial from a biological perspective and yield better modeling in the
future as a front-end system.
Additionally, there is much to be improved over the simple naive k-means clustering
of the models. More advanced techniques could also be used to find probable structure
paths in the path optimization stage of the algorithm. The single path requirement could
also be relaxed to provide richer models.
Interestingly, the difference between this model-based clustering and the likelihood
version is that clustering occurs dynamically within the model as it is trained with each
new sequence. With the likelihood version, clustering instead occurs after each round,
since the models are updated afterwards. This may lead to a dynamic adaptation
of the structures as new sequences are acquired after an initial clustering on a training
set.
6.2 Contributions
The following list summarizes the contributions of this dissertation:

• Showed that changing the BMI paradigm by partitioning the input is beneficial, through the use of a bi-model structure and kinematic reconstruction (as well as simulations)

• Developed a method to determine the importance of individual neurons in model performance with respect to a graphical modeling framework (BM-HMM neural selections)

• Developed a simple yet powerful method based on boosting and competitive training that creates weak dependencies between neural channels and allows for reduction of the input space

• Demonstrated the graphical model's ability to capture neurons that decrease firing during movement as just a particular state

• Developed a hierarchical modeling methodology that modeled stronger dependencies between neural channels without dramatically increasing computational complexity on a reduced neural subset

• Demonstrated powerful simulations that could isolate the strengths and weaknesses of the different models

• Developed clustering methods that segmented the input space using log-likelihood as a distance metric, using the model as a centroid

• Developed a simple clustering method based on the model space (i.e., hidden state transitions) and a simple search-and-score algorithm
APPENDIX A
WIENER FILTER
The Wiener-Hopf solution is used to estimate the weight matrix for the Wiener filter
W_Wiener = R^{-1} P       (A–1)

where R is the correlation matrix of the neural spike inputs with dimension (L·M) x (L·M),
R = [ r11  r12  ..  r1M
      r21  r22  ..  r2M
      ..   ..   ..  ..
      rM1  rM2  ..  rMM ]
and rij is the LxL cross-correlation matrix between neurons i and j (i ≠ j), and rii
is the LxL autocorrelation matrix of neuron i. P is the (L·M)xC cross-correlation matrix
between the neuronal bin count and hand position as
P = [ p11  ..  p1C
      p21  ..  p2C
      ..   ..  ..
      pM1  ..  pMC ]
where pic is the cross-correlation vector between neuron i and the c-coordinate
of hand position. Given the assumption that the error is white Gaussian and the data is
stationary, the estimated weights W_Wiener are found to be optimal.
Essentially, y = W_Wiener^T x minimizes the mean square error (MSE) cost function

J = E[||e||²],   e = d − y       (A–2)
Each sub-block matrix rij can be decomposed as

rij = [ rij(0)    rij(1)    ..  rij(L−1)
        rij(−1)   rij(0)    ..  rij(L−2)
        ..        ..        ..  ..
        rij(1−L)  rij(2−L)  ..  rij(0) ]
where rij(τ) represents the correlation between neurons i and j at time lag
τ. These correlations, which are the second-order moments of the discrete-time random
processes xi(m) and xj(k), are functions of the time difference (m − k) under
the assumption of wide-sense stationarity (m and k denote discrete time instances for
each process) [59, 19]. In this case, the estimate of the correlation between two neurons,
rij(m − k), can be obtained by
rij(m− k), can be obtained by
rij(m− k) = E[xi(m)xj(k) ≈ 1
N − 1
N∑n=1
xi(n−m)xj(n− k),∀i, j ∈ (1, ..., M) (A–3)
The cross-correlation vector pic can be decomposed and estimated in the same
way. rij(τ) is estimated using equation A–3 from the neuronal bin count data, with xi(n)
and xj(n) being the bin counts of neurons i and j respectively. From equation A–3, it can
be seen that rij(t) is equal to rji(−t).
Figure A-1. Topology of the linear filter for three output variables
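As a minimal numerical sketch of the Wiener-Hopf estimate above (Eqs. A–1 and A–3), the following Python snippet builds a lagged design matrix from binned counts X, estimates R and P, and solves for W; the toy data, the delay relationship, and the function name are hypothetical:

```python
import numpy as np

def wiener_weights(X, D, L):
    """Wiener-Hopf solution W = R^{-1} P estimated from data.
    X: (T, M) binned neural counts, D: (T, C) hand positions, L: number of lags.
    The lagged design matrix stacks L delayed copies of each channel; the
    scaling of R and P by the sample count cancels in the solve."""
    Z = np.hstack([np.roll(X, lag, axis=0) for lag in range(L)])[L - 1:]
    Dt = D[L - 1:]
    R = Z.T @ Z            # (L*M, L*M) correlation matrix estimate
    P = Z.T @ Dt           # (L*M, C) cross-correlation estimate
    return np.linalg.solve(R, P)

# Hypothetical toy data: 2 channels; the output is channel 0 delayed by 1 bin,
# so an exact linear solution exists.
rng = np.random.default_rng(0)
X = rng.random((200, 2))
D = np.roll(X[:, [0]], 1, axis=0)
W = wiener_weights(X, D, L=3)
Z = np.hstack([np.roll(X, lag, axis=0) for lag in range(3)])[2:]
err = float(np.max(np.abs(Z @ W - D[2:])))   # near zero for this toy case
```

Because the toy output is an exact lagged copy of one input channel, the recovered weights reproduce it almost perfectly, illustrating why the Wiener filter serves as the standard baseline in the reconstruction tables.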
APPENDIX B
PARTIAL DERIVATIVES
To simplify notation, let

δ_{x,y} = 1 if x = y, 0 if x ≠ y,   and   z_{ijt}^{(c',c)} = a_{ij}^{(c',c)} b_j^{(c)}(o_t^{(c)}).

Let w = (π, A, B, Θ) be a parameter vector. Then

α_t^{(c)}(j) = π_j^{(c)} b_j^{(c)}(o_1^{(c)}),   t = 1
α_t^{(c)}(j) = Σ_{c'} Σ_i z_{ijt}^{(c',c)} α_{t-1}^{(c')}(i),   2 ≤ t ≤ T       (B–1)

∂P/∂w = Σ_c (P/P^{(c)}) (∂P^{(c)}/∂w) = Σ_c (P/P^{(c)}) Σ_{j=1}^{N} ∂α_T^{(c)}(j)/∂w       (B–2)

Using

∂α_t^{(c)}(j)/∂w = Σ_{c'} Σ_i ( (∂z_{ijt}^{(c',c)}/∂w) α_{t-1}^{(c')}(i) + z_{ijt}^{(c',c)} (∂α_{t-1}^{(c')}(i)/∂w) ),   2 ≤ t ≤ T       (B–3)

produces the first-order derivatives of α_t^{(c)}(j) with respect to each type of parameter
as follows:

∂α_t^{(c)}(j)/∂π_i^{(c1)} = δ_{i,j} δ_{c,c1} b_j^{(c1)}(o_1^{(c1)}),   t = 1
∂α_t^{(c)}(j)/∂π_i^{(c1)} = Σ_{c'=1}^{C} Σ_{k=1}^{N^{(c')}} z_{kjt}^{(c',c)} (∂α_{t-1}^{(c')}(k)/∂π_i^{(c1)}),   2 ≤ t ≤ T       (B–4)

∂α_t^{(c)}(j)/∂a_{i1,j1}^{(c1,c2)} = 0,   t = 1
∂α_t^{(c)}(j)/∂a_{i1,j1}^{(c1,c2)} = δ_{c,c2} δ_{j,j1} Θ_{c1,c2} b_{j1}^{(c2)}(o_t^{(c2)}) α_{t-1}^{(c1)}(i1)
    + Σ_{c'} Σ_i z_{ijt}^{(c',c)} (∂α_{t-1}^{(c')}(i)/∂a_{i1,j1}^{(c1,c2)}),   2 ≤ t ≤ T       (B–5)

∂α_t^{(c)}(j)/∂b_{j1}^{(c1)}(k) = δ_{o_1^{(c)},k} δ_{c,c1} δ_{j,j1} π_{j1}^{(c1)},   t = 1
∂α_t^{(c)}(j)/∂b_{j1}^{(c1)}(k) = Σ_{c'} Σ_i ( δ_{o_t^{(c)},k} δ_{c,c1} δ_{j,j1} Θ_{c',c1} a_{i,j1}^{(c',c1)} α_{t-1}^{(c')}(i)
    + z_{ijt}^{(c',c)} (∂α_{t-1}^{(c')}(i)/∂b_{j1}^{(c1)}(k)) ),   2 ≤ t ≤ T       (B–6)

∂α_t^{(c)}(j)/∂Θ_{c1,c2} = 0,   t = 1
∂α_t^{(c)}(j)/∂Θ_{c1,c2} = δ_{c,c2} Σ_i a_{ij}^{(c1,c2)} b_j^{(c2)}(o_t^{(c2)}) α_{t-1}^{(c1)}(i)
    + Σ_{c'} Σ_i z_{ijt}^{(c',c)} (∂α_{t-1}^{(c')}(i)/∂Θ_{c1,c2}),   2 ≤ t ≤ T       (B–7)
APPENDIX C
SELF ORGANIZING MAP
The SOM is most related to the idea of soft competition. The weights and inputs of
the model form a mapping between one another. Essentially, the weight vector of the
PE that is closest to the present input wins the competition. But with the SOM, the
neighbors of the winning PE also have their weights updated according to the competitive
rule

w_ij(n + 1) = w_ij(n) + η y_i(n) (x_j(n) − w_ij(n))       (C–1)
The lateral inhibition network is assumed to produce a Gaussian distribution centered
at the winning PE. This allows the algorithm to just find the winning PE and assume the
other PEs have an activity proportional to the Gaussian function evaluated at each PE’s
distance from the winner. The SOM competitive rule becomes
w_i(n + 1) = w_i(n) + Δ_{i,i*}(n) η(n) (x(n) − w_i(n))       (C–2)
where the ∆ function is a neighborhood function centered at the winning PE. During
each iteration the neighborhood function and step size change. The neighborhood
function ∆ is a Gaussian for the experiments described in the dissertation:
Δ_{i,i*}(n) = exp( −d²_{i,i*} / (2σ²(n)) )       (C–3)
with a variance that decreases with each iteration. At first, the full map is almost
covered; then at each iteration the variance shrinks toward a neighborhood of zero, finally
allowing only the winning PE to be updated. A linear decrease in neighborhood radius is
specified by
σ(n) = σ0(1− n/N0) (C–4)
Note that for the winning PE the adaptation rule defaults to the competitive update
wi∗(n + 1) = wi∗(n) + η(x(n)− wi∗(n)) (C–5)
The updates for the neighbors are reduced exponentially by the distance to the
winning PE. The network moves from a soft competition to hard competition as the
neighborhood shrinks.
Since the winning PE and its neighbors are updated at each step, the winner and
all of its neighbors move toward the same position, although the neighbors move more
slowly as their distance from the winning PE increases. Over time, this organizes the
PEs so that neighboring PEs (in the SOM output space) share the representation of the
same area in the input space (are neighbors in the input space), regardless of their initial
locations.
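A compact Python sketch of the SOM updates described above (Eqs. C–2 through C–4 and C–6); the 3x3 map size, the random 2-D inputs, and the schedule constants below are hypothetical:

```python
import numpy as np

def som_step(W, grid, x, eta, sigma):
    """One SOM update: find the winning PE, then pull every PE toward x,
    weighted by a Gaussian neighborhood around the winner (Eqs. C-2, C-3)."""
    winner = np.argmin(np.linalg.norm(W - x, axis=1))
    d2 = np.sum((grid - grid[winner]) ** 2, axis=1)  # output-space distances
    h = np.exp(-d2 / (2 * sigma ** 2))               # neighborhood function
    return W + eta * h[:, None] * (x - W)

# Hypothetical 3x3 map with 2-D inputs in [0, 1).
grid = np.array([[i, j] for i in range(3) for j in range(3)], dtype=float)
rng = np.random.default_rng(1)
W = rng.random((9, 2))
sigma0, eta0, N0 = 2.0, 0.3, 100
for n in range(N0):
    x = rng.random(2)
    sigma = max(sigma0 * (1 - n / N0), 1e-3)   # Eq. C-4 linear radius decay
    eta = eta0 * (1 - n / (N0 + 50))           # Eq. C-6 style rate schedule
    W = som_step(W, grid, x, eta, sigma)
```

As the radius shrinks, the update collapses to the hard-competition rule of Eq. C–5, matching the soft-to-hard transition described in the text.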
Figure C-1. Self-Organizing-Map architecture with 2D output
There are two phases in SOM learning. The first phase deals with the initial
ordering of the weights. During this phase the neighborhood function starts large,
covering the full output space to allow PEs that respond to similar inputs to be brought
together. The learning rate is also set to a large value (greater than 0.1) to allow the
network to self-organize. The scheduling of η is normally also linear
η(n) = η0(1 − n/(N + K))       (C–6)
where η0 is the initial learning rate and K helps specify the final learning rate.
The second phase of learning is called the convergence phase. In this longer phase
of the SOM, the learning rate is set to a smaller value (0.01) while using the smallest
neighborhood (just the PE or its nearest neighbors). This achieves a fine-tuning
of the weights. As with determining the number of clusters in most clustering
models, choosing the number of PEs is done empirically. The amount of training
time and the accuracy are balanced against how many PEs are chosen for the SOM.
REFERENCES
[1] Kamil, A. C. (2004). Sociality and the evolution of intelligence. Trends in CognitiveScience, 8, 195-197.
[2] David, M. (2002). The sociological critique of evolutionary psychology: Beyondmass modularity. New Genetics and Society, 21, 303-313.
[3] Walter J. Freeman, Mass Action in the Nervous System, University of California,Berkeley, USA
[4] J. Wessberg, C. R. Stambaugh, J. D. Kralik, P. D. Beck, M. Laubach, J. K. Chapin,J. Kim, S. J. Biggs, M. A. Srinivasan, and M. Nicolelis et al., “Real-time predictionof hand trajectory by ensembles of cortical neurons in primates,” Nature, Vol. 408,pp. 361-365, 2000.
[5] Nobunga, A.I., Go, B.K., Karunas, R.B. (1999) Recent demographic and injurytrends in people served by the model spinal cord injury care systems. Arch. Phys.Med. Rehabil., 80, pp. 1372-1382.
[6] A. B. Schwartz, D. M. Taylor, and S. I. H. Tillery, “Extraction algorithms for corticalcontrol of arm prosthetics,” Current Opinion in Neurobiology, Vol. 11, pp. 701-708,2001.
[7] E. C. Leuthardt, G. Schalk, D. Moran, and J. G. Ojemann, ”The emerging world ofmotor neuroprosthetics: A neurosurgical perspective,” Neurosurgery, vol. 59, pp.1-13, Jul 2006.
[8] M. A. L. Nicolelis, D. F. Dimitrov, J. M. Carmena, R. E. Crist, G. Lehew, J. D.Kralik, and S. P. Wise, “Chronic, multisite, multielectrode recordings in macaquemonkeys,” PNAS, Vol. 100, No. 19, pp. 11041 - 11046, 2003.
[9] A. P. Georgopoulos, J. T. Lurito, M. Petrides, A. B. Schwartz, and J. T. MasseyMental rotation of the neuronal population vector, Science 13 January 1989: Vol.243. no. 4888, pp. 234 - 236
[10] M.A. Lebedev, and M. A. L. Nicolelis, “Brain-machine interfaces: past, present andfuture,” Trends Neurosci 29, Vol 18, pp. 536-546, 2006
[11] Donoghue, J.P. (2002) Connecting cortex to machines: recent advances in braininterfaces. Nature Neurosci. Suppl., 5, pp. 1085-1088.
[12] M. A. L. Nicolelis, D.F. Dimitrov, J.M. Carmena, R.E. Crist, G Lehew, J. D.Kralik, and S.P. Wise, ”Chronic, multisite, multielectrode recordings in macaquemonkeys,” PNAS, vol. 100, no. 19, pp. 11041 - 11046, 2003.
[13] E. Todorov, On the role of primary motor cortex in arm movement control,InProgress in Motor Control III, ch 6, pp 125-166, Latash and Levin (eds), HumanKinetics .
153
[14] A. Georgopoulos, J. Kalaska, R. Caminiti, and J. Massey, ”On the relationsbetween the direction of two-dimensional arm movements and cell discharge inprimate motor cortex.,” Journal of Neuroscience, vol. 2, pp. 1527-1537, 1982.
[15] F. Wood, Prabhat, J. P. Donoghue, and M. J. Black. Inferring attentional state andkinematics from motor cortical firing rates . In Proceedings of the 27th Conferenceon IEEE Engineering Medicine Biologicial System
[16] S. Darmanjian, S. P. Kim, M. C. Nechyba, S. Morrison, J. Principe, J. Wessberg,and M. A. L. Nicolelis, “Bimodel Brain-Machine Interface for Motor Control ofRobotic Prosthetic,” IEEE Int. Conf. on Intelligent Robots and Systems, pp.112-116, 2003.
[17] J. C. Sanchez, J. C. Principe, and P. R. Carney, ”Is Neuron DiscriminationPreprocessing Necessary for Linear and Nonlinear Brain Machine InterfaceModels?,” accepted to 11th International Conference on Human-ComputerInteraction, vol. 5, pp. 1-5, 2005.
[18] J. DiGiovanna, J. C. Sanchez, and J. C. Principe, ”Improved Linear BMI Systemsvia Population Averaging,” presented at IEEE International Conference of theEngineering in Medicine and Biology Society, New York, pp. 1608-1611, 2006.
[19] S. P. Kim, J. C. Sanchez, D. Erdogmus, Y. N. Rao, J. C. Principe, and M. A.L.Nicolelis, “Divide-and-conquer Approach for Brain-Machine Interfaces: NonlinearMixture of Competitive Linear Models,” Neural Networks, Vol. 16, pp. 865-871,2003.
[20] J.C. Sanchez, S.-P. Kim, D. Erdogmus, Y.N. Rao, J.C. Principe, J. Wessberg,and M. Nicolelis, “Input-Output Mapping Performance of Linear and NonlinearModels for Estimating Hand Trajectories from Cortical Neuronal Firing Patterns,”International Workshop for Neural Network Signal Processing, pp. 139-148, 2002.
[21] Wu, W., Black, M. J., Gao, Y., Bienenstock, E., Serruya, M., Shaikhouni, A., andDonoghue, J. P. (2003). Neural decoding of cursor motion using a Kalman filter.Advances in Neural Information Processing Systems 15 (pp. 133-140). MIT Press.
[22] L. R. Hochberg, M. D. Serruya, G. M. Friehs, J. A. Mukand, M. Saleh, A. H.Caplan, A. Branner, D. Chen, R. D. Penn, and J. P. Donoghue, ”Neuronalensemble control of prosthetic devices by a human with tetraplegia,” Nature,vol. 442, pp. 164-171, 2006.
[23] J. DiGiovanna, B. Mahmoudi, J. Fortes, J. C. Principe, and J. C. Sanchez,”Co-adaptive Brain-Machine Interface via Reinforcement Learning,” IEEETransactions on Biomedical Engineering, in press, 2008.
[24] Jing Hu, Jennie Si, Byron Olson, Jiping He. ”A support vector brain-machineinterface for cortical control of directions.” The First IEEE/RAS-EMBS International
154
Conference on Biomedical Robotics and Biomechatronics. Pisa, Italy. February20-22, 2006, pp. 893- 898.
[25] S. Darmanjian, S. P. Kim, M. C. Nechyba, J. Principe, J. Wessberg, and M.A. L. Nicolelis, “Bimodel Brain-Machine Interface for Motor Control of RoboticProsthetic,” IEEE Machine Learning For Signal Processing, pp. 379-384, 2006.
[26] B. M. Yu, C. Kemere, G. Santhanam, A. Afshar, S. I. Ryu, T. H. Meng, M.Sahani, K. V. Shenoy, (2007) Mixture of trajectory models for neural decodingof goal-directed movements. Journal of Neurophysiology. 97:3763-3780.
[27] S. Darmanjian and J. Principe, “Boosted and Linked Mixtures of HMMs forBrain-Machine Interfaces,” EURASIP Journal on Advances in Signal Processing,vol. 2008, Article ID 216453, 12 pages doi:10.1155/2008/216453
[28] S. Darmanjian, A. R. C. Paiva, J. C. Principe, M. C. Nechyba, J. Wessberg, M. A.L. Nicolelis, and J. C. Sanchez, “Hierarchal decomposition of neural data usingboosted mixtures of independently coupled hidden markov chains,” InternationalJoint Conference on Neural Networks, pp. 89-93, 2007.
[29] G. Radons, J. D. Becker, B. Dulfer, J. Kruger. Analysis, classification, andcoding of multielectrode spike trains with hidden Markov models. Biol Cybern71: 359-373, 1994.
[30] I. Gat. Unsupervised learning of cell activities in the associative cortex of behavingmonkeys, using hidden Markov models. Master thesis, Hebrew Univ. Jerusalem(1994).
[31] C. Kemere, G. Santhanam, B. M. Yu, A. Afshar, S. I. Ryu, T. H. Meng, K. V.Shenoy (2008) Detecting neural state transitions using hidden Markov models formotor cortical prostheses. Journal of Neurophysiology. 100:2441-2452
[32] B. H. Juang, and L. R. Rabiner, “Issues in using Hidden Markov models for speechrecognition,” Advances in speech signal processing, edited by S. Furui and M.M.Sondhi, Marcel Dekker, inc., pp. 509-553, 1992
[33] Krogh, A. (1997) Two methods for improving performance of a HMM and theirapplication for gene finding In Gaasterland, T., Karp, P., Karplus, K., Ouzounis, C.,Sander, C., and Valencia, A. (Eds.), Proc. of Fifth Int. Conf. on Intelligent Systemsfor Molecular Biology pp. 179186 Menlo Park, CA. AAAI Press
[34] I. Cadez and P. Smyth, “Probabilistic Clustering using Hierarchical Models,”Technical Report No. 99-16 Department of Information and Computer ScienceUniversity of California, Irvine
[35] L. Goncalves, E. D. Bernardo and P. Perona, Movemes for Modeling BiologicalMotion Perception Book Series Theory and Decision Library Volume Volume 38Book Seeing, Thinking and Knowing Publisher Springer Netherlands
[36] M. I. Jordan, Learning in Graphical Models, MIT Press, 1999.
[37] N. G. Hatsopoulos, Q. Xu, and Y. Amit, "Encoding of Movement Fragments in the Motor Cortex," J. Neurosci., vol. 27, no. 19, pp. 5105-5114, 2007.
[38] X. Huang, A. Acero, H. W. Hon, and R. Reddy, Spoken Language Processing: A Guide to Theory, Algorithm and System Development, Prentice Hall Inc., Englewood Cliffs, NJ, 2001.
[39] R. B. Northrop, Introduction to Dynamic Modeling of Neurosensory Systems, CRC Press, Boca Raton, FL, 2001.
[40] R. P. Rao, "Bayesian computation in recurrent neural circuits," Neural Computation, vol. 16, no. 1, pp. 1-38, 2004.
[41] M. I. Jordan, Z. Ghahramani, and L. K. Saul, "Hidden Markov decision trees," in M. C. Mozer, M. I. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9, MIT Press, vol. 9, 1997.
[42] L. E. Baum, T. Petrie, G. Soules, and N. Weiss, "A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains," Ann. Mathematical Statistics, vol. 41, no. 1, pp. 164-171, 1970.
[43] S. Zhong and J. Ghosh, "HMMs and coupled HMMs for multi-channel EEG classification," in Proc. IEEE Int. Joint Conf. Neural Networks, pp. 1154-1159, May 2002.
[44] M. Brand, "Coupled hidden Markov models for modeling interacting processes," Technical Report 405, MIT Media Lab Perceptual Computing, 1997.
[45] J. Yang, Y. Xu, and C. S. Chen, "Human Action Learning Via Hidden Markov Model," IEEE Trans. Systems, Man and Cybernetics, Part A, vol. 27, no. 1, pp. 34-44, 1997.
[46] J. Kwon and K. Murphy, "Modeling freeway traffic with coupled HMMs," Technical report, University of California at Berkeley, May 2000.
[47] T. T. Kristjansson, B. J. Frey, and T. Huang, "Event-coupled hidden Markov models," in Proc. IEEE Int. Conf. on Multimedia and Exposition, vol. 1, pp. 385-388, 2000.
[48] Z. Ghahramani and M. I. Jordan, "Factorial Hidden Markov Models," Machine Learning, vol. 29, pp. 245-275, 1997.
[49] Y. Bengio and P. Frasconi, "Input-Output HMMs for sequence processing," IEEE Trans. Neural Networks, vol. 7, no. 5, pp. 1231-1249, September 1996.
[50] Y. Linde, A. Buzo, and R. M. Gray, "An Algorithm for Vector Quantizer Design," IEEE Trans. Communications, vol. COM-28, no. 1, pp. 84-95, 1980.
[51] W. T. Thach, "Correlation of neural discharge with pattern and force of muscular activity, joint position, and direction of intended next movement in motor cortex and cerebellum," Journal of Neurophysiology, vol. 41, pp. 654-676, 1978.
[52] R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee, "Boosting the margin: a new explanation for the effectiveness of voting methods," in Proc. 14th International Conference on Machine Learning, pp. 322-330, 1997.
[53] R. E. Schapire, "The strength of weak learnability," Machine Learning, vol. 5, pp. 197-227, 1990.
[54] Y. Freund and R. E. Schapire, "Experiments with a new boosting algorithm," in Machine Learning: Proceedings of the Thirteenth International Conference, pp. 148-156, 1996.
[55] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, "Adaptive mixtures of local experts," Neural Computation, vol. 3, pp. 79-87, 1991.
[56] R. Avnimelech and N. Intrator, "Boosted mixture of experts: an ensemble learning scheme," Neural Computation, vol. 11, no. 2, pp. 483-497, 1999.
[57] D. C. Knill and A. Pouget, "The Bayesian brain: the role of uncertainty in neural coding and computation," Trends in Neurosciences, vol. 27, no. 12, December 2004.
[58] G. Dornhege, Increasing Information Transfer Rates for Brain-Computer Interfacing, Ph.D. thesis, University of Potsdam, Germany.
[59] S. Haykin, Adaptive Filter Theory, Prentice Hall, Upper Saddle River, NJ, 1996.
[60] S. Haykin, Neural Networks: A Comprehensive Foundation, Macmillan, New York, NY, 1996.
[61] M. Meila and M. I. Jordan, "Learning fine motion by Markov mixtures of experts," in D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, editors, Advances in Neural Information Processing Systems 8, MIT Press, vol. 8, 1996.
[62] J. P. Donoghue and S. P. Wise, "The motor cortex of the rat: cytoarchitecture and microstimulation mapping," J. Comp. Neurol., vol. 212, pp. 76-88, 1982.
[63] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees, Wadsworth International Group, Belmont, CA, 1984.
[64] Y. Sun and J. Li, "Iterative RELIEF for feature weighting," in Proc. 23rd International Conference on Machine Learning, ACM Press, pp. 913-920, 2006.
[65] M. A. Lebedev, J. M. Carmena, J. E. O'Doherty, M. Zacksenhouse, C. S. Henriquez, J. C. Principe, and M. A. L. Nicolelis, "Cortical ensemble adaptation to represent actuators controlled by a brain machine interface," J. Neurosci., vol. 25, pp. 4681-4693, 2005.
[66] L. E. Baum and G. R. Sell, "Growth transformations for functions on manifolds," Pacific Journal of Mathematics, pp. 211-227, 1968.
[67] J. C. Sanchez, J. C. Principe, and P. R. Carney, "Is Neuron Discrimination Preprocessing Necessary for Linear and Nonlinear Brain Machine Interface Models," 11th International Conference on Human-Computer Interaction, 2005.
[68] J. Kittler, M. Hatef, R. Duin, and J. Matas, "On combining classifiers," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 20, no. 3, pp. 226-239, 1998.
[69] M. Martinez-Ramon, V. Koltchinskii, G. L. Heileman, and S. Posse, "MRI Pattern Classification Using Neuroanatomically Constrained Boosting," NeuroImage, vol. 31, no. 3, pp. 1129-1141, 2006.
[70] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," CVPR, vol. 1, pp. 511-518, 2001.
[71] A. Bastian, G. Schoner, and A. Riehle, "Preshaping and continuous evolution of motor cortical representations during movement preparation," European Journal of Neuroscience, vol. 18, no. 7, pp. 2047-2058, 2003.
[72] M. C. Nechyba, Learning and Validation of Human Control Strategies, CMU-RI-TR-98-06, Ph.D. thesis, The Robotics Institute, Carnegie Mellon University, 1998.
[73] T. Kohonen, "Self-Organized Formation of Topologically Correct Feature Maps," Biological Cybernetics, vol. 43, pp. 59-69, 1982.
[74] A. Y. Ng, M. I. Jordan, and Y. Weiss, "On spectral clustering: analysis and an algorithm," in T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, pp. 849-856, MIT Press, 2002.
[75] S. Zhong and J. Ghosh, "A Unified Framework for Model-based Clustering," Journal of Machine Learning Research, vol. 4, pp. 1001-1037, November 2003.
[76] E. P. Simoncelli, L. Paninski, J. Pillow, and O. Schwartz, "Characterization of neural responses with stochastic stimuli," in The New Cognitive Neurosciences, 3rd edition, MIT Press, 2004.
[77] Y. Wang, J. Sanchez, and J. C. Principe, "Information Theoretical Estimators of Tuning Depth and Time Delay for Motor Cortex Neurons," in Proc. 3rd International IEEE/EMBS Conference on Neural Engineering, pp. 502-505, 2007.
[78] S. Darmanjian, "Generative Neural Structure Clustering for Brain Machine Interface," Ph.D. proposal, University of Florida, Gainesville, FL, 2008.
[79] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul, "An introduction to variational methods for graphical models," in Learning in Graphical Models, MIT Press, 1998.
[80] J. Alon, S. Sclaroff, G. Kollios, and V. Pavlovic, "Discovering clusters in motion time-series data," in Proc. 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. I-375-I-381, 2003.
[81] A. Ypma and T. Heskes, "Categorization of web pages and user clustering with mixtures of hidden Markov models," in Proceedings of the International Workshop on Web Knowledge Discovery and Data Mining, Edmonton, Canada, pp. 31-43, 2002.
[82] G. S. Fishman, Monte Carlo: Concepts, Algorithms, and Applications, Springer-Verlag, 1995.
BIOGRAPHICAL SKETCH
Shalom Darmanjian graduated from the University of Florida with a Bachelor of
Science in Computer Engineering in December 2003. After completing his master's degree in
May 2005, Shalom continued the pursuit of knowledge and moved to the CNEL lab for
the Ph.D. program during the fall of 2005. He received his Ph.D. from the University of
Florida in the fall of 2009. Shalom hopes to continue doing his small part in improving
the world.