DESIGN AND ANALYSIS OF GENERATIVE MODELS FOR
BRAIN MACHINE INTERFACES
By
SHALOM DARMANJIAN
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2009
© 2009 Shalom Darmanjian
This dissertation is dedicated to my family.
ACKNOWLEDGMENTS
Although I still have far to go, I would not even be 1/100th my current distance
without my adviser Dr. Principe. His passion for knowledge inspires me to constantly
improve and learn. I have grown as a student, researcher, and person because of him.
These small words cannot express the large debt of gratitude I owe him.
Thank you to my committee members for their guidance and patience: Dr. Harris,
Dr. Rangarajan, Dr. Sanchez. I especially thank Dr. Slatton for his time under the
circumstances. I truly wish you and your family well.
I am also grateful for the environment Dr. Principe has fostered in CNEL. The
CNEL students past and present have provided great opportunities for discussions,
laughter and growth. Although there are many students in CNEL that have impacted
me, Jeremy Anderson has been there since undergrad taking the ride with me (ups and
downs). I also appreciate the four musketeers along the BMI ride with me: Dr. Antonio
Paiva, Dr. Aysegul Gunduz, Dr. Yiwen Wang and Dr. Jack DiGiovanna. Thank you all
for the helpful discussions and collaboration through the years. Thanks to the new batch
of CNEL students for their discussions and laughter, Sohan Seth, Alex Singh, Erion
Hasanbelliu, Luis Giraldo, Memming Park. Thank you to Julie for keeping CNEL running
smoothly. Thank you also to Marcus (even if he is a republican). Thank you to Shannon
for years of help and advice with the graduate department. A special thanks to a
longtime enemy Giovanni "eleven cents!" Montrone for years of constructive pessimism
and encouraging words. He has been there since the beginning and hopefully till the
end.
The years during my PhD were also brightened by some special ladies, Sarah,
Melissa, and Grisel. Whether opening my eyes to vegetarian dishes, tattoos or Las
Vegas Casinos, I appreciate the time we spent together. You helped to lighten my stress
and expose me to different worlds. Thank you.
Finally, to my sister and nephew. Your love, support and sacrifice kept me going. I
am truly indebted and will always be there for you. You mean everything to me and I’m
very happy to call you my family.
TABLE OF CONTENTS
page
ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
CHAPTER
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.1 Overview of BMIs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.2 Experimental Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.2.1 Monkey Food Grasping Task . . . . . . . . . . . . . . . . . . . . . 17
1.2.2 Monkey Cursor Control . . . . . . . . . . . . . . . . . . . . . . . . 18
1.2.3 Rat Lever Experiments . . . . . . . . . . . . . . . . . . . . . . . . 19
1.3 Review of Modeling Paradigms for BMIs . . . . . . . . . . . . . . . . . . 20
1.4 Dissertation Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2 GENERATIVE MODELS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.3 Background on Graphical Models . . . . . . . . . . . . . . . . . . . . . . 28
2.3.1 Hidden Markov Models . . . . . . . . . . . . . . . . . . . . . . . . 30
2.3.2 Moving Beyond Simple HMMs . . . . . . . . . . . . . . . . . . . . 32
3 BRAIN MACHINE INTERFACE MODELING: THEORETICAL . . . . . . . . . . 35
3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2 Independently Coupled HMMs . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3 Boosted Mixtures of HMMs . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.1 Boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.2 Modeling Framework . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4 Linked Mixtures of HMMs . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.4.1 Modeling Framework . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.4.2 Training with Expectation Maximization . . . . . . . . . . . . . . . 48
3.4.3 Updating Variational Parameter . . . . . . . . . . . . . . . . . . . 51
3.5 Dependently Coupled HMMs . . . . . . . . . . . . . . . . . . . . . . . . 52
3.5.1 Modeling Framework . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.5.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4 BRAIN MACHINE INTERFACE MODELING: RESULTS . . . . . . . . . . . . . 57
4.1 Monkey Food Grasping Task . . . . . . . . . . . . . . . . . . . . . . . . 59
4.1.1 Boosted and Linked Mixture of HMMs . . . . . . . . . . . . . . . . 59
4.1.2 Dependently Coupled HMMs . . . . . . . . . . . . . . . . . . . . . 64
4.2 Rat Single Lever Task . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.3 Monkey Cursor Control Task . . . . . . . . . . . . . . . . . . . . . . . . 72
4.3.1 Population Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.3.1.1 A-priori class labeling based on population vectors . . . . 74
4.3.1.2 Simple naive classifiers . . . . . . . . . . . . . . . . . . . 75
4.3.2 Results for the Cursor Control Monkey Experiment . . . . . . . . . 76
5 GENERATIVE CLUSTERING . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.1 Generative Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.2 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.2.1 Simulated Data Generation . . . . . . . . . . . . . . . . . . . . . . 86
5.2.2 Independent Neural Simulation Results . . . . . . . . . . . . . . . 90
5.2.3 Dependent Neural Simulation Results . . . . . . . . . . . . . . . . 111
5.3 Experimental Animal Data . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.3.1 Rat Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.3.2 Monkey Experiments . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.3.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
6 CONCLUSION AND FUTURE WORK . . . . . . . . . . . . . . . . . . . . . . . 131
6.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
6.1.1 Towards Clustering Model Structures . . . . . . . . . . . . . . . . 135
6.1.2 Preliminary Results . . . . . . . . . . . . . . . . . . . . . . . . . . 137
6.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
A WIENER FILTER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
B PARTIAL DERIVATIVES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
C SELF ORGANIZING MAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
BIOGRAPHICAL SKETCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
LIST OF TABLES
Table page
3-1 Classification performance of example single-channel HMM chains . . . . . . . 36
4-1 Classification results (BM-HMM selected channels) . . . . . . . . . . . . . . . . 59
4-2 Classification results (LM-HMM selected channels) . . . . . . . . . . . . . . . . 59
4-3 Classification results (random BM-HMM selected channels) . . . . . . . . . . . 62
4-4 Classification results (random LM-HMM selected channels) . . . . . . . . . . . 62
4-5 Correlation coefficient using DC-HMM on 3D monkey data . . . . . . . . . . . . 65
4-6 NMSE on 3D monkey data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4-7 Classification results (BM-HMM selected channels) . . . . . . . . . . . . . . . . 69
4-8 Classification results (LM-HMM selected channels) . . . . . . . . . . . . . . . . 69
4-9 Classification results (random BM-HMM selected channels) . . . . . . . . . . . 71
4-10 Classification results (random LM-HMM selected channels) . . . . . . . . . . . 71
4-11 Correlation coefficient using different BMI models . . . . . . . . . . . . . . . . . 78
4-12 Correlation coefficient using DC-HMM on 2D monkey data . . . . . . . . . . . . 80
5-1 Correlation coefficient using LM-HMM on 2D monkey data . . . . . . . . . . . . 123
5-2 Correlation coefficient using DC-HMM on 3D monkey data . . . . . . . . . . . . 124
5-3 Correlation coefficient using DC-HMM on cursor control data . . . . . . . . . . 126
6-1 Correlation coefficient using DC-HMM on 3D monkey data . . . . . . . . . . . . 139
6-2 Correlation coefficient using DC-HMM on 2D monkey data . . . . . . . . . . . . 142
LIST OF FIGURES
Figure page
1-1 BMI overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1-2 Discretized example of continuous trajectory . . . . . . . . . . . . . . . . . . 19
2-1 Various HMM structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3-1 Probabilistic ratios of 14 neurons . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3-2 Zoomed in version of the probabilistic ratios . . . . . . . . . . . . . . . . . . . . 37
3-3 IC-HMM graphical model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3-4 LM-HMM graphical model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3-5 DC-HMM trellis structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4-1 Multiple model methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4-2 Correlation coefficients between channels (monkey moving) . . . . . . . . . . . 60
4-3 Correlation coefficients between channels (monkey at rest) . . . . . . . . . . . 61
4-4 Correlation coefficient between channels (randomly selected for monkey) . . . 62
4-5 Monkey expert adding experiment . . . . . . . . . . . . . . . . . . . . . . . . . 63
4-6 Parallel peri-event histogram for monkey neural data . . . . . . . . . . . . . . . 64
4-7 Supervised monkey food grasping task reconstruction (position) . . . . . . . . 66
4-8 3D monkey food grasping true trajectory . . . . . . . . . . . . . . . . . . . . . . 67
4-9 Hidden state space transitions between neural channels (for move and rest) . . 67
4-10 Coupling coefficient between neural channels (3D monkey experiment) . . . . 68
4-11 Parallel peri-event histogram for rat neural data . . . . . . . . . . . . . . . . . . 70
4-12 Rat expert adding experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4-13 Neural tuning depth of four simulated neurons . . . . . . . . . . . . . . . . . . . 73
4-14 Histogram of 30 angular velocity bins . . . . . . . . . . . . . . . . . . . . . . . . 74
4-15 2D angular velocities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4-16 A. Parallel tuning curves B. Winning neurons for particular angles . . . . . . . . 76
4-17 Histogram of 10 angular bins . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4-18 A. Parallel tuning curves B. Winning neurons for particular angles . . . . . . . 77
4-19 True trajectory and reconstructed trajectory (DC-HMM) . . . . . . . . . . . . . 81
4-20 Hidden state transitions per class (cursor control monkey experiment) . . . . . 81
4-21 Coupling coefficient between neurons per class (cursor control monkey experiment) 82
5-1 Bipartite graph of exemplars (x) and models . . . . . . . . . . . . . . . . . . . . 85
5-2 Neural tuning depth of four simulated neurons . . . . . . . . . . . . . . . . . . . 89
5-3 LM-HMM cluster iterations (two classes, k=2) . . . . . . . . . . . . . . . . . . . 90
5-4 Tuning preference for two classes (initialized) . . . . . . . . . . . . . . . . . . . 91
5-5 Tuned classes after clustering (two classes) . . . . . . . . . . . . . . . . . . . . 92
5-6 LM-HMM cluster iterations (four classes, k=4) . . . . . . . . . . . . . . . . . . . 93
5-7 Tuning preference for four classes (initialized) . . . . . . . . . . . . . . . . . . . 94
5-8 Tuned classes after clustering (four classes) . . . . . . . . . . . . . . . . . . . . 95
5-9 LM-HMM cluster iterations (two classes, k=4) . . . . . . . . . . . . . . . . . . . 96
5-10 Classification degradation with increased random firings . . . . . . . . . . . . . 96
5-11 Neural tuning depth with high random firing rate . . . . . . . . . . . . . . . . . 97
5-12 Surrogate data set destroying spatial information . . . . . . . . . . . . . . . . . 98
5-13 Tuned preference after clustering (spatial surrogate) . . . . . . . . . . . . . . . 98
5-14 Surrogate data set destroying temporal information . . . . . . . . . . . . . . . . 99
5-15 Tuned preference after clustering (temporal surrogate) . . . . . . . . . . . . . . 99
5-16 DC-HMM clustering results (class=2, K=2) . . . . . . . . . . . . . . . . . . . . . 100
5-17 DC-HMM clustering hidden state transitions (class=2, K=2) . . . . . . . . . . . 101
5-18 DC-HMM clustering coupling coefficient (class=2, K=2) . . . . . . . . . . . . . 101
5-19 DC-HMM clustering log-likelihood reduction during each round (class=2, K=2) 102
5-20 DC-HMM clustering simulated neurons (class=4, K=4) . . . . . . . . . . . . . . 103
5-21 DC-HMM clustering hidden state space transitions between neurons . . . . . . 104
5-22 DC-HMM clustering coupling coefficient between neurons (per class) . . . . . 105
5-23 SOM clustering on independent neural data (2 classes) . . . . . . . . . . . . . 107
5-24 SOM clustering on independent neural data with noise (2 classes) . . . . . . . 108
5-25 SOM clustering on independent neural data, spatial surrogate (2 classes) . . . 108
5-26 Neural selection by SOM on spatial surrogate data (2 classes) . . . . . . . . . 109
5-27 SOM clustering on independent neural data, temporal surrogate (2 classes) . . 110
5-28 Output from four simulated dependent neurons with 100 noise channels (class=2) 111
5-29 Neural tuning for dependent neuron simulation . . . . . . . . . . . . . . . . . . 112
5-30 LM-HMM clustering simulated dependent neurons (class=2, K=2) . . . . . . . 113
5-31 DC-HMM clustering simulated dependent neurons (class=2, K=2) . . . . . . . 113
5-32 SOM clustering on dependent neural data (2 classes) . . . . . . . . . . . . . . 114
5-33 Rat clustering experiment, one lever, two classes . . . . . . . . . . . . . . . . . 116
5-34 Rat clustering experiment zoomed, one lever, two classes . . . . . . . . . . . . 117
5-35 Rat clustering experiment, two lever, two classes . . . . . . . . . . . . . . . . . 117
5-36 Rat clustering experiment, two lever, three classes . . . . . . . . . . . . . . . . 118
5-37 Rat clustering experiment, two lever, four classes . . . . . . . . . . . . . . . . . 119
5-38 LM-HMM cluster iterations (Ivy 2D dataset, k=4) . . . . . . . . . . . . . . . . . 121
5-39 Reconstruction using unsupervised LM-HMM clusters (blue) vs. real trajectory (red) 122
5-40 DC-HMM clustering on monkey food grasping task (2 classes) . . . . . . . . . 125
5-41 Coupling coefficient from DC-HMM clustering on monkey food grasping task (2 classes) 125
5-42 Coupling coefficient from DC-HMM clustering on monkey cursor control task (4 classes) 127
5-43 Average firing rate per class (4 classes, 6 neurons) . . . . . . . . . . . . . . . 127
5-44 Average velocity per class (4 classes, 6 neurons) . . . . . . . . . . . . . . . . 128
6-1 Bipartite graph of exemplars (x) and models . . . . . . . . . . . . . . . . . . . . 134
6-2 Hidden state transitions DC-HMM (simulation data, 2 classes) . . . . . . . . . 138
6-3 Hidden state transitions DC-HMM (simulation data, 2 classes) . . . . . . . . . 139
6-4 Histogram of state models for the DC-HMM (food grasping task) . . . . . . . . 140
6-5 State models for the DC-HMM (food grasping task) . . . . . . . . . . . . . . . 141
6-6 Alphas computed per state per channel DC-HMM (food grasping task) . . . . 141
6-7 Alphas across state-space of the DC-HMM (cursor control task) . . . . . . . . 142
A-1 Topology of the linear filter for three output variables . . . . . . . . . . . . . . . 147
C-1 Self-Organizing-Map architecture with 2D output . . . . . . . . . . . . . . . . . 151
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
DESIGN AND ANALYSIS OF GENERATIVE MODELS FOR
BRAIN MACHINE INTERFACES
By
Shalom Darmanjian
December 2009
Chair: Jose Principe
Major: Electrical and Computer Engineering
Brain machine interfaces (BMIs) have the potential to restore movement to patients
experiencing paralysis. Although great progress has been made towards BMIs, there
is still much work to be done. This dissertation addresses some of the problems
associated with the signal-processing side of BMIs.

Since neural communication within the brain is still not fully understood, probabilistic
modeling is argued to be the best approach for BMIs. Specifically, generative models
with hidden variables are proposed to help model the multiple interacting processes
(both hidden and observable). Some of the advantages of the generative models over
conventional BMI signal-processing algorithms are also confirmed, including the
modeling of inhibited neurons and the ability to partition the neural input space. The
partitioning of the neural input space is based on the hypothesis that animals transition
between neural state structures during goal seeking. These neural structures are
analogous to the motion primitives, or 'movemes', exhibited in the kinematics. This
leads to a paradigm shift similar to a divide-and-conquer methodology, but with
generative models. The generative models are also used to cluster the neural input
space, which is appropriate since desired kinematic data is not available from paralyzed
patients. Most BMI algorithms ignore this very important point.
The results are justified by the improvement in trajectory reconstruction.
Specifically, the correlation coefficient of the trajectory reconstruction serves as a
metric for comparison against other BMI methods. Additionally, simulations are used to
show the models' ability to cluster unknown data with underlying dependencies; this is
necessary since there is no ground truth in real neural data.
CHAPTER 1
INTRODUCTION
Humans learn motor control by physically interacting with the external world. After
learning this control, simple physical tasks are often taken for granted in everyday
life. Drinking a cup of coffee or eating follows automatically from our desires without
consciously planning these simple physical tasks. Additionally, this unique ability to
translate desires into physical movements underlies tool utilization, which separates our
species from other animals. One theory even postulates that without the ability to touch
for socialization, our species would not be on the current evolutionary path [1, 2].
During even simple movements, the human central nervous system must translate
the generated firings of millions of neurons while also communicating with the peripheral
nervous system [3]. This complex biological system also continuously manages
chemical and electrical information from the different cortices and paleocortex structures
to control unconscious and conscious actions taken with the body [3]. The brain must
handle all of these actions while continuously processing visual, tactile and other internal
sensory feedback [3, 4].
Unfortunately, thousands of people have suffered tragic accidents or debilitating
diseases that have either partially or fully removed their ability to interact effectively with
the external world [5]. Some devices exist to aid these patients, but they often fall short
of what is needed to live a normal life. Essentially, the idea behind motor Brain Machine
Interfaces (BMIs) is to bridge the gap between the brain and the external world to
provide these patients with effective interaction with the world.
1.1 Overview of BMIs
A BMI is a system that directly retrieves neuronal firing patterns from dozens to
hundreds of neurons in the brain and then translates this information into desired actions
in the external world. The level of invasiveness in acquiring these firing patterns is
directly related to the level of resolution provided by the recording methodology. An
Figure 1-1. BMI overview
electroencephalographic (EEG) system is one method that coarsely records from multiple
neurons through the scalp (non-invasive), while electrocorticographic (ECoG) grids and
microelectrode arrays, although invasive, provide a finer resolution of individual neurons
(or single units). In turn, this finer resolution has allowed for significant advances in the
field of BMIs to take place recently [6, 7]. Consequently, Microelectrode Array data is the
only type of data used throughout this dissertation.
In order to acquire Microelectrode Array Data, multiple electrode grid arrays
are chronically implanted in one or more cortices [8] and record analog voltages
from multiple neurons near each individual electrode. The neural signals (i.e. analog
voltages) then progress through three processing steps (in a typical BMI). First, the
amplified analog voltages recorded from one or more neurons are digitally converted
and passed to a spike detecting/sorting algorithm. A spike detecting and sorting
algorithm essentially identifies if a particular neuron exhibited a firing/voltage pattern
on a corresponding electrode (thereby classifying it as a spike). During the second BMI
processing step, the identified discrete spikes are processed by a signal processing
algorithm [8]. Finally, subsequent trajectory/lever calculations are sent to a robot arm or
display device. All of these processing steps occur as an animal engages in a behavioral
experiment (lever press, food grasping, finger tracing, or joystick control).
This dissertation focuses on the signal processing of the neural spikes (the second
processing step described above). Early approaches to modeling BMIs used simple
population vectors, Wiener filters and artificial neural networks [8, 9] between the neural
data and arm trajectory to learn a functional relationship during training. During testing,
only the neural data is used to reconstruct the trajectory the animal intended.
Most of these modeling approaches use binned spikes, commonly referred to as rate
coding, as the model input. The neurons are usually assumed to be independent for
modeling purposes, and the encoding methodology is often not even considered [8].
Please see [10, 11] to gain a more detailed understanding
of what encompasses a BMI.
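As a concrete sketch of this rate-coding front end, spike timestamps can be binned into firing counts per channel. The function below is illustrative (not from this dissertation), using the 100ms bin width adopted throughout:

```python
import numpy as np

def bin_spikes(spike_times, n_channels, duration_s, bin_ms=100):
    """Convert per-channel spike timestamps (in seconds) into a
    time-bins x channels matrix of firing counts (rate coding)."""
    n_bins = int(duration_s * 1000 / bin_ms)
    counts = np.zeros((n_bins, n_channels), dtype=int)
    for ch, times in enumerate(spike_times):
        # map each spike time to its 100ms bin index
        idx = (np.asarray(times) * 1000 // bin_ms).astype(int)
        idx = idx[idx < n_bins]
        np.add.at(counts, (idx, ch), 1)
    return counts

# two channels, 1 second of data, 100ms bins
spikes = [[0.01, 0.02, 0.45], [0.30, 0.31, 0.32, 0.95]]
X = bin_spikes(spikes, n_channels=2, duration_s=1.0)
print(X.shape)   # (10, 2)
```

The resulting matrix has the same time-bins x channels shape as the data sets described in Section 1.2 (e.g., 23000x104 for the food grasping task).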
1.2 Experimental Data
This section discusses the type of neural data and kinematic data that will be used
throughout the dissertation. Consequently, the properties of the data help to determine
the appropriate type of models to use.
1.2.1 Monkey Food Grasping Task
For this experiment, an owl monkey uses the right hand to grasp food from four
locations on a tray and then brings the food to its mouth. The recorded neural data is a
sparse time series of discrete firing counts that were extracted from the dorsal premotor
cortex (PMd), primary motor cortex (MI), and posterior parietal cortex (PP) [4, 12]. Each
firing count represents the number of neural firings in a 100ms span of time, which
is consistent with methods used within the neurological community [6, 13, 14]. The
corresponding kinematic data (i.e. hand location) is down-sampled to 10 Hz to match
the 100ms neural spike bins. This particular monkey data set contains
104 neural channels recorded for 38.33 minutes. This recording corresponds to a
dataset of 23000x104 time bins.
Within the neurological community, the question of whether the motor cortex
encodes the arm’s velocity, position, or other kinematic encodings (joint angle, muscle
activity, muscle synergy, etc), continues to be debated [6, 8, 13]. With this particular
experimental data, the monkey’s arm is motionless in space for a brief amount of time
while reaching for food or placing food in its mouth. During this time of "active holding"
it is unknown if the brain is encoding information to contract the muscles in the holding
position.
Our work as well as other research has shown this active holding encoding is likely
[15]. Due to this belief, the active holding is included as part of the movement class
for the classifiers in this dissertation [16]. An example of this type of data is shown in
Figure 1-2, along with the superimposed gray-scale colors representing the monkey’s
arm movement on the three Cartesian coordinate axes (x, y, and z). Note that the
movement and rest classes are labeled by hand from the 10Hz (100ms) trajectory data.
1.2.2 Monkey Cursor Control
During this experiment, an adult Macaca mulatta monkey performs a manipulandum
behavioral task (cursor control) [4, 8]. The monkey used a hand-held manipulandum
(joystick) to move the cursor (smaller circle) so that it intersects the target. Upon
intersecting the target with the cursor, the monkey received a juice reward. While the
monkey performed the motor task, the hand position and velocity for each coordinate
direction (X and Y ) were recorded in real time along with the corresponding neural
activity. Using microwire electrode arrays chronically implanted in the dorsal premotor
cortex (PMd), supplementary motor area (SMA), primary motor cortex (M1, both
hemispheres) and primary somatosensory cortex (S1), the firing times of up to 185 cells
were simultaneously collected.
Figure 1-2. Discretized example of continuous trajectory
For the monkey neural data, each firing count again represents the number of
neural firings in a 100ms span of time. This particular monkey data set contains
185 neural channels recorded for 43.33 minutes. The subsequent time recording
corresponds to a dataset of 26000x185 time bins.
1.2.3 Rat Lever Experiments
There are two rat data sets used for the rat lever experiments. For the "single-lever"
data set, thirty-two microwire electrodes were implanted unilaterally in the forelimb
region of the primary motor cortex of a male Sprague-Dawley rat [17]. The task requires
the rat to press a single lever for a minimum of 0.5s to achieve a water reward once an LED
visual stimulus is observed. Essentially this go-no-go experiment includes neural data
along with lever presses. This particular data set contains 16 neurons yielding 13000x16
time bins, with each time bin being the spike-sorted count per 100ms for a particular
neuron (sixteen in this data set).
For the "two-lever" rat data set, two 16-microelectrode arrays were implanted in the
forelimb regions of each hemisphere [18]. The task requires the rat to press one of two levers
for a minimum of 0.5s to achieve a water reward (also after an LED visual stimulus).
Essentially this go-no-go experiment also provides neural data along with lever presses.
This particular data set contains 42 neurons yielding 19000x42 time bins (100ms each).
For both rat experiments, the rat is free to move around the cage, and this movement is
not measured.
1.3 Review of Modeling Paradigms for BMIs
The modeling approaches to BMIs divide into three categories: supervised,
co-adaptive, and unsupervised, with the majority of BMI modeling algorithms being
supervised. Additionally, the supervised algorithms further split into linear
modeling, non-linear modeling, and state-space (or generative) modeling, with the
majority of BMI algorithms falling under supervised linear modeling.
Supervised linear modeling is traceable to the 1980s and 1990s, when neural action
potential recordings were taking place with multi-electrode arrays [6, 8, 9]. Essentially,
the action potentials (called spikes for short), collected with micro-electrode arrays,
are sorted by neuron, counted in time windows (called bins), and fed into a linear
model (Wiener filter) with a predefined tap delay line depending on the experiment
and experimenter [19, 20]. During training, the linear model has access to the desired
kinematic data, as a desired response, along with the neural data recording. Once
a functional mapping is learned between the neural data and kinematic training set,
for testing the linear model is only provided neural data in which the kinematic data is
reconstructed [19].
The Wiener filter is exploited in two ways in this dissertation. First, it serves as a
baseline linear classifier (with a threshold) against which the discussed models are compared.
Second, Wiener filters are used to reconstruct the trajectories from neural data switched
by the generative models. Since the Wiener filter is critical to this dissertation and
is at the core of many BMI systems today, its details are presented in Appendix A.
Along with the supervised linear modeling, researchers have also engaged in
non-linear supervised learning algorithms for BMIs [8, 20]. Very similar to the paradigm
of the linear modeling, the neural data and kinematic data are fed to a non-linear model
that subsequently finds the relationship between the desired kinematic data and the
neural data. During testing, only neural data is provided to the models in order to produce
kinematic reconstructions.
With respect to state-space models, Kalman filters have been used to reconstruct
the trajectory of a behaving monkey’s hand [21]. Specifically, a generative model is
used for encoding the kinematic state of the hand. For decoding, the algorithm predicts
the state estimates of the hand and then updates this estimate with new neural data
to produce an a posteriori state estimate. Our group at UF and others found that
the reconstruction was slightly smoother than what input-output models were able to
produce [21].
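The predict/update cycle described above can be sketched as a standard Kalman recursion; the matrices, dimensions, and toy tracking data below are illustrative and are not the encoding model of [21]:

```python
import numpy as np

def kalman_decode(Y, A, H, Q, R, x0, P0):
    """One forward pass of Kalman predict/update: A drives the
    kinematic state, H maps the state to the expected observation."""
    x, P = x0, P0
    states = []
    for y in Y:
        # predict the kinematic state forward in time
        x = A @ x
        P = A @ P @ A.T + Q
        # update the prediction with the new observation
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ (y - H @ x)
        P = (np.eye(len(x)) - K @ H) @ P
        states.append(x.copy())
    return np.array(states)

# toy position-velocity tracking problem (dimensions are illustrative)
A = np.array([[1.0, 0.1], [0.0, 1.0]])   # constant-velocity dynamics
H = np.array([[1.0, 0.0]])               # observe a position-like signal
Q, R = 0.01 * np.eye(2), np.array([[0.1]])
truth = np.cumsum(0.1 * np.ones(100))
Y = truth[:, None] + 0.1 * np.random.default_rng(1).normal(size=(100, 1))
est = kalman_decode(Y, A, H, Q, R, x0=np.zeros(2), P0=np.eye(2))
print(est.shape)   # (100, 2)
```

The recursive smoothing of the posterior is what produces the slightly smoother reconstructions noted above.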
Unfortunately, there are problems with all of the aforementioned models. First,
training with desired data is problematic since paralyzed patients are not able to provide
kinematic data. Second, these models do not capture neurons that fire infrequently
during movement. These neurons are known to exist in the brain and can provide useful
information. But the information is lost with feed-forward filters since a neuron that fires
very little will receive less weighting. Third, there are millions of other neurons not being
recorded or modeled. Incorporating this missing information into the model would be
beneficial. Lastly, all of these models must generalize over a wide range of movements.
Normally, generalization is good for a model of the same task, but generalization across
tasks is problematic. For example, a data set consisting of 3D food grasping movements
and circular 2D movements would require the model to generalize over both kinematic
sets (producing poorer results).
The lack of a desired signal therefore necessitates unsupervised or
co-adaptive solutions. Recent co-adaptive solutions have relied on the test subject to
train their own brain for goal oriented tasks. One researcher even uses technicians to
supply an artificial desired signal as the input-output models learn a functional mapping
[22]. DiGiovanna et al. used reinforcement learning to co-adapt their model and the rat's
behavior for goal oriented tasks [23].
With respect to unsupervised learning for BMIs, Si et al. use PCA as a feature
extraction preprocessing for SVM and Bayesian classifiers [24]. Their main goal is to
classify actions which move the rat (on a mechanical cart) towards a reward location.
Other research demonstrated that neural state structures exist as animal subjects
engage in movement [25]. This research also demonstrated that by finding partitions
in the neural input space that correspond to specific motor states (hand moving or
at rest), trajectory reconstruction is improved when the models are constructed with
homogeneous data from only one partition [26]. Specifically, a ’switching’ generative
model partitions the animal’s neural firings and corresponding arm movements into
different motion primitives [25]. This improvement is primarily due to the ability of
the generative models to distinguish between neural states, thereby allowing
the continuous filters to specialize in a particular part of the input space rather than
generalize over the full space. The work also demonstrated that modeling is improved
when spatial dependencies are exploited between neural channels [27, 28].
Other generative-model work on BMIs (including graphical models like HMMs) [29–31]
exploits hidden state variables to decode the states occupied or transitioned by the behaving
animal. Specifically, Shenoy et al. found that HMMs can provide representations of
movement preparation and execution in the hidden state sequences [31]. Unfortunately,
this work also requires supervision, or training data that must first be divided
by human intervention (i.e., a user). Therefore, to move beyond supervised data, which is
not acquirable from paraplegics, a clustering algorithm is necessary.
Although there is little BMI research into clustering neural data with graphical
models, there have been efforts to use HMMs for clustering. The use of HMMs for
clustering appears to have first been mentioned by Juang and Rabiner [32] and
subsequently used in the context of discovering subfamilies of protein sequences by
Krogh et al. [33]. Other clustering work focused on single HMM chains for understanding
the transition matrices [34].
The work described in this dissertation moves significantly beyond prior work in two
ways. First, the clustering model finds unsupervised hierarchical dependencies between
HMM chains (one per neuron) while also clustering the neural data. Essentially, the algorithm
jointly refines the model parameters and structures as the clustering iterations occur.
The hope is that the clustering methodology will serve as a front end for a goal-oriented
BMI (with the clusters representing a specific goal, like 'forward') or for a co-adaptive
algorithm that needs reliable clustering of the neural input. Second, the paradigm is
changed to encompass a multiple-model approach to improve performance.
1.4 Dissertation Objectives
The dissertation focuses on finding neural state structures, also called neural
assemblies, which exist as humans/animals engage in movement. Although there has
been work to decompose kinematics into elemental components like motion primitives or
’movemes’, there has been little work on finding the reciprocal sub-structures in neural
data [13, 35]. Since there are no known ground-truths for these types of structures
or definitive partitions, secondary evidence must be provided. Specifically, we argue
that the best way to show that our methodologies discover these beneficial
structures is by how much trajectory prediction improves. Trajectory improvement
is accomplished by first using our models like a ’switch’ to partition the primate’s neural
firings and corresponding arm movements into different primitives. Similar to a divide
and conquer strategy, by switching or delegating these isolated neural/trajectory data
to different local linear models, prediction of final kinematic trajectories is markedly
improved. The models will also address the broad spectrum of neural dependencies,
from independent to implicit and explicit dependencies. After establishing the benefit of
partitioning the input with supervised generative models, two unsupervised clustering
strategies are presented. The first methodology clusters the data samples using the
likelihood as a distance metric and refines the model parameters as if they are centroids
similar to k-means. The second methodology clusters the actual parameters of the
models for the neural state structures in order to refine the parameters. The hypothesis
is that by observing the neural state evolution, neurophysiologic understanding can be
gained about the neural interactions. Simulated neural data will also be used on the
models in order to control and observe different aspects of the clustering results.
Chapter 2 provides background on graphical models since they form the core
framework of the dissertation. The chapter will provide details of related work on these
generative probabilistic models. The chapter will also cover in detail Hidden Markov
Models (HMMs) since they are one particular graphical model used for modeling in this
dissertation.
Chapter 3 discusses the hierarchical decomposition of neural data and outlines some
models that capture its temporal and spatial structure. In particular, the chapter details the
independently coupled Hidden Markov Model (IC-HMM) and how it models neural data.
The chapter also discusses the Boosted Mixture of HMMs (BM-HMM) and the
implicit hierarchical relationship between neural channels that this model forms.
Then the Linked Mixture of HMMs (LM-HMM) is discussed, along with how this model explicitly develops
dependencies between neural channels. Finally, the chapter ends with a presentation
of the Dependently Coupled HMMs (DC-HMM) model. This model explicitly models
dependencies through time across the channels.
Chapter 4 will present the results from the broad spectrum of models discussed in
Chapter 3, using real neural data from the three animal experiments (discussed earlier).
The results are compared against other models by using the correlation coefficient and
classification performance.
Chapter 5 covers a model based clustering methodology using the LM-HMM and
DC-HMM models. Real neural data as well as simulated neural data will be used to
test the clustering methodology. Chapter 6 presents the conclusion and includes a
discussion on using the neural state transitions as features for another unsupervised
clustering methodology. Future work will be postulated as well as preliminary results
from this neurophysiologic clustering perspective.
CHAPTER 2
GENERATIVE MODELS
2.1 Motivation
As discussed in Chapter 1, many BMI researchers use linear and non-linear
models to find a mapping between desired kinematic data (exhibited by a behaving
animal) and neural data. Unfortunately, the reality of this type of experiment is that the
mapping of a single individual will not correspond to the mapping of another individual
[3]. Additionally, the patients that are involved in BMI research are paralyzed, which
prevents kinematic data from being recorded during the training of a model. In order
to provide paralyzed patients a direct ability to interact with the external world, the BMI
solution will need to extract as much information as possible from the neural input space.
Unfortunately, the neural input space presents many problems. First, only a
few neurons are being sampled from millions. Second, information about the physical
connectivity among these sampled neurons is not available with current technology
[6, 8]. Although some histological studies can be done to stain for the type of neurons
acquired, they do not provide information about their inter-communication (and animals
are euthanized for these studies) [6]. Third, the sample of neurons acquired in one
experiment will not be the same neurons nor represent the same motor functions in
another experiment with different patients [3]. Additionally, some neurons in the sample
may not even contribute to the task being modeled.
Given this absence of information, a probabilistic approach is the best way
to model what is observable from the brain. Modeling the unknown hidden neural
information is accomplished with observable and hidden random processes that are
interacting with each other. Specifically, we make the assumption that each neuron’s
output is an observable random process that is affected by hidden information. Since
the experiment does not provide detailed biological information about the interactions
between the sampled neurons, hidden variables are used to model these hidden
interactions [3, 6]. We further assume that the compositional representation of the
interacting processes occurs through space and time (i.e. between different neurons
at different times). Graphical models are the best way to model and observe this
interaction between variables in space and time [36]. Another benefit of a state-space
generative model over traditional filters is that neurons that fire less during certain
movements can be modeled simply as another state rather than receiving a low filter weight.
Neuroscientists often treat the multiple channels of neural data acquired from BMI
experiments as multivariate observations from a single process [10]. This perspective
requires fully coupled statistics across all of the channels at all times irrespective of
partial independence among the multiple processes. Our group’s own work has shown
that modeling this data as a single multivariate process is not the most appropriate [16].
The models described within this dissertation also differ from other work in the BMI field
since they do not take the traditional regression approach of mapping the neural data directly
to the patient's hand kinematics with a conventional linear/non-linear model [10]. Instead,
generative models are used to divide the input space into regions that represent cell
assemblies, which are referred to throughout the dissertation as neural state structures.
These structures have been loosely touched upon in other work [19, 37]. The basic
idea is that kinematic information or neural data is decomposed into ”motion primitives”
similar to phonemes in speech processing.
In speech processing, graphical models (specifically HMMs) are the leading
technology because they are able to capture very well the piecewise non-stationarity
of speech [32, 38]. Since speech production is ultimately a motor function, graphical
models are potentially useful for motor BMIs (also non-stationary) [32, 39]. By using
smaller simpler components, more complicated arm kinematics are constructed
through the combination of these simple structures. We hypothesize that there are
decomposable structures in the neural input space that are analogous to the primitives in the
kinematic space.
The ultimate goal is to decipher these underlying structures so that the model may
one day be decoupled from the desired kinematics (for unsupervised modeling). The
goal of Chapter 4 is to determine these underlying structures through a supervised
mode in order to later exploit them in an unsupervised mode (Chapter 5).
2.2 Related Work
Graphical models have been used as probabilistic models for brain activity. In the
early 1990s, Radons et al. used HMMs to code the information contained in the neural
activity of a monkey's visual cortex during different visual stimuli [29]. Later, Gat et
al. used an HMM to discover the underlying hidden states during a go/no-go monkey
experiment. The HMM provided insight into the underlying cortical network
activity of behavioral processes [30]. Specifically, they could identify the behavioral
mode of the animal and directly identify the corresponding collective network activity
[30]. Additionally, by segmenting the data into discrete states, Radons et al. demonstrated
that there may be a dependency in the short-time correlations between neural cells.
In recent years there has been a renewed interest in using generative models for
BMIs. Other researchers use a hybrid generative model to decode trajectories [21]. In
their model, they incorporate neural states and hand states similar to other filter work
but solely in a probabilistic framework. For the continuous hand state they use the
mixture-of-trajectories model.
Although not explicitly for BMIs, graphical models have been used to model the
brain through belief propagation [40]. In particular this work has focused on the visual
cortex and detection of motion using an HMM.
2.3 Background on Graphical Models
Graphical models combine probability theory with graph theory. In the field of
machine learning, graphical models play an increasing role since they handle uncertainty
while reducing complexity. They achieve this by using simple models to build complex
systems. Probability theory ensures that the resulting systems are consistent, and
graphical models also help describe the data. Specifically, graphical models provide an
intuitive way to understand the interaction between multiple variables, as well as the
structure, so that efficient algorithms can be tailor-made. Many scientific fields with
multivariate probabilistic systems implement special cases of graphical models, such as
mixture models, factor analysis, Hidden Markov Models, and Kalman filters.
In the framework of graphical models, nodes represent random variables and
arcs (or their absence) represent assumptions of conditional independence. In turn,
this provides a compact representation of joint probability distributions. As an
example, if $N$ binary random variables represent the joint $P(X_1, \ldots, X_N)$, then
$O(2^N)$ parameters are needed, whereas a graphical model may need far fewer, depending
on the a priori assumptions. Consequently, this type of decomposition helps with
inference and learning.
There are two main kinds of graphical models, undirected and directed. Undirected
graphical models, also known as Markov random fields (MRFs), make no prior
assumptions about causal relationships. Directed graphical models, also known
as Bayesian networks, Belief networks, causal models, generative models, etc,
make assumptions about causal relationships between variables. For example, let Z
represent the set of variables (both hidden and observed) included in the probabilistic
model. A graphical model (or Bayesian network) representation provides insight
into the probability distributions over Z encoded in a graph structure [41]. With this
type of representation, edges of the graph represent direct dependencies between
variables. As stated earlier, the absence of an edge allows the assumption of conditional
independence between variables. Ultimately, these conditional independencies allow a
more complicated multivariate distribution to be decomposed (or factorized) into simple
and tractable distributions [41].
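As an illustration of how factorization shrinks the parameter count, suppose (purely for this example) that the graph is a first-order Markov chain over the $N$ binary variables, so each variable depends only on its predecessor:

```latex
P(X_1, \ldots, X_N) \;=\; P(X_1) \prod_{i=2}^{N} P(X_i \mid X_{i-1})
```

This factorization needs only $1 + 2(N-1)$ free parameters (one for $P(X_1)$ and two for each conditional table) instead of the $2^N - 1$ required by the full joint; for $N = 20$ that is 39 parameters versus 1,048,575.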
Since there are a variety of graphical model representations that decompose
the joint probability of the hidden and observed variables in Z, choosing the best
approximation can be overwhelming. For this dissertation, directed graphical models are
used since they help explain the hidden relationships between the random processes
occurring with the sampled neurons of the brain. One particular directed graphical
model that is discussed throughout this dissertation is the Hidden Markov Model (HMM).
2.3.1 Hidden Markov Models
The HMMs discussed in this dissertation are discrete-output HMMs since they
are computationally simple and less sensitive to initial parameter settings during training
[32]. Consider the graphical model in Figure 2-1E, which represents a hidden Markov
model (HMM). This type of structure decomposes the joint distribution. For a sequence
of length T, we simply ”unroll” the model for T time steps. The Markov property states
that the future is independent of the past given the present. This Markov chain is
parameterized by the triplet $\lambda = \{A, B, \pi\}$, where $A$ is the probabilistic $N \times N$
state transition matrix, $B$ is the $L \times N$ output probability matrix (with $L$ discrete output
symbols), and $\pi$ is the $N$-length initial state probability distribution vector [32, 42]. If
these parameters are treated as random variables (as in the Bayesian approach),
parameter estimation becomes equivalent to inference. If the parameters are treated as
unknown quantities, parameter estimation requires a separate learning procedure.
In order to maximize the probability of the observation sequence O, the model
parameters (A,B, π) must be estimated. Maximizing the probability is a difficult task;
first, there is no known way to analytically solve for the parameters that will maximize
the probability of the observation sequence [32]. Second, even with a finite amount
of observation sequences it is unlikely to find the global optimum for the parameters
[32]. To circumvent this issue, the Baum-Welch method can iteratively choose
$\lambda = \{A, B, \pi\}$ to locally maximize $P(O \mid \lambda)$ [42].
Specifically, the Baum-Welch method first uses the current estimate of the HMM
$\lambda = \{A, B, \pi\}$ and an observation sequence $O = \{O_1, \ldots, O_T\}$ to produce a new estimate
of the HMM given by $\bar{\lambda} = \{\bar{A}, \bar{B}, \bar{\pi}\}$, where the elements of the transition matrix $\bar{A}$ are
\[ \bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \zeta_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}, \quad i, j \in \{1, \ldots, N\}. \tag{2–1} \]
Similarly, the elements of the output probability matrix $\bar{B}$ are
\[ \bar{b}_j(k) = \frac{\sum_{t:\, O_t = v_k} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}, \quad j \in \{1, \ldots, N\},\ k \in \{1, \ldots, L\}, \tag{2–2} \]
and finally the $\bar{\pi}$ vector,
\[ \bar{\pi}_i = \gamma_1(i), \quad i \in \{1, \ldots, N\}, \tag{2–3} \]
where
\[ \zeta_t(i,j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)} \tag{2–4} \]
and
\[ \gamma_t(i) = \sum_{j=1}^{N} \zeta_t(i,j). \tag{2–5} \]
Please note, β is the backward variable, which is similar to the forward variable
α except that now the values are propagated back from the end of the observation
sequence, rather than forward from the beginning of O [42].
Specifically, the $\alpha$ quantity is recursively calculated by setting
\[ \alpha_1(j) = \pi_j\, b_j(o_1) \tag{2–6} \]
\[ \alpha_{t+1}(i) = \Big[ \sum_{j=1}^{N} \alpha_t(j)\, a_{ji} \Big] b_i(o_{t+1}). \tag{2–7} \]
The well-known backward procedure is similar:
\[ \beta_t(j) = P(O_{t+1} = o_{t+1}, \ldots, O_T = o_T \mid S_t = j, \lambda) \tag{2–8} \]
Figure 2-1. Various HMM structures
This computes the probability of the ending partial sequence $o_{t+1}, \ldots, o_T$ given a start at
state $j$ at time $t$. Recursively, $\beta_t(j)$ is defined as
\[ \beta_T(j) = 1 \tag{2–9} \]
\[ \beta_t(j) = \sum_{k=1}^{N} a_{jk}\, b_k(o_{t+1})\, \beta_{t+1}(k). \tag{2–10} \]
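To make the re-estimation concrete, the following is a minimal NumPy sketch of the forward and backward recursions (Eqs. 2–6 through 2–10) and one Baum-Welch update (Eqs. 2–1 through 2–5). It handles a single observation sequence and omits the scaling usually needed for long sequences; the function names are illustrative rather than taken from any implementation in this work.

```python
import numpy as np

def forward_backward(O, A, B, pi):
    """Forward (Eqs. 2-6, 2-7) and backward (Eqs. 2-9, 2-10) recursions.
    O: list of symbol indices; A: N x N transitions; B: L x N emissions
    (B[k, j] = b_j(v_k)); pi: length-N initial distribution."""
    T, N = len(O), len(A)
    alpha = np.zeros((T, N))
    beta = np.ones((T, N))                          # beta_T(j) = 1 (Eq. 2-9)
    alpha[0] = pi * B[O[0]]                         # Eq. 2-6
    for t in range(T - 1):
        alpha[t + 1] = (alpha[t] @ A) * B[O[t + 1]]      # Eq. 2-7
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[O[t + 1]] * beta[t + 1])        # Eq. 2-10
    return alpha, beta

def baum_welch_step(O, A, B, pi):
    """One Baum-Welch re-estimation (Eqs. 2-1 to 2-5); returns the new
    parameters and P(O | lambda) under the old ones."""
    T, L = len(O), len(B)
    alpha, beta = forward_backward(O, A, B, pi)
    PO = alpha[-1].sum()                            # P(O | lambda)
    # zeta_t(i,j) (Eq. 2-4): alpha_t(i) a_ij b_j(O_{t+1}) beta_{t+1}(j) / P(O)
    zeta = np.array([alpha[t][:, None] * A * B[O[t + 1]] * beta[t + 1]
                     for t in range(T - 1)]) / PO
    gamma = alpha * beta / PO                       # gamma_t(i) for all t
    A_new = zeta.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]   # Eq. 2-1
    obs = np.asarray(O)
    B_new = np.array([gamma[obs == k].sum(axis=0)
                      for k in range(L)]) / gamma.sum(axis=0)    # Eq. 2-2
    pi_new = gamma[0]                               # Eq. 2-3
    return A_new, B_new, pi_new, PO
```

Because this is an EM step, re-evaluating $P(O \mid \lambda)$ with the new parameters never decreases the likelihood.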
2.3.2 Moving Beyond Simple HMMs
As discussed earlier, there are a variety of HMM architectures that have been
proposed to address a specific class of problems and to overcome certain limitations
in the traditional HMM. This section will just outline a few models (shown in Figure 2-1)
that are related to finding dependencies within the hidden state space. The standard
fully-coupled HMMs (Figure 2-1A) generally refer to a group of HMM models in which
the state of one model at time $t$ depends on the states of all models (including itself) at
time $t-1$. For $C$ HMMs coupled together, the state transition probability is described as
$P(S_t^{(c)} \mid S_{t-1}^{(1)}, S_{t-1}^{(2)}, \ldots, S_{t-1}^{(C)})$ instead of $P(S_t^{(c)} \mid S_{t-1}^{(c)})$ as in a single standard HMM. In
other words, the state transition probability is described by a $(C+1)$-dimensional matrix,
and the number of free parameters for this transition probability matrix is $N^{C+1}$, which is
exponential in the number of models coupled together (assuming the number of hidden
states $N$ is the same for all models) [43]. Parameter learning is very difficult with this type
of structure. Some researchers have created variations of the fully-coupled HMMs in
order to decrease the model size and ease the complexity of inference. The coupled HMM
[44] proposed by Matthew Brand models the joint conditional dependency as the product
of all marginal conditional probabilities, i.e.,
\[ P(S_t^{(c)} \mid S_{t-1}^{(1)}, S_{t-1}^{(2)}, \ldots, S_{t-1}^{(C)}) = \prod_{c'=1}^{C} P(S_t^{(c)} \mid S_{t-1}^{(c')}) \tag{2–11} \]
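The parameter savings from Eq. 2–11 can be counted directly. The sketch below tallies raw transition-table entries (ignoring sum-to-one constraints), comparing the fully coupled transition tables against the pairwise factorization; the function name is illustrative.

```python
def transition_entries(N, C, factored):
    """Transition-table entries for C coupled chains with N states each.
    Fully coupled: each chain conditions on all C previous states, giving a
    table of N^(C+1) entries per chain. Factored (Eq. 2-11): each chain keeps
    C pairwise N x N tables instead."""
    return C * (C * N * N) if factored else C * N ** (C + 1)

# e.g. 10 chains of 3 states: 1,771,470 entries fully coupled vs. 900 factored
```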
This simplification reduces the transition probability parameter space. It has been
used for recognizing complex human actions/behaviors [44, 45]. However, Brand did not
give any assumption or condition under which this equation holds [43]. Additionally,
variation of the fully coupled HMMs approximates the EM algorithm with particle filtering.
This particular model was used to model freeway traffic [46]. Figure 2-1B shows a specific
coupled HMM called the event-coupled HMM [47]. This model is used for a class
of loosely coupled time series where only the onset of events is coupled in time. The
factorial HMM [48] (Figure 2-1C) represents the opposite of the CHMM by using multiple
hidden state chains to represent a single chain of observables. This model does not
exploit the interaction or dependency between multiple models. In Figure 2-1D,
IO-HMM is used for modeling the input-output sequence pair. Although the IO-HMM is
similar to the coupled HMM the input used in IO-HMM and hidden state from previous
33
time slice are different, certain independent assumption of inputs does not apply to the
hidden states and the inference algorithm used in [49] is only for one HMM model and
again not for general multiple coupled HMMs [43]. Chapter 3 will discuss an alternative
and more reasonable formulation of the CHMM which also reduces the parameter space
but includes parameters that capture the coupling strength between HMMs.
CHAPTER 3
BRAIN MACHINE INTERFACE MODELING: THEORETICAL
3.1 Motivation
In the initial experimentation with HMMs, we treated the neural data as multivariate
features of a single process. First, the multi-dimensional neural data was converted into
a set of discrete symbols for the discrete-output HMMs [16] using the Linde-Buzo-Gray
(LBG) VQ algorithm [50]. Then a single HMM was trained with these symbols corresponding
to neural data of a particular class (movement vs. rest for the monkey food-grasping
task). Unfortunately, this switching classifier achieved at most 87% classification
accuracy [16]. Although these results lent support to using HMMs on BMI data, since
trajectory reconstruction was improved over a single linear model, the results were only
fair and motivated further investigation.
In trying to improve upon this performance, the relevance of particular neurons to
a respective task (i.e., movement or rest) was explored in order to group different
neurons into corresponding subsets. To quantify the differentiation, we examined how
well an individual neuron classifies movement vs. rest when trained and tested on an
individual HMM chain. Since each neural channel is binned into a discrete number of
spikes per 100ms, the neural data was directly used as input.
During the evaluation of these particular HMM chains, the conditional probabilities
for rest and movement are respectively computed as $P(O^{(i)} \mid \lambda_r^{(i)})$ and $P(O^{(i)} \mid \lambda_m^{(i)})$ for
the $i$-th neural channel, where $O^{(i)}$ is the respective observation sequence of binned
firing counts and $\lambda_m^{(i)}$, $\lambda_r^{(i)}$ represent the given HMM chain parameters for the classes of
movement and rest, respectively. To give a qualitative understanding of these weak
classifiers, Figure 3-1 presents the probabilistic ratios from the 14 single-channel HMM
chains (shown between the top and bottom movement segmentations) that produced the
best classifications individually. Specifically, the figure illustrates the simple ratio
\[ \frac{P(O^{(i)} \mid \lambda_m^{(i)})}{P(O^{(i)} \mid \lambda_r^{(i)})} \tag{3–1} \]
for each neural channel in a gray scale gradient format. The darker bands represent
ratios larger than one and correspond to a higher probability for the movement class.
Lighter bands represent ratios smaller than one and correspond to a higher probability
for the rest class. The conditional probabilities nearly equal to one another show
up as gray bands, indicating that classification for the movement or rest classes is
inconclusive. As further support, Table 3-1 quantitatively shows that the individual neural
channels roughly classify the two classes of data better than chance. Essentially, if
the likelihood ratio was larger than one, the sample was labeled movement, while a ratio
less than one was labeled rest. The models were trained on a set of 8,000 samples and
classification was computed on a separate test set of 3,000 samples.
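As a sketch of how these labels and the per-class accuracies in Table 3-1 can be computed, the following assumes the per-sample log-likelihoods under the movement and rest chains have already been evaluated; the array and function names are hypothetical, not from the dissertation's code.

```python
import numpy as np

def ratio_labels(logp_m, logp_r):
    """Label by the per-channel likelihood ratio (Eq. 3-1): a ratio above one
    (log-ratio above zero) is labeled movement (1), otherwise rest (0)."""
    return (np.asarray(logp_m) - np.asarray(logp_r) > 0).astype(int)

def per_class_accuracy(pred, truth):
    """Rest and movement accuracies, as reported per neuron in Table 3-1."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    return ((pred[truth == 0] == 0).mean(), (pred[truth == 1] == 1).mean())
```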
Overall, Figure 3-1 illustrates that the single-channel HMMs roughly classify
movement and rest segments from the neural data (better than random). Specifically,
the figure shows more white bands ($P(O \mid \lambda_r) \gg P(O \mid \lambda_m)$) during rest segments and
darker bands ($P(O \mid \lambda_m) \gg P(O \mid \lambda_r)$) during movement segments. Figure 3-2 is a
zoomed-in picture of Figure 3-1. This figure reveals that some of the neural channel
classifiers tune to certain parts of the trajectory like rest-food, food-mouth, and
mouth-rest. These sub-segmentations lend further support to our hypothesis that
primitives within the data exist.
Having observed the ability of some neurons to roughly classify the two classes
while other neurons were poor, the issue now becomes how to combine this information
properly while maintaining computational simplicity and removing the need for VQ. The next
Table 3-1. Classification performance of example single-channel HMM chains
Neuron # Rest Moving
23 83.4% 75.0%
62 80.0% 75.3%
8 72.0% 64.7%
29 63.9% 82.0%
72 62.6% 82.6%
Figure 3-1. Probabilistic ratios of 14 neurons
Figure 3-2. Zoomed in version of the probabilistic ratios
sections detail the structure and training of different models that merge the classification
performance of these individual classifiers to produce an overall classification decision.
3.2 Independently Coupled HMMs
Structure and training for the ICHMM.
As discussed in Chapter 2, the CHMM models multiple channels of data without
the use of multivariate pdfs on the output variables [44]. Unfortunately, the complexity
($O(TN^{2D})$ or $O(T(DN)^2)$) and the number of parameters necessary for the CHMM (and its
variants) grow exponentially with the number of chains $D$ (i.e., neural channels), where $N$ is
the number of states and $T$ the observation length. For these particular datasets, there is not enough data
to adequately train this type of classifier (under most training procedures). Therefore,
a model must be devised that supports the underlying biological system and can
successfully use the neural data without being intractable.
The neural science community offers a solution that overcomes these shortcomings
while still supporting the underlying biological system. The literature in this area explains
that different neurons in the brain may modulate independently from other neurons [3]
during the control of movement. Specifically, during movement, different muscles may
activate for synchronized directions and velocities, yet, are controlled by independent
neural assemblies or clusters in the motor cortex [3, 14, 51].
Conversely, within the neural clusters themselves, temporal dependencies (and
co-activations) have been shown to exist [3]. Therefore, this classifier assumes
that enough neurons are sampled from different neural clusters to avoid overlap
or dependencies. This assumption is further justified by looking at the correlation
coefficients (CC) between all the neural channels in our data set.
The best CCs (0.59, 0.44, 0.42, 0.36) occurred between only four out of the 10,700
neural pairs, while the rest of the neural pairs were an order of magnitude smaller ($|CC| < 0.01$).
Additionally, despite these weak underlying dependencies, there is a long history of
making such independence assumptions in order to create models that are tractable or
computationally efficient. The Factorial Hidden Markov Model is one example amongst
many [44, 48]. Although an independence assumption is made between the neurons
to simplify the IC-HMM, other models will exploit these weak dependencies as well as
more complicated (spatial-temporal) dependencies in later sections.
By making an independence assumption between neurons, each neural channel
HMM is treated independently. Therefore, the joint probability
\[ P(O_T^{(1)}, O_T^{(2)}, \ldots, O_T^{(D)} \mid \lambda_{full}) \tag{3–2} \]
Figure 3-3. IC-HMM graphical model
becomes the product of the marginals
\[ \prod_{i=1}^{D} P(O_T^{(i)} \mid \lambda^{(i)}) \tag{3–3} \]
of the observation sequences (each of length $T$) for each $i$-th HMM chain $\lambda^{(i)}$. Since the
marginal probabilities are independently coupled, yet try to model multiple hidden
processes, this naive classifier (Figure 3-3) is named the Independently Coupled Hidden
Markov Model (ICHMM) in order to keep the nomenclature simple.
By using an ICHMM instead of a CHMM, the overall complexity reduces from
$O(TN^{2D})$ (or $O(TD^2N^2)$) to $O(DTN^2)$, given that each HMM chain has a complexity
of $O(TN^2)$. Since a single HMM chain is trained on a single neural channel, the
number of parameters is greatly reduced, which in turn reduces the amount of
training data required. Specifically, the individual HMM chains in the ICHMM contain
around 70 parameters for a training set of 10,000 samples, as opposed to the almost 18,000
parameters necessary for a comparable CHMM (due to the dependent states).
The detailed ICHMM structure is as follows:
1. Using a single neural channel $d$, the conditional probabilities $P(O_T^{(d)} \mid \lambda_r^{(d)})$
and $P(O_T^{(d)} \mid \lambda_m^{(d)})$ are evaluated, where
\[ O_T^{(d)} = \{O_{t-T+1}^{(d)}, O_{t-T+2}^{(d)}, \ldots, O_{t-1}^{(d)}, O_t^{(d)}\}, \quad T > 1 \tag{3–4} \]
and $\lambda_r^{(d)}$ and $\lambda_m^{(d)}$ denote HMM chains that represent the two states of the monkey's arm
(moving vs. rest). All of the HMM chains are previously trained on their respective
neural channels using the Baum-Welch algorithm [32, 42] described in Chapter 2.
Based on empirical testing, three hidden states and an observation sequence length $T$
of 10 are chosen for the model. With the dataset described, an observation length of 10
corresponds to one second of data (given the 100ms bins).
2. Normally, the monkey's arm is decided to be at rest if $P(O_T \mid \lambda_r) > P(O_T \mid \lambda_m)$ and
moving if $P(O_T \mid \lambda_m) > P(O_T \mid \lambda_r)$, but in order to combine the predictive powers of all the
neural channels, Equation (3–3) is used to produce the decision boundary
\[ \prod_{i=1}^{D} P(O_T^{(i)} \mid \lambda_m^{(i)}) > \prod_{i=1}^{D} P(O_T^{(i)} \mid \lambda_r^{(i)}) \tag{3–5} \]
or, more aptly,
\[ l(O) = \frac{\prod_{i=1}^{D} P(O_T^{(i)} \mid \lambda_m^{(i)})}{\prod_{i=1}^{D} P(O_T^{(i)} \mid \lambda_r^{(i)})} > \zeta \tag{3–6} \]
where l(O) is the likelihood ratio, a basic quantity in hypothesis testing [38, 40].
Essentially, ratios greater than the threshold ζ are classified as movement and those
less than $\zeta$ as rest. Thresholding the likelihood ratio is common in neural science
and other areas of research [38, 40]. Often, the log-likelihood ratio is used instead
of the likelihood ratio for the decision rule, so that a relative scaling between the
ratios can be found (as well as suppressing any irrational ratios) [38]:
\[ \log(l(O)) = \log \prod_{i=1}^{D} \frac{P(O_T^{(i)} \mid \lambda_m^{(i)})}{P(O_T^{(i)} \mid \lambda_r^{(i)})} = \sum_{i=1}^{D} \log \frac{P(O_T^{(i)} \mid \lambda_m^{(i)})}{P(O_T^{(i)} \mid \lambda_r^{(i)})} \tag{3–7} \]
By applying the log to the product of the likelihood ratios, the sum of the log-likelihood
ratios is compared against the threshold $\log \zeta$. In simple terms, this decision rule asks
how large the probability for one class is compared to the other, and whether this holds
over a majority of the single classifiers.
Note that by varying the threshold log(ζ), classification performance is tuned to fit any
particular requirements for increasing the importance of one class over another. For this
experiment equal importance is assumed for the classes (no bias for one or another).
Moreover, optimization of the classifier is now no longer a function of the individual HMM
evaluation probabilities, but rather a function of overall classification performance.
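The decision rule of Eq. 3–7 reduces to a few lines once the per-channel log-likelihoods are in hand. A minimal sketch, with illustrative names, assuming the $D$ evaluations have already been computed:

```python
import numpy as np

def ichmm_decide(logp_m, logp_r, log_zeta=0.0):
    """Eq. 3-7: sum the per-channel log-likelihood ratios over the D chains
    and compare the sum against the threshold log(zeta).
    logp_m[i], logp_r[i]: log P(O_T^(i) | lambda_m^(i)) and the rest analog."""
    llr = float(np.sum(np.asarray(logp_m) - np.asarray(logp_r)))
    return "movement" if llr > log_zeta else "rest"
```

Summing log-ratios also avoids the numerical underflow of multiplying many small probabilities, a practical reason for preferring the log form.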
The next section outlines methods that move away from the naive assumption of
independence between the neural channels. Specifically, implicit relationships between
neurons will first be explored and then explicit relationships.
3.3 Boosted Mixtures of HMMs
3.3.1 Boosting
Boosting is a technique that creates different training distributions from an initial
input distribution so that a set of weak classifiers is generated [52, 53]. The generated
classifiers then form an ensemble vote for the current data example. These hierarchical
combinations of classifiers are capable of achieving lower error rates than the individual
base classifiers [52, 53]. Therefore, boosting will be used to move beyond the IC-HMM
and exploit the complementary information provided by the independent HMM chains.
Adaboost is the most widely used algorithm to evolve from boosting methods [54].
This algorithm sequentially generates weak classifiers based on weighted training
examples. Essentially, the initial distribution of training examples is re-sampled each
round (based on the distribution of the weights Wi) in order to train the next classifier up
to R rounds [54]. The training examples that fail to be classified on a particular round
receive an increased weighting so that the subsequent classifiers are more likely to be
trained on these hard examples. With the initial values of the weights being W_i = 1/N, for
i = 1, ..., N samples, the update for each weight is

W_i \leftarrow \frac{W_i \exp[\alpha_r \cdot 1(y \neq f_r(x))]}{Z_r} \qquad (3-8)
where α_r is the external weight for the r-th expert trained in the r-th round; these
weights act much like priors for the respective experts. Additionally, Z_r is a
normalization factor that makes the weights W_i a distribution. Concurrently with the
weights W_i for the training examples, each α_r is updated for the final ensemble vote
(a linear combination of the α_r weights and the hypothesis of each expert):
\alpha_r = \log \frac{1 - err_r}{err_r} \qquad (3-9)

where

err_r = \sum_{i=1}^{N} W_i \, 1(y_i \neq f_r(x_i)) \qquad (3-10)
and the final ensemble output becomes

H(x) = \mathrm{sign}\left[ \sum_{r=1}^{R} \alpha_r f_r(x) \right] \qquad (3-11)
The success of boosting has been attributed to the distribution of the "margins" of
the training examples [52]. For further details on Adaboost, boosting, and their
relationship to the margin or to support vector machines, see [54].
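The Adaboost loop of Equations 3-8 through 3-11 can be sketched in a few lines. The threshold stumps and one-dimensional data below stand in for the weak classifiers and are assumptions for illustration only, not the HMM experts used in this work:

```python
import numpy as np

def adaboost(x, y, rounds=10):
    """Minimal Adaboost (Eqs. 3-8 to 3-11) with threshold stumps on 1-D data."""
    n = len(x)
    w = np.full(n, 1.0 / n)                      # initial weights W_i = 1/N
    alphas, stumps = [], []
    for _ in range(rounds):
        best = None
        for thr in x:                            # exhaustive stump search
            for pol in (1, -1):
                pred = pol * np.where(x > thr, 1, -1)
                err = np.sum(w[pred != y])       # weighted error, Eq. 3-10
                if best is None or err < best[0]:
                    best = (err, thr, pol)
        err, thr, pol = best
        err = float(np.clip(err, 1e-10, 1 - 1e-10))
        alpha = np.log((1 - err) / err)          # Eq. 3-9
        pred = pol * np.where(x > thr, 1, -1)
        w = w * np.exp(alpha * (pred != y))      # Eq. 3-8 numerator
        w /= w.sum()                             # Z_r normalization
        alphas.append(alpha)
        stumps.append((thr, pol))
    return alphas, stumps

def predict(x, alphas, stumps):
    """Ensemble vote, Eq. 3-11: H(x) = sign(sum_r alpha_r f_r(x))."""
    votes = sum(a * p * np.where(x > t, 1, -1) for a, (t, p) in zip(alphas, stumps))
    return np.sign(votes)
```

Misclassified samples receive exponentially larger weights, so later rounds concentrate on the hard examples, exactly as described above.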
With respect to improving the IC-HMM, Adaboost offers a promising way to implicitly
find dependencies among the channels by boosting classifiers during training. In the
next section, Adaboost is applied in a different manner to the multidimensional
neural data.
3.3.2 Modeling Framework
BMI data imposes specific constraints that require modifications to the standard
procedures: First, the neural data is high dimensional (100-200 channels) and the
importance of each channel is unknown. Second, the classes have markedly different
prior probabilities. Third, generating a decision from thousands of experts is
computationally infeasible for hundreds of neural channels. Our approach uses
Adaboost with a competitive component as a way to select multichannel experts that
contribute the most information in the training set. Although the algorithm starts out with
parallel training for the independent experts, gradually a winning expert is chosen for
each hierarchical level to combine later into the ensemble. Since each level is formed in
an unsupervised way, this algorithm is a hybrid between supervised and unsupervised
learning, following a process similar to the Mixture of Experts framework [55].
The first major departure from Adaboost comes from how the ensemble is
generated. Instead of forming one expert at a time, the C independent HMM chains
are trained in parallel using the Baum-Welch formulation, where C is the number of
neural channels [42]. This divides the joint likelihood into marginals so that independent
processes are working in simpler subspaces of the input [25]. Then a competitive
phase is initiated. Specifically, a ranking is performed with the experts and a winner
is chosen based on the classification performance for the current distribution of input
examples. The winner that minimizes the error with respect to the distribution of samples
is chosen. The minimal error is calculated using a Euclidean distance (also a departure
from Adaboost) in order to avoid biasing class assignments (since classes may not
have equal priors)[25]. Next, the remaining experts are trained within their respective
subspace but relative to the errors of the previous winner. Finally, the Wi are used to
select the next distribution of examples for the remaining experts. Similar to Adaboost,
the remaining experts are trained on the hard examples from different subspaces. In
turn, a hierarchical structure is formed as the winning experts affect the training on the
local subspaces for the subsequent experts. During this process, the model is implicitly
modeling the dependencies among the channels.
As explained earlier, Adaboost uses the α_c's as external weights on the classifiers, as
opposed to the W_i's, which weight the training examples. The computation of the α_c's
is the second major departure from Adaboost, since a mixture-of-experts formulation is
used for the external weights or mixture coefficients. The Boosted Mixture of Experts
(BME) framework provides the inspiration for finding the mixture coefficients for the
local classifiers [56].
With BME, improved performance is gained through the use of a confidence measure
for the individual experts [56]. Although many different confidence measures exist, the
majority use a scalar function of the expert’s output which is then used as a static gating
function or mixture coefficient [55, 56]. The algorithm uses a simple measure for each
expert based on the L2-norm of the class errors (instead of the one outlined in
Equation 3-9):

\alpha_c = 1 - \sqrt{err_M^2 + err_R^2} \qquad (3-12)
where errM and errR are the respective errors of the two classes in our problem, Move
and Rest (which could generalize to more classes). The variable βc is substituted for the
normal Adaboost formulation of αc (Equation 3–9) to update the Wi’s in Equation 3–8.
Since there is a condition during the boosting phase to discard experts with
less than 50% classification accuracy, negative alphas will not occur [54]. Notice that as the errors
between the two classes are smaller, the weights for the experts become larger. The
proposed Adaboost training algorithm is presented below.
Given: (x_1, y_1), ..., (x_N, y_N) where x_i ∈ X, y_i ∈ Y = {-1, +1}.
Initialize W_i = 1/N, i = 1, ..., N samples.
For l = 1, ..., L rounds:
• Train all of the HMMs using samples from distribution W_i (with replacement).
• Find the c-th expert that minimizes the error with respect to the distribution W_i:
  err_c = E_W[1(y \neq f_c(x))]
• Choose
  \alpha_c = 1 - \sqrt{err_M^2 + err_R^2}
  \beta_c = \log \frac{1 - err_c}{err_c}
• Update:
  W_i \leftarrow \frac{W_i \exp[\beta_c \cdot 1(y \neq f_c(x))]}{Z_l}
• The ensemble output:
  H(x) = \mathrm{sign}\left[ \sum_{c=1}^{C} \alpha_c f_c(x) \right]

where Z_l is a normalization factor so that W_i remains a distribution, \sum_i W_i = 1.
The criterion for stopping is based on two conditions. The first stopping condition
occurs if the chosen experts achieve less than 50% classification accuracy. The second
stopping condition occurs if the cross-validation set shows an increase in error or a
plateau in performance over a significant number of rounds.
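The competitive selection and weighting steps above can be sketched as follows. The per-channel HMM experts are replaced by trivial sign classifiers on the columns of X, and the weighted error is used directly rather than resampling with replacement; both are simplifying assumptions, so this is only a skeleton of the algorithm, not the BM-HMM itself:

```python
import numpy as np

def bm_train(X, y, rounds):
    """Sketch of the competitive boosting loop: each round, the remaining
    channel expert with minimal weighted error wins, receives a mixture
    weight alpha_c (Eq. 3-12), and its errors (via beta_c) reweight the
    training distribution (Eq. 3-8)."""
    n, C = X.shape
    w = np.full(n, 1.0 / n)
    remaining = list(range(C))
    ensemble = []                                 # (alpha_c, channel) pairs
    for _ in range(min(rounds, C)):
        best = None
        for c in remaining:
            pred = np.where(X[:, c] > 0, 1, -1)   # stand-in "expert" for channel c
            err = np.sum(w[pred != y])            # error w.r.t. distribution W_i
            err_m = np.mean(pred[y == 1] != 1)    # per-class error rates
            err_r = np.mean(pred[y == -1] != -1)
            if best is None or err < best[0]:
                best = (err, c, err_m, err_r, pred)
        err, c, err_m, err_r, pred = best
        if err >= 0.5:                            # first stopping condition
            break
        alpha = 1.0 - np.sqrt(err_m ** 2 + err_r ** 2)   # Eq. 3-12
        beta = np.log((1 - err) / max(err, 1e-10))       # beta_c (Eq. 3-9 form)
        w = w * np.exp(beta * (pred != y))               # Eq. 3-8 update
        w /= w.sum()
        ensemble.append((alpha, c))
        remaining.remove(c)
    return ensemble

def bm_predict(X, ensemble):
    votes = sum(a * np.where(X[:, c] > 0, 1, -1) for a, c in ensemble)
    return np.sign(votes)
```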
With respect to other algorithms in the machine learning community, there are a
few ways to interpret the BM-HMM. BM-HMMs can be thought of as a modification to
boosting, or even a simpler version of the Mixture of Trees algorithm if the HMM chains
are interpreted as binary stumps [63]. Additionally, the temporal Markovian dynamics
coupled with the hierarchical structure and mixture modeling can be thought of as a
simple approximation to tree structured HMMs [41]. Other work has focused on solving
this problem of boosting multiple parallel classifiers [64]. Some authors have proposed
boosting solutions that reduce the dimensionality of the input data [64, 69, 70]. From
their perspective, the multidimensional inputs are treated as simple features of a single
random process to be modeled [64]. Our model is different since the input space is
treated as multiple random processes that are interacting with each other in some
unknown way. By decomposing the input space into multiple random processes, the
local contributions of the individual processes are exploited in a competitive fashion
rather than using the global effect of a single process. This type of algorithm also has
components similar to the Mixture of Experts (MOE) algorithm.
Since a single HMM chain is trained on a single neural channel, the number of
parameters is very small and can support the amount of training data. As discussed with
the IC-HMM, the individual HMM chains in the BM-HMM contain around 70 parameters.
In the next section, the complexity of the BM-HMM is slightly increased in order to
explicitly model dependencies between the neurons.
3.4 Linked Mixtures of HMMs
3.4.1 Modeling Framework
To move beyond the IC-HMM and the implicit dependencies of the BM-HMM, another
layer of hidden or latent variables is established to link and express the spatial
dependencies between the lower-level HMM structures (Figure 3-4). This creates
a clique-tree structure T (since there are cycles), where the hierarchical links exist
between neural channels.
The likelihood of the dynamic neural firings from all of the neurons for this
structure (Figure 3-4) is

P(O \mid \Theta) = \sum_{Q} \sum_{M} P(O, Q, M \mid \Theta) \qquad (3-13)
Figure 3-4. LM-HMM graphical model
\log P(O \mid S, M, \Theta) = \log P(M^1) + \sum_{i=2}^{N} \log P(M^i \mid M^{i-1}, \Theta) + \sum_{i=1}^{N} \left( \sum_{t=1}^{T} \log P(O_t^i \mid S_t^i, \Theta^i) + \sum_{t=1}^{T} \log P(S_t^i \mid S_{t-1}^i, M^i, \Theta^i) \right) \qquad (3-14)
where the dependency between the tree cliques is represented by a hidden
variable M in the second layer,

P(M^i \mid M^{i-1}, \Theta^i) \qquad (3-15)

and the hidden state sequence S also has a dependency on the hidden variable M in
the second layer:

P(S_t^i \mid S_{t-1}^i, M^i, \Theta^i) \qquad (3-16)
This hierarchical structure models the data across multiple neural channels while
also exploiting dependencies. Figure 3-4 shows how the lower observable variables O^i
are conditionally independent from the second-layer hidden variable M^i, as well as from
the sub-graphs of the other neural channels T^j (where i ≠ j). The hidden variable M in
the second layer of Equation 3-14 is interpretable as a mixture variable (when the
hierarchical links are excluded).
The LM-HMM implements a middle ground between making an independence
assumption and a full dependence assumption. Since a layer of hidden variables is
added, the computational cost increases. The next section will detail an approximation
that eases computational costs but still maintains the richness of modeling the
interrelationships between neurons.
3.4.2 Training with Expectation Maximization
Although using EM with graphical models provides insight into probability distributions
over the observed and hidden variables, some probabilities of interest are intractable
to compute. Instead of using brute force methods to evaluate such probabilities,
conditional independencies represented in the graphical model can be exploited. Often,
approximations like Gibbs sampling, variational methods, and mean field approximations
are applied in order to make the problem tractable or computationally efficient [41, 79].
For this model, a mean field approximation is used to allow interactions associated
with tractable substructures to be taken into account [79]. The basic idea is to associate
with the intractable distribution a simplified distribution that retains certain terms of the
original distribution while neglecting others, replacing them with parameters u_i that are
often referred to as "variational parameters". Graphically, the method can be viewed as
deleting edges from the original graph until a forest of tractable structures is obtained.
Edges that remain in the simplified graph correspond to terms that are retained in the
original distribution, and edges that are deleted correspond to variational parameters
[79, 81].
Approximations are used to find the expectation of Equation 3-14. In particular,
P(S_t^i \mid S_{t-1}^i, M^i, \Theta^i) is first approximated by treating M^i as independent
from S, making the conditional probability equal to the familiar
P(S_t^i \mid S_{t-1}^i, \Theta^i). Two important features are seen in this type of
approximation. First, the simple lower-level HMM chains are decoupled from the
higher-level M^i variables. Second, each M^i is now regarded as a linked mixture
variable for the HMM chains through P(M^i \mid M^{i-1}, \Theta^i), which is addressed
later [55].
Because the lower-level HMMs have been decoupled, the Baum-Welch formulation
can now be used to compute some of the calculations in the E-step, leaving estimation
of the variational parameter for later. As a result, the forward pass is calculated as
E step:
\alpha_j(t) = P(O_1 = o_1, ..., O_t = o_t, S_t = j \mid \Theta) \qquad (3-17)
This quantity is calculated recursively by setting:
\alpha_j(1) = \pi_j b_j(o_1) \qquad (3-18)

\alpha_k(t+1) = \left[ \sum_{j=1}^{N} \alpha_j(t) a_{jk} \right] b_k(o_{t+1}) \qquad (3-19)
The well-known backward procedure is similar:

\beta_j(t) = P(O_{t+1} = o_{t+1}, ..., O_T = o_T \mid S_t = j, \Theta) \qquad (3-20)

which computes the probability of the ending partial sequence o_{t+1}, ..., o_T given a
start at state j at time t. Recursively, β_j(t) is defined as
\beta_j(T) = 1 \qquad (3-21)

\beta_j(t) = \sum_{k=1}^{N} a_{jk} b_k(o_{t+1}) \beta_k(t+1) \qquad (3-22)
Additionally, a_{jk} and b_j(o_t) are the transition and emission matrices of the model,
which are updated in the M-step. Continuing in the E-step, the posteriors are rearranged
in terms of the forward and backward variables. Let

\gamma_j(t) = P(S_t = j \mid O, \Theta) \qquad (3-23)
which is the posterior distribution. Rearranging these quantities produces

P(S_t = j \mid O, \Theta) = \frac{P(O, S_t = j \mid \Theta)}{P(O \mid \Theta)} = \frac{P(O, S_t = j \mid \Theta)}{\sum_{k=1}^{N} P(O, S_t = k \mid \Theta)} \qquad (3-24)
and now, with the conditional independencies, the posterior is defined in terms of the
α's and β's:

\gamma_j(t) = \frac{\alpha_j(t) \beta_j(t)}{\sum_{k=1}^{N} \alpha_k(t) \beta_k(t)} \qquad (3-25)
We also define
\xi_{jk}(t) = P(S_t = j, S_{t+1} = k \mid O, \Theta) \qquad (3-26)
When expanded,

\xi_{jk}(t) = \frac{P(S_t = j, S_{t+1} = k, O \mid \Theta)}{P(O \mid \Theta)} = \frac{\alpha_j(t) a_{jk} b_k(o_{t+1}) \beta_k(t+1)}{\sum_{j=1}^{N} \sum_{k=1}^{N} \alpha_j(t) a_{jk} b_k(o_{t+1}) \beta_k(t+1)} \qquad (3-27)
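The E-step quantities of Equations 3-17 through 3-27 can be computed for one decoupled chain as follows. This is a standard discrete-emission forward-backward pass; numerical scaling, which a practical implementation would need for long sequences, is omitted from the sketch:

```python
import numpy as np

def forward_backward(pi, A, B, obs):
    """Forward-backward pass for one chain.
    pi: (N,) priors; A: (N, N) transitions a_jk; B: (N, M) emissions b_j(o);
    obs: list of symbol indices. Returns alpha, beta, gamma, xi."""
    N, T = len(pi), len(obs)
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                        # Eq. 3-18
    for t in range(1, T):                               # Eq. 3-19
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[T - 1] = 1.0                                   # Eq. 3-21
    for t in range(T - 2, -1, -1):                      # Eq. 3-22
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)           # Eq. 3-25
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):                              # Eq. 3-27
        num = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :]
        xi[t] = num / num.sum()
    return alpha, beta, gamma, xi
```

A useful sanity check is that the product α_j(t)β_j(t), summed over states, equals P(O | Θ) at every t.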
The M-step departs from the Baum-Welch formulation and introduces the variational
parameter [79]. Specifically, the M-step involves the update of the parameters π_j,
a_{jk}, and b_j(L) (the u_i are saved for later):
M step:

\pi_j = \frac{\sum_{i=1}^{I} u_i \gamma_1^i(j)}{\sum_{i=1}^{I} u_i} \qquad (3-28)

a_{jk} = \frac{\sum_{i=1}^{I} u_i \sum_{t=1}^{T-1} \xi_t^i(j, k)}{\sum_{i=1}^{I} u_i \sum_{t=1}^{T-1} \gamma_t^i(j)} \qquad (3-29)

b_j(L) = \frac{\sum_{i=1}^{I} u_i \sum_{t=1}^{T} \delta_{o_t, v_L} \gamma_t^i(j)}{\sum_{i=1}^{I} u_i \sum_{t=1}^{T} \gamma_t^i(j)} \qquad (3-30)
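A sketch of the weighted M-step of Equations 3-28 through 3-30, pooling the posteriors of I sequences with the variational weights u_i. The argument layout (lists of per-sequence γ and ξ arrays) is an assumption of this sketch:

```python
import numpy as np

def weighted_m_step(gammas, xis, obs_seqs, u, N, M):
    """Re-estimate pi, A, B from per-sequence posteriors, each weighted
    by its variational parameter u_i (Eqs. 3-28 to 3-30).
    gammas[i]: (T, N); xis[i]: (T-1, N, N); obs_seqs[i]: symbol indices."""
    u = np.asarray(u, dtype=float)
    pi = sum(ui * g[0] for ui, g in zip(u, gammas)) / u.sum()        # Eq. 3-28
    A_num = sum(ui * x.sum(axis=0) for ui, x in zip(u, xis))
    A_den = sum(ui * g[:-1].sum(axis=0) for ui, g in zip(u, gammas))
    A = A_num / A_den[:, None]                                        # Eq. 3-29
    B = np.zeros((N, M))
    for ui, g, obs in zip(u, gammas, obs_seqs):
        for t, o in enumerate(obs):
            B[:, o] += ui * g[t]                                      # Eq. 3-30 numerator
    B /= sum(ui * g.sum(axis=0) for ui, g in zip(u, gammas))[:, None]
    return pi, A, B
```

With all u_i equal this collapses to ordinary multi-sequence Baum-Welch re-estimation; the variational weights simply bias the statistics toward the sequences each chain explains well.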
There are two issues left to solve. First, how can the variational parameter be
estimated and maximized given the dependencies? Second, if it is not known
experimentally which neurons are affecting other neurons (if at all), how can the
dependencies between neurons be defined in the model?
3.4.3 Updating Variational Parameter
While still working within the EM framework, the variational parameters u_i are
treated as mixture variables generated by the i-th HMM, each having a prior probability
p_i. The set of parameters is estimated to maximize the likelihood function [34, 80, 81]

\prod_{z=1}^{n} \sum_{i=1}^{I} p_i P(O^{zi} \mid S^i, \Theta^i) \qquad (3-31)
Given the set of sequences and the current estimates of the parameters, the E-step
consists of computing the conditional expectation of the hidden variable M:

u_{zi} = E[M^i \mid M^{i-1}, O^{zi}, \Theta^i] = \Pr[M^i = 1 \mid M^{i-1} = 1, O^{zi}, \Theta^i] \qquad (3-32)
The problem with this conditional expectation is the dependency on M^{i-1}. Since
M^{i-1} is independent from O^i and \Theta^i, this decomposes into

u_{zi} = E[M^i \mid O^{zi}, \Theta^i] \, E[M^i \mid M^{i-1}] \qquad (3-33)
The first term, a well-known expectation for Mixture of Experts, is calculated using
Bayes rule and the prior probability that M = 1:

E[M^i \mid O^i, \Theta^i] = \Pr[M^i = 1 \mid O^i, \Theta^i] = \frac{p_i P(O^i \mid S^i, \Theta^i)}{\sum_{i=1}^{I} p_i P(O^i \mid S^i, \Theta^i)} \qquad (3-34)
Since the integration for the second term is much harder to compute, an integration
approximation is used that also maintains the dependencies. Importance sampling
is a well-known method capable of approximating the integration with a lower
variance than Monte Carlo integration [82]. The integration is approximated as

E[M^i \mid M^{i-1}] = \frac{1}{n} \sum_{z=1}^{n} \frac{P(O^{zi} \mid S^i, \Theta^i)}{P(O^{z(i-1)} \mid S^{i-1}, \Theta^{i-1})} \qquad (3-35)
where the n samples have been drawn from the proposal distribution
P(O^{z(i-1)} \mid S^{i-1}, \Theta^{i-1}). For the estimation of u_i the two terms are
combined:

u_{zi} = \frac{p_i P(O^{zi} \mid S^i, \Theta^i)}{\sum_{i=1}^{I} p_i P(O^{zi} \mid S^i, \Theta^i)} \sum_{z=1}^{n} \frac{P(O^{zi} \mid S^i, \Theta^i)}{n P(O^{z(i-1)} \mid S^{i-1}, \Theta^{i-1})} \qquad (3-36)

The M-step is then computed as

p_i = \frac{\sum_{z=1}^{n} u_{zi}}{\sum_{i=1}^{I} \sum_{z=1}^{n} u_{zi}} = \frac{\sum_{z=1}^{n} u_{zi}}{n} \qquad (3-37)
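Putting Equations 3-34 through 3-37 together, one update of the variational parameters might look like the following sketch. The array layout, the handling of the first chain (which has no predecessor), and the normalization used in the prior update are all assumptions of this sketch:

```python
import numpy as np

def update_u(loglik, p):
    """One variational update. loglik[z, i] holds log P(O^{zi} | S^i, Theta^i)
    for sequence z under chain i; p holds the current priors p_i.
    The first factor is the mixture posterior (Eq. 3-34); the second
    approximates E[M^i | M^{i-1}] by importance sampling against chain i-1
    (Eq. 3-35); the prior update follows Eq. 3-37 (normalized by the total
    weight so the priors remain a distribution)."""
    lik = np.exp(loglik)                              # (n, I)
    post = p * lik
    post /= post.sum(axis=1, keepdims=True)           # Eq. 3-34, per sequence z
    n, I = lik.shape
    dep = np.ones(I)
    dep[1:] = (lik[:, 1:] / lik[:, :-1]).mean(axis=0) # Eq. 3-35; chain 0 has no parent
    u = post * dep                                    # Eq. 3-36 (up to normalization)
    p_new = u.sum(axis=0) / u.sum()                   # Eq. 3-37
    return u, p_new
```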
Borrowing from the competitive nature of the BM-HMM, winners in the LM-HMM are
chosen based on the same criterion of minimizing the Euclidean distance of the
class errors.
3.5 Dependently Coupled HMMs
Up to this point, a variety of HMM models have been discussed that each determine
a coupling relationship between multiple channels. The model discussed in this section
will not only couple multiple HMM models, it will also directly characterize the coupling
relationships. Specifically, a new formulation to the Coupled Hidden Markov Model
(CHMM) (as described by [44]) will be used, which will be referred to as Dependently
Coupled Hidden Markov Model (DC-HMM) in order to maintain nomenclature. With
this formulation the joint conditional probability formulation is modeled as a linear
combination of marginal conditional probabilities with the weights represented
by coupling coefficients. Although some DC-HMM formulations have been shown to be
computationally expensive and nearly intractable [44, 46], this new formulation alleviates
some of those obstacles with a simple approximation. Although computational
complexity is increased beyond the LM-HMMs, a beneficial insight is gained since the
underlying structure within the neural data is revealed through the model’s parameters.
Consequently there may be an opportunity to exploit the underlying structure for
modeling purposes or gain an understanding into the neurophysiologic interactions.
The next section will cover the related work on DC-HMMs and contrast the benefits
of Zhong's [43] formulation. From there, the modeling framework is detailed, as well
as the forward procedure. This forward procedure is important for the discussion on
clustering analysis and neurophysiologic understanding in Chapter 6. Finally, the
learning algorithm for this DC-HMM formulation is presented with a discussion of the
coupling coefficients and how they characterize the coupling between channels.
3.5.1 Modeling Framework
The fully coupled architecture is powerful in modeling interactions among multiple
sequences. The joint transition probability is modeled as
P (S(c)t |S(1)
t−1, S(2)t−1, , , S
(C)t−1) =
C∑c=1
(Θc,cP (S(c)t |S(c)
t−1)) (3–38)
where Θc,c is the coupling weight from model c to model c, i.e. how much S(c)t−1 affects the
distribution of S(c)t . Which is controlled by P (S
(c)t |S(c)
t−1). Essentially, the joint dependency
is modeled as a linear combination of all marginal dependencies [43].
This formulation reduces the number of parameters compared to the standard
coupled HMM [43]. Of course, this model still contains more parameters than multiple
standard HMMs and is computationally more expensive than the LM-HMM, but
computing the coupling relationships between neural channels may yield beneficial
insight into the microstructure of the neural data. The additional parameters come from
the C^2 transition probability matrices, compared to only C matrices in C standard
HMMs, plus the coupling matrix \Theta. Since the number of output symbols required for
the neural data is much larger than the number of hidden states, the increase in the
number of transition matrices does not increase the model complexity dramatically.
Compared to the standard formulation of fully coupled HMMs, this formulation is also
easier to implement.
The DC-HMM is characterized by the quadruplet λ = (π, A, B, Θ), where Θ
contains the interaction parameters between channels (as shown in Figure 3-5).
Assuming there are C coupled HMMs, the parameter space consists of the following
components (with stochastic constraints):
1. prior probability \pi = (\pi_j^{(c)}), 1 \le c \le C, 1 \le j \le N^{(c)}:
\sum_{j=1}^{N^{(c)}} \pi_j^{(c)} = 1 \qquad (3-39)

2. transition probability A = (a_{ij}^{(c',c)}), 1 \le c', c \le C, 1 \le i \le N^{(c')}, 1 \le j \le N^{(c)}:
\sum_{j=1}^{N^{(c)}} a_{ij}^{(c',c)} = 1 \qquad (3-40)

3. observation probability B = (b_j^{(c)}(k)), 1 \le c \le C, 1 \le j \le N^{(c)}, 1 \le k \le M:
\sum_{k=1}^{M} b_j^{(c)}(k) = 1 \qquad (3-41)

4. coupling coefficient \Theta = (\Theta_{c',c}), 1 \le c', c \le C:
\sum_{c'=1}^{C} \Theta_{c',c} = 1 \qquad (3-42)
The forward-backward computation is illustrated in Figure 3-5. At each time slice
there are N^C β's (assuming N^{(c)} = N), an exponential number with respect
to C. The computational complexity would be O(TN^C), making it impractical to
compute the forward-backward variables for any realistic number of channels. To reduce
the computational complexity, a modified forward variable is calculated for each HMM
model separately. The modified forward variable reduces the complexity to O(TCN^2)
and is calculated inductively as
a) Initialization:

\alpha_1^{(c)}(j) = \pi_j^{(c)} b_j^{(c)}(o_1^{(c)}), \quad 1 \le j \le N^{(c)} \qquad (3-43)
Figure 3-5. DC-HMM trellis structure
b) Induction:

\alpha_t^{(c)}(j) = b_j^{(c)}(o_t^{(c)}) \sum_{c'=1}^{C} \Theta_{c',c} \sum_{i=1}^{N^{(c')}} \alpha_{t-1}^{(c')}(i) \, a_{ij}^{(c',c)}, \quad 2 \le t \le T \qquad (3-44)
c) Termination:

P(O \mid \lambda) = \prod_{c=1}^{C} P^{(c)} = \prod_{c=1}^{C} \left( \sum_{j=1}^{N^{(c)}} \alpha_T^{(c)}(j) \right) \qquad (3-45)

Experimental results show that the P(O \mid \lambda) calculated this way is close to the
true value [43].
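The modified forward recursion of Equations 3-43 through 3-45 can be sketched as follows. The data layout (lists of per-chain arrays) is an assumption; with C = 1 and Θ = [[1]] the recursion reduces to the standard forward procedure:

```python
import numpy as np

def dchmm_forward(pis, As, Bs, Theta, obs):
    """Modified forward variable for C coupled chains.
    pis[c]: (N,) priors; Bs[c]: (N, M) emissions; As[cp][c]: (N, N) transition
    matrix a^{(c',c)}; Theta[cp, c]: coupling weight from chain c' to chain c;
    obs[c]: symbol sequence for chain c. Returns the approximate P(O | lambda).
    Cost grows as T*C^2*N^2 in this direct nested-loop form."""
    C, T = len(pis), len(obs[0])
    alpha = [np.zeros((T, len(pis[c]))) for c in range(C)]
    for c in range(C):
        alpha[c][0] = pis[c] * Bs[c][:, obs[c][0]]                 # Eq. 3-43
    for t in range(1, T):
        for c in range(C):
            mix = sum(Theta[cp, c] * (alpha[cp][t - 1] @ As[cp][c])
                      for cp in range(C))                          # Eq. 3-44
            alpha[c][t] = Bs[c][:, obs[c][t]] * mix
    return np.prod([alpha[c][-1].sum() for c in range(C)])         # Eq. 3-45
```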
3.5.2 Training
This subsection details the iterative optimization procedure for learning the
DC-HMM parameters using Zhong’s formulation. The proposed way to solve the
optimization problem is by using Lagrange multipliers. This constrained optimization
technique leads to an iterative re-estimation solution for all of the parameters. To simplify
the writing, only the transition matrix A will be discussed. Similar calculations are
applied to the remaining parameters π, B and Θ. Let L be the Lagrangian of P with
respect to the constraints associated with A
L = P + \sum_{i,c',c} \lambda_i^{(c',c)} \left( \sum_{j=1}^{N^{(c)}} a_{ij}^{(c',c)} - 1 \right) \qquad (3-46)

where the \lambda_i^{(c',c)} are the undetermined Lagrange multipliers. P is locally
maximized when
a_{ij}^{(c',c)} = \frac{a_{ij}^{(c',c)} \, \partial P / \partial a_{ij}^{(c',c)}}{\sum_{k=1}^{N^{(c)}} a_{ik}^{(c',c)} \, \partial P / \partial a_{ik}^{(c',c)}} \qquad (3-47)
While the likelihood function P(O \mid \lambda) is more complicated in this DC-HMM
formulation than in the standard HMM case, P is still a homogeneous polynomial with
respect to each type of parameter (namely \pi, A, B, and \Theta), and each type of
parameter is still subject to stochastic constraints [43].
Overall, the re-estimation formula is guaranteed to converge to a local maximum
[43, 66]. The only difference from the standard HMM case is in calculating the first
derivative of the likelihood function. With standard HMMs, these derivatives reduce to
a form in which only the forward and backward variables are needed [43, 66]. In the
above formulation, the derivatives are calculated using back propagation through time.
Fortunately, computational complexity is not significantly increased since similar forward
procedures are applicable to the calculation of the derivatives. The detailed computation
of these first derivatives is presented in Appendix B.
CHAPTER 4
BRAIN MACHINE INTERFACE MODELING: RESULTS
Although one of the goals for the dissertation is to find unsupervised neural
structures that correspond to underlying kinematic primitives, performance metrics
must first be decided in order to establish baseline results for the models. In the
following chapters, these metrics will serve as a way to compare the performance of
the unsupervised results. This type of comparison is necessary since there are no
known ground truths in real data.
Two metrics can be established by looking at the main BMI applications. The first
metric is classification performance. From a goal-oriented BMI perspective, classification
performance is an important metric in distinguishing between models that represent
goals or simple sub-goals. Although classification is limited with real neural data since
the classes are imposed by the user (using subjective kinematic features), this metric
will be useful for the simulated data since the classes are known. The second metric,
Correlation Coefficient (CC), is derived from current BMI research into trajectory
reconstruction. Most BMI researchers use the correlation coefficient between the
predicted trajectory of their models and the animal’s true arm trajectory in order to
assess the performance. Although the models discussed in this dissertation represent
discrete motion primitives or neural state structures, by treating these multiple models
as a front-end switch to respective Wiener filters, the continuous trajectory can be
reconstructed. Therefore the correlation coefficient becomes applicable as a metric.
Essentially, if the correlation coefficient increases, i.e. trajectory reconstruction is
improved, then the partitioning of the input space has merit.
This chapter will first present classification results on class labels obtained from
kinematic features. In order to assess the importance of these class labels and their
ability to partition the input space, continuous trajectories will be reconstructed in a
fashion similar to a Mixture of Experts framework (Figure 4-1): first the input data is
classified, then the neural data for the winning class trains a respective Wiener filter
(as discussed in Chapter 1). If the class labeling is appropriate, each Wiener filter will
develop a good model for a piece of the trajectory. Subsequently, performance should
improve beyond a single filter trained with the full trajectory. If the classification is bad,
the Wiener filters will mix the segments and performance should not be distinguishable
from a single Wiener filter since the individual filters will each generalize over the full
input space. Essentially all the individual filters will be equivalent, nullifying the effect of
switching between them.
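A minimal sketch of this switched multiple-model reconstruction, with precomputed class labels standing in for the HMM front end and ordinary least-squares fits standing in for the Wiener filters of Appendix A (both stand-ins are assumptions of this sketch):

```python
import numpy as np

def fit_class_filters(X, y, labels):
    """Least-squares fit of one linear (Wiener-style) filter per class label."""
    return {k: np.linalg.lstsq(X[labels == k], y[labels == k], rcond=None)[0]
            for k in np.unique(labels)}

def switched_prediction(X, labels, filters):
    """Route each sample to the filter trained for its class (Figure 4-1)."""
    yhat = np.empty(len(X))
    for k, w in filters.items():
        m = labels == k
        yhat[m] = X[m] @ w
    return yhat

# Toy check: two classes generated by different linear maps of the input.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
labels = np.array([0] * 10 + [1] * 10)
y = np.where(labels == 0, X @ np.array([1., 2., 3.]), X @ np.array([-3., 0., 1.]))
filters = fit_class_filters(X, y, labels)
cc = np.corrcoef(y, switched_prediction(X, labels, filters))[0, 1]
```

When the labeling matches genuinely different input-output regimes, the correlation coefficient of the switched prediction exceeds that of a single filter fit over the full trajectory; when the labels are uninformative, the per-class filters converge to the same solution and the advantage vanishes, exactly as argued above.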
Figure 4-1. Multiple model methodology
Table 4-1. Classification results (BM-HMM selected channels), with monkey data
Model    #channels  % correct
IC-HMM   104        92.4%
IC-HMM   9          87.1%
BM-HMM   9          92.0%
Linear   104        88.3%
Linear   9          86.9%

Table 4-2. Classification results (LM-HMM selected channels), with monkey data
Model    #channels  % correct
IC-HMM   104        92.4%
IC-HMM   9          89.5%
LM-HMM   9          92.1%
Linear   104        88.3%
Linear   9          87.8%
4.1 Monkey Food Grasping Task
4.1.1 Boosted and Linked Mixture of HMMs
The BM-HMM and LM-HMM are both initialized with three hidden states and an
observation sequence length T = 10, which corresponds to one second of data (given
the 100 ms bins). These choices are based on previous efforts to optimize performance
[16]. For the linear classifier, a Wiener filter with a 10-tap delay (corresponding to one
second of data) is used, followed by a threshold for classification. Ten taps are used to
be consistent with the HMM work. For the monkey food grasping task, 8000 samples are
used to train the models while 2000 samples are used as a cross validation set to select
the channels. A separate test set of 5000 samples is used for the classification results
above. These parameters and thresholds (for the linear model) were chosen empirically
from multiple Monte Carlo runs, based on previous work. In earlier work, "Leave-K-Out"
methodologies were employed to ensure that the results on these data sets generalize
[25].
To provide a fair comparison between the methods, the same neural channels that
were chosen by the BM-HMM are also used with the linear model and the IC-HMM.
Similarly, for the LM-HMM, a different set of neurons was selected (although some
overlap with the BM-HMM set). The selection of neurons is determined by the algorithm,
while the number of neural channels is imposed in order to have an equal number of
channels for comparison.
Tables 4-1 and 4-2 show a comparison of the BM-HMM and LM-HMM against the
full IC-HMM and the linear classifier. The BM-HMM and
LM-HMM perform on par with the IC-HMM, but with the added benefit of dimensionality
reduction. Essentially, the same performance is obtained with a fraction of the number of
neural channels. This dimensionality reduction is very important when considering the
hundreds to thousands of channels that will be acquired in future BMI experiments [10].
In order to understand how the two methods differ in exploiting the channels,
Figures 4-2 and 4-3 show the correlation coefficients between the subset of channels
chosen by the BM-HMM and LM-HMM. Interestingly, more of the channels in the
LM-HMM have a large positive and large negative correlation (with respect to the move
class). In contrast, the BM-HMM has some positive correlation amongst its subset, but
few channels exhibit negative correlation.
Figure 4-2. Correlation coefficients between channels (monkey moving)
Tables 4-3 and 4-4 show the effect of randomly selecting the experts. Notice how
the results in these tables are poor in comparison to Tables 4-1 and 4-2; in particular,
they show a significant decrease in performance when modeling the monkey data.
Interestingly, α_c converges to similar values (like an averaging) for both models.
This averaging effect could be due to the lack of dependency among the randomly
selected neurons (i.e., they are more independent). Since many of the channels have
low firing rates
Figure 4-3. Correlation Coefficients between channels (monkey at rest)
or may not even correspond to the particular motions, some of these unimportant
neurons may also reduce performance. The averaging effect is further confirmed by
the results from the IC-HMM since similar performance is achieved with respect to
both models in Tables 4-3 and 4-4. Since the IC-HMM is an averaging model used for
independent neurons, similar performance on the random subsets would indicate that
dependencies are not being exploited by the models. Figure 4-4 shows the correlation
coefficient between the randomly selected channels. Since the colors are all light green,
the correlation between the channels is low to zero, further alluding to independence
between these particular channels. The results further suggests that the LM-HMM is
better at exploiting stronger correlated channels (i.e. dependent) than the BM-HMM, and
much stronger than randomly selecting channels (which may exhibit more independence
between channels).
Additionally, Figure 4-5 shows an expert-adding experiment in which the best-ranked
experts are added one by one to the ensemble vote. As the BM-HMM chains are added
in Figure 4-5, an interesting result emerges: the error rate quickly decreases below the
IC-HMM error when applied to the monkey neural data. The LM-HMM shows a faster
drop in error, but does not achieve the same result as the BM-HMM. Overall,
Figure 4-4. Correlation coefficient between channels (randomly selected for monkey)
the boosted mixtures and linked mixtures exploit more useful and complementary
information for the final ensemble than the simple IC-HMM.
Figure 4-6 presents the peri-event time histograms for all the neural channels in
parallel. These histograms present the firing rate of the different neurons averaged
across a single event (i.e., the start of a movement trial) [71]. In these figures, the
firing rate is normalized so that light colors illustrate decreased firing rates and
darker colors indicate increased firing rates. Overlaid and stretched on each image
Table 4-3. Classification results (random BM-HMM selected channels), with monkey data
Model    #channels  % correct
IC-HMM   104        92.4%
IC-HMM   9          69.1%
BM-HMM   9          69.4%
Linear   104        88.3%
Linear   9          68.1%

Table 4-4. Classification results (random LM-HMM selected channels), with monkey data
Model    #channels  % correct
IC-HMM   104        92.4%
IC-HMM   9          63.5%
LM-HMM   9          65.4%
Linear   104        88.3%
Linear   9          63.2%
Figure 4-5. Monkey expert adding experiment
is the average of the movement trials. Interestingly, some channels have no pattern
associated with them whereas others have a consistent pattern. Also notice how the
BM-HMM selects neurons that display both an increase in firing activity as well as
neurons that decrease their firing during the onset of movement. The LM-HMM has also
been observed to use neurons that reduce their firing rate during movement. Overall,
linear or non-linear filters may not be able to take advantage of these types of neurons
whereas these graphical models incorporate the information as simply different states.
Overall, the results demonstrate three interesting points. First, nine BM-HMM and
LM-HMM chains outperform the linear classifier that uses the full input space on the
monkey data. Second, the subset of experts chosen by the BM-HMM performs
well with the linear model. This result is expected since the BM-HMM chains select
neural channels with important complementary information. Third, when comparing
the BM-HMM and LM-HMM to the linear classifier and the IC-HMM using the same
subset of neural channels, the results show that the hierarchical training of the BM-HMM
Figure 4-6. Parallel peri-event histogram for monkey neural data
and LM-HMM provides a significant increase in performance. This increase is related
to the dependencies that are exploited during each round of training, whereas
the other models simply try to uniformly combine all the neural information into a
single hypothesis. Tables 4-3 and 4-4 support this hypothesis since the LM-HMM and
BM-HMM default to an independent approximation. Finally, other BMI researchers apply
sensitivity analysis to understand the importance of a neural channel with respect to the
kinematics performed by the subject. In contrast, the BM-HMM and LM-HMM channel
selection tries to improve classification results by exploiting dependencies between
channels as well as kinematics. Interestingly, some of the channels selected in both data
sets overlap with neurons selected during sensitivity analysis [20, 67].
4.1.2 Dependently Coupled HMMs
Using the same classes as described for the BM-HMM and LM-HMM, a DC-HMM
(as described earlier) is trained for each class with 5000 data points. Based on empirical
testing with multiple Monte Carlo simulations, the observation length is set to ten
(1 sec of data) and the number of hidden states to three. The chosen neurons are the
same as those selected by the LM-HMM. In order to show the benefit of partitioning
the input space with the DC-HMM, Wiener filters (one per class) are used again in a
second phase to reconstruct the trajectory. These filters are trained as described in
Appendix A, with 10 tap delays. Table 4-5 shows the correlation coefficient results after
using the DC-HMM to partition the input space for trajectory reconstruction by respective
Wiener filters. The table shows that the DC-HMM with 6 neurons does not perform as
well as the LM-HMM but is far better than using random labels or a single Wiener filter.
The results in the table also show that adding more neurons improves the correlation
coefficient (but not beyond the LM-HMM).
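The tap-delay Wiener filters used throughout these experiments can be sketched as a least-squares fit on lagged firing rates. This is a minimal sketch under stated assumptions; the toy data and helper names are illustrative, not the dissertation's code.

```python
import numpy as np

def lagged_design(neural, taps=10):
    """Stack `taps` delayed copies of each channel plus a bias column."""
    n, c = neural.shape
    X = np.ones((n - taps + 1, taps * c + 1))
    for i in range(taps):
        X[:, i * c:(i + 1) * c] = neural[taps - 1 - i: n - i]
    return X

def train_wiener(neural, kinematics, taps=10):
    """Least-squares Wiener filter mapping lagged rates to a kinematic."""
    X = lagged_design(neural, taps)
    w, *_ = np.linalg.lstsq(X, kinematics[taps - 1:], rcond=None)
    return w

# toy data: kinematics are a causal linear function of the firing rates
rng = np.random.default_rng(1)
neural = rng.poisson(3.0, size=(500, 4)).astype(float)
s = neural @ rng.normal(size=4)
kin = s.copy()
kin[1:] += 0.5 * s[:-1]                 # depends on current and previous bin
w = train_wiener(neural, kin)
pred = lagged_design(neural) @ w
cc = np.corrcoef(pred, kin[9:])[0, 1]
```

With ten 100 ms taps, each prediction sees one second of neural history, matching the embedding described above.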
Table 4-5. Correlation coefficient using DC-HMM on 3D monkey data
Experiment CC
DC-HMM (6 Neurons) .81±.18
DC-HMM (29 Neurons) .82±.15
LM-HMM .84±.13
Single Wiener .76±.19
NLMS .75±.20
TDNN .77±.17
Table 4-6. NMSE on 3D monkey data
Experiment NMSE
DC-HMM (6 Neurons) .26
DC-HMM (29 Neurons) .23
LM-HMM .22
Single Wiener .36
Figure 4-7 shows the DC-HMM reconstruction of the kinematic output of the food
grasping task, whereas Figure 4-8 shows the true trajectory. Qualitatively the trajectories
are similar, confirming the correlation coefficients shown above. Additionally, the
normalized mean squared error (NMSE) was computed for the different models, showing
the graphical models outperforming the single Wiener filter. A t-test was applied to
the output of the DC-HMM-based bi-model system and the single Wiener filter with a
significance level of 0.05. The null hypothesis is rejected, with a p-value of
0.02.
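The NMSE and significance test above can be sketched as follows. The two predictions are synthetic stand-ins for the DC-HMM system and the single Wiener filter, so the numbers are illustrative only; here the paired t statistic is computed by hand on the per-sample squared errors.

```python
import numpy as np

def nmse(true, pred):
    """Mean squared error normalized by the variance of the target."""
    return np.mean((true - pred) ** 2) / np.var(true)

rng = np.random.default_rng(2)
true = np.sin(np.linspace(0, 20, 300))
pred_a = true + rng.normal(0, 0.2, 300)   # stand-in: multiple-model output
pred_b = true + rng.normal(0, 0.5, 300)   # stand-in: single Wiener filter

# paired t statistic on the per-sample squared errors
diff = (true - pred_a) ** 2 - (true - pred_b) ** 2
t_stat = diff.mean() / (diff.std(ddof=1) / np.sqrt(len(diff)))
# |t| beyond roughly 1.97 rejects the null at the 0.05 level for 299 dof
```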
Figure 4-7. Supervised monkey food grasping task reconstruction (position)
Figure 4-9 shows the Viterbi hidden state path for the different neurons across
the hidden states. Essentially, the three dimensions (X, Y, and Z) of the monkey's
arm are overlaid over the state paths. On the Y axis there are 24 states (6 neurons with
four states each) at each time bin. The dark lines indicate the largest α (Equation 3–9) at
each time bin. The left figure is the Viterbi path for the DC-HMM trained on movement
data while the right figure is for the data when the monkey is at rest. The figures depict
repeating patterns that correspond to the kinematics. Also notice that states from
different neurons dominate across the data set for each of the classes. Although
not pertinent to the results in this chapter, these Viterbi paths inspire further exploration
in Chapter 6.
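For reference, a Viterbi path such as the ones plotted above can be computed as in this sketch for a generic single-chain, discrete-observation HMM. The two-state toy parameters are assumptions for illustration and are unrelated to the fitted DC-HMMs.

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely hidden-state path for a discrete-observation HMM."""
    S, T = len(pi), len(obs)
    logd = np.log(pi) + np.log(B[:, obs[0]])
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = logd[:, None] + np.log(A)      # (from_state, to_state)
        back[t] = scores.argmax(axis=0)
        logd = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = np.zeros(T, dtype=int)
    path[-1] = logd.argmax()
    for t in range(T - 1, 0, -1):               # backtrack the best path
        path[t - 1] = back[t, path[t]]
    return path

# two-state toy HMM: state 0 prefers symbol 0, state 1 prefers symbol 1
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.1, 0.9]])
B = np.array([[0.9, 0.1], [0.1, 0.9]])
path = viterbi(np.array([0, 0, 0, 1, 1, 1]), pi, A, B)
```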
Figure 4-8. 3D monkey food grasping true trajectory
Figure 4-9. Hidden state space transitions between neural channels (for move and rest)
Figure 4-10 shows the coupling coefficients for each class (move/rest) for the six
neurons in the subset. The coupling coefficient is from the previous neuron's state to the
next. Brighter colors indicate a larger coupling coefficient. The left figure
presents the DC-HMM trained with movement data while the right is for the model of
rest. The diagonal that forms in the figure is the coupling coefficient between the same
neurons (indicating a preference for staying in the same state or neuron). Although some
neurons illustrate weak dependency, overall there is stronger evidence for independence
between neurons. This result may help to explain why even the linear filters are able
to work well with these neurons (since the DC-HMM has less to exploit). Also notice that a few
neurons dominate the models, which reinforces what is shown in Figure 4-9.
Figure 4-10. Coupling coefficient between neural channels (3D monkey experiment)
4.2 Rat Single Lever Task
For the following experiments, the HMMs had three hidden states and an observation
sequence length T = 10, which corresponds to one second of data (given the 100ms
bins). For the linear classifier, a Wiener filter with a 10-tap delay (that corresponds to
one second of data) is used. For the rat lever task, 5000 samples are used to train the
models while 2000 samples are used as a cross-validation set to select the channels. A
separate test set of 3000 samples is used for the classification results below. These
parameters and thresholds (for the linear model) were chosen empirically from multiple
Monte Carlo runs and based on previous work. As with the monkey grasping task,
"Leave-K-Out" methodologies were employed to ensure generalization of the results on
these data sets.
The same neural channels that were chosen by the BM-HMM for the Rat lever
task are also used with the linear model and the IC-HMM in order to provide a fair
comparison between the methods. Similarly, for the LM-HMM, a different set of neurons
(although some overlap the BM-HMM) were selected based on the algorithm but the
number of final channels were kept the same.
Table 4-7. Classification results with rat data (BM-HMM selected channels)
Model    #channels   % correct
IC-HMM   16          62.5%
IC-HMM   6           58.3%
BM-HMM   6           64.0%
Linear   16          61.8%
Linear   6           56.9%
Table 4-8. Classification results with rat data (LM-HMM selected channels)
Model    #channels   % correct
IC-HMM   16          62.5%
IC-HMM   6           56.5%
LM-HMM   6           62.3%
Linear   16          61.8%
Linear   6           55.2%
Tables 4-7 and 4-8 show a comparison of the classification results from the
BM-HMM and LM-HMM versus the full IC-HMM and a simple linear classifier created
by a regression model followed by a threshold [25]. Comparing these tables with Tables 4-1
and 4-2, the classification performance on monkey data is better than on the rat data.
The more important point is that again the BM-HMM and LM-HMM perform on par with
the IC-HMM, but with the added benefit of dimensionality reduction. Effectively, the
same performance is obtained with a fraction of the number of neural channels.
Figure 4-11, the parallel peri-event histogram, shows that the models also capture
rat neurons that dramatically reduce their firing rate during movement.
Figure 4-11. Parallel peri-event histogram for rat neural data
Tables 4-9 and 4-10 show the effect of randomly selecting the experts. Notice that the
results in these tables are poor in comparison to Tables 4-7 and 4-8. The performance
reduction is less pronounced for the data collected from the rat than from the monkey
food grasping task. The discrepancy is likely due to the available number of channels
that are randomly selected. Since the monkey data is collected from a greater number of
channels, there is a higher likelihood of selecting less important neurons for training.
Figure 4-12 presents the results from an expert adding experiment in which the
best-ranked experts are added one by one to the ensemble vote. The figure shows
that as the BM-HMM chains are added, an interesting result emerges: the error rate
quickly decreases below the IC-HMM error when applied to the rat neural data. The
LM-HMM has an even faster drop in error, but does not achieve the same result as the
BM-HMM. Similar to the results for the monkey food grasping task, the boosted mixtures
Table 4-9. Classification results with rat data (random BM-HMM selected channels)
Model    #channels   % correct
IC-HMM   16          62.5%
IC-HMM   6           56.5%
BM-HMM   6           56.9%
Linear   16          61.8%
Linear   6           55.1%
Table 4-10. Classification results with rat data (random LM-HMM selected channels)
Model    #channels   % correct
IC-HMM   16          62.5%
IC-HMM   6           55.4%
LM-HMM   6           54.3%
Linear   16          61.8%
Linear   6           54.8%
and linked mixtures are exploiting more useful and complementary information for the
final ensemble than the simple IC-HMM for the rat lever task.
Figure 4-12. Rat expert adding experiment
4.3 Monkey Cursor Control Task
4.3.1 Population Vectors
The last two sections demonstrated that partitioning the input is beneficial with
simple go/no-go experiments for both animals (rat and monkey). The focus
now turns to motion primitives more complicated than movement and rest. To
accomplish this goal, it is first important to understand what information the neurons
provide about the different arm kinematics. Specifically, motor cortex neurons are known
to initiate motor commands that then dictate the limb kinematics. Therefore this section
will focus on movement direction since it is the most natural way to navigate to a goal [9].
The population vector method [14] was one of the first methods that tried to address the
complicated relationship between movement direction and motor cortical activity. This
method links high neural modulation with a preferred movement direction.
In order to implement the population vector, tuning curve statistics are computed
for each of the neurons. These tuning curves provide a reference of activity for different
neurons. In turn, the neural activity relates to a kinematic vector, such as hand position,
hand velocity, or hand acceleration, often using a direction or angle between 0 and 360
degrees. A discrete number of bins are chosen to coarsely classify all the movement
directions. For each direction, the average neural firing rate is obtained by using a
non-overlapping window of 100ms. The preferred direction is computed using circular
statistics as
\text{circular mean} = \arg\Big( \sum_N r_N e^{i\Theta_N} \Big) \qquad (4–1)
where rN is the neuron’s average firing rate for angle ΘN , and N covers the full
angular range. Figure 4-13 shows an example polar plot of four simulated neurons and
the average tuning information with standard deviation across 100 Monte Carlo trials
evaluated for 16 min duration. The computed circular mean, estimated as the firing
rate weighted direction, is shown as a solid red line on the polar plot. The figure clearly
indicates that the different neurons fired more frequently toward the preferred direction.
Additionally, in order to obtain a statistical evaluation between Monte Carlo runs, the
traditional tuning depth was not normalized to (0, 1) for each realization, as is normally
done with real data. The tuning depth is calculated as

\text{Tuning Depth} = \frac{\max(r_N) - \min(r_N)}{\operatorname{std}(r_N)} \qquad (4–2)
Figure 4-13. Neural tuning depth of four simulated neurons
After preferred directions are acquired, the vectors of angles from predicting
neurons are added to create the trajectory in the population vector method. Although the
results are far from great [14], this type of analysis may lend some insight into a better
segmentation of the input space (relative to angular directions).
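Equations 4-1 and 4-2 can be implemented directly; the cosine-tuned toy neuron below is an assumption for illustration, not one of the recorded units.

```python
import numpy as np

def preferred_direction(rates, angles_deg):
    """Equation 4-1: circular mean, the firing-rate-weighted direction."""
    theta = np.deg2rad(angles_deg)
    return np.angle(np.sum(rates * np.exp(1j * theta)))

def tuning_depth(rates):
    """Equation 4-2: (max - min) of the rate profile over its std."""
    return (rates.max() - rates.min()) / rates.std()

# hypothetical neuron tuned to 90 degrees, sampled over 30 bins of 12 degrees
angles = np.arange(0, 360, 12)
rates = 5 + 4 * np.cos(np.deg2rad(angles - 90))
pref = np.rad2deg(preferred_direction(rates, angles)) % 360
depth = tuning_depth(rates)
```

The `np.angle` of the complex sum recovers the preferred direction plotted as the solid red line in Figure 4-13.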
Figure 4-14. Histogram of 30 angular velocity bins
Since angular values exist on the real line, it is necessary to quantize or bin
the angles. Figure 4-14 shows the histogram for the 30 different angular bins and
the number of examples per bin (i.e., the polar plot is stretched horizontally). Each
bin represents a range of angular velocities (12◦). The figure also presents two
distinguishable peaks in the histogram; this is due to the monkey's hand making
predominantly diagonal movements (Figure 4-15).
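The 12-degree binning can be sketched as below; the sinusoidal velocities are a toy stand-in for the recorded kinematics, chosen so that diagonal motion produces the two-peak histogram described above.

```python
import numpy as np

def bin_angles(vx, vy, n_bins=30):
    """Quantize 2D velocity direction into equal angular bins
    (30 bins of 12 degrees each, as in Figure 4-14)."""
    angles = np.degrees(np.arctan2(vy, vx)) % 360.0
    labels = (angles // (360.0 / n_bins)).astype(int)
    return labels, np.bincount(labels, minlength=n_bins)

# diagonal back-and-forth movement, which should produce two peaks
t = np.linspace(0, 20, 2000)
vx = np.cos(t)
vy = np.cos(t)            # motion along the 45/225-degree diagonal
labels, counts = bin_angles(vx, vy)
```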
4.3.1.1 A-priori class labeling based on population vectors
With a method that links neural modulation to particular angles, it would be interesting
to see if the binned angles themselves are appropriate as class labels. Since some
of the neurons are modulated across many angular bins, a method must be found to
compare the tuning curves. Using the analysis described in the last section, Figure 4-16
provides a comparison between the tuning curves of the neurons drawn in parallel.
Essentially each neuron's tuning depth is drawn horizontally (x-axis) rather than on a
polar plot (angular bins run from left 0◦ to right 360◦). The depths are then plotted in
Figure 4-15. 2D angular velocities
parallel along the y-axis (185 neurons). The darker pixels indicate a higher firing rate for
a particular angle. Since all the tuning depths are normalized, the figure clearly shows
that some neurons are tuned to different angles relative to other neurons. Figure 4-16B
is a plot of the maximum depth at a particular angular bin. Interestingly, Figure 4-16B
shows that a small subset of neurons have a very high (normalized) firing rate during
particular angles. Also, some neurons modulate across multiple bins (i.e., a wide
range of angles).
4.3.1.2 Simple naive classifiers
In this section, two simple winner-take-all experiments are conducted in order to
test if the angular bins could serve as class labels. For the first experiment, the mean
and variance of the firing rate are computed for each neuron (and each angular bin)
under an assumed Gaussian distribution. At each time instance, the largest probability
is found for a particular angular bin, consequently assigning the particular time bin
the respective angle. In experiment two, the mean and variance are also computed on
Figure 4-16. A. Parallel tuning curves B. Winning neurons for particular angles
the firing rate for each neuron (with a Gaussian assumption). Departing from experiment
one, the probability of each angular bin is treated as a marginal for each neuron (at
each time instance). The largest joint probability for each angle is then used to make
the classification decision. For both experiments, classification performance is the only
metric considered.
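The two winner-take-all experiments can be sketched together as below; the second experiment's joint probability is the sum of per-neuron Gaussian log-likelihoods. The two-neuron, two-class toy data is an assumption for illustration (real bins number 10 to 30).

```python
import numpy as np

def fit_gaussians(rates, labels, n_classes):
    """Per-neuron, per-class mean and variance of the firing rate."""
    mu = np.zeros((n_classes, rates.shape[1]))
    var = np.ones((n_classes, rates.shape[1]))
    for k in range(n_classes):
        sel = rates[labels == k]
        mu[k] = sel.mean(axis=0)
        var[k] = sel.var(axis=0) + 1e-6     # guard against zero variance
    return mu, var

def classify(rates, mu, var):
    """Winner-take-all on the joint (naive) Gaussian log-likelihood."""
    ll = -0.5 * (np.log(2 * np.pi * var[:, None, :])
                 + (rates[None] - mu[:, None, :]) ** 2 / var[:, None, :])
    return ll.sum(axis=2).argmax(axis=0)   # best class per time bin

rng = np.random.default_rng(3)
labels = rng.integers(0, 2, 400)
# two neurons whose mean rate depends on the (hypothetical) angular class
rates = rng.normal(loc=np.column_stack([2 + 3 * labels, 5 - 3 * labels]),
                   scale=0.5)
mu, var = fit_gaussians(rates, labels, 2)
pred = classify(rates, mu, var)
acc = (pred == labels).mean()
```

On this cleanly separated toy data the classifier succeeds; the point of the section is that on the real recordings it does not.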
Unfortunately the results from both experiments are poor. In experiment one, the
angular bins could only be correctly classified 10% of the time. Similarly, experiment
two produces poor results. Additionally, classification does not significantly improve
when more coarsely quantized bins are used. Figure 4-17 shows the histogram for 10 bins
(36◦). Figures 4-18A and 4-18B demonstrate similar tuning but at a coarser quantization
level. Notice in Figure 4-18B that fewer neurons cover all of the angles (which
is to be expected with larger quantized bins). Overall, these results show that simple
modulations are not enough to classify the neural input and that more complicated
modeling structures are necessary.
4.3.2 Results for the Cursor Control Monkey Experiment
Although segmenting the neural data into angular velocities did not work in the
simple winner-take-all experiments above, the next step is to test the hypothesis
that the kinematic reconstruction improves if the input is segmented into the binned
Figure 4-17. Histogram of 10 angular bins
Figure 4-18. A. Parallel tuning curves B. Winning neurons for particular Angles
angular velocities. Although segmenting the neural input is known to improve trajectory
reconstruction with multiple models on food reaching tasks, this methodology has
not been tested with continuous 2D trajectories. By using angular velocities for
segmentation, the LM-HMMs and DC-HMMs should capture the neurons that are
modulating for particular angles. With these models isolated for a particular angular
velocity, performance should improve. Although this methodology is supervised, a
baseline is established for the unsupervised learning and will serve to reinforce the
hypothesis that partitioning the input space is beneficial. On a side note, the number of
samples per class and the predominant kinematic features (per class) are imposed on
the generative models when training for these supervised experiments. Even though
angular velocity serves as the predominant kinematic feature in these experiments, in
the later unsupervised experiments the imposed features may not be the predominant
features that the generative models capture.
Table 4-11. Correlation coefficient using different BMI models
Experiment CC(X) CC(Y)
NMCLM(FIR) .67±.03 .48±.07
NMCLM(Gamma) .67±.02 .47±.07
Single Wiener .66±.02 .48±.10
NLMS .68±.03 .50±.08
Gamma .70±.02 .53±.09
Subspace .70±.03 .58±.10
Weight Decay .71±.03 .57±.08
Kalman .71±.03 .58±.10
TDNN .65±.03 .51±.08
Before describing the experiments in detail, Table 4-11 shows the correlation
coefficients for the different BMI models that have been used on this particular cursor
control dataset. Since the following experiments use the same data set, it is appropriate
to compare these correlation coefficients to those produced by our methods. Of
particular interest is the Non-Linear Mixture of Competitive Linear Models (NMCLM),
since its methodology for constructing the trajectory is similar to the multiple-model
approach discussed above. Specifically, the NMCLM divides the input space and applies
a switching mechanism to select a winning Wiener filter to construct the trajectory in a
piecewise fashion [19].
For the following experiments, several binned angular velocities are used as the
class labels. The neurons that modulate for particular angles will be isolated and
help in training Wiener filters that learn only a homogeneous portion of the input/output
space (corresponding to an angular bin). Based on empirical and previous results,
ten taps are used for both the single filter and the multiple filters (one filter per class).
This amount of data corresponds to one second of time (ten 100ms
time bins), which is comparable to the time embedding used by
the methods in Table 4-11. The correlation coefficient of the reconstructed trajectory
provides a metric and baseline comparison for the unsupervised results in Chapter 5.
For the experiments below, a training set of 5000 samples is used along with a
test set of 3000 samples. Training with the classes is a two-step process. First the
LM-HMM and DC-HMM are trained on the segmented data and their parameters
are frozen. Then the linear models are trained with neural data that has been classified
as a particular class by the LM-HMM or DC-HMM, after which the weights of the
linear models are also frozen. During testing, the LM-HMM and DC-HMM are evaluated on the test
set and switch the input to the corresponding linear filter, which then computes the output
trajectory. As mentioned, the data can be quantized into many angular bins;
through empirical testing, four classes (angular bins) were selected.
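The two-step train/test procedure above can be sketched as a gated multiple-model pipeline. The gate labels here are given directly (a stand-in for the LM-HMM/DC-HMM classification), and the data and weights are toy assumptions.

```python
import numpy as np

def piecewise_predict(X, gate_labels, filters):
    """Route each input sample to the linear filter for its class and
    assemble the output trajectory piecewise.

    X: (n_samples, n_features) lagged neural data.
    gate_labels: class index per sample (stand-in for the HMM gate).
    filters: dict class -> weight vector.
    """
    y = np.zeros(len(X))
    for k, w in filters.items():
        sel = gate_labels == k
        y[sel] = X[sel] @ w
    return y

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))
labels = rng.integers(0, 2, 200)
w0, w1 = rng.normal(size=5), rng.normal(size=5)
target = np.where(labels == 0, X @ w0, X @ w1)   # per-class ground truth

# per-class least-squares training, then piecewise reconstruction
filters = {}
for k in (0, 1):
    sel = labels == k
    filters[k], *_ = np.linalg.lstsq(X[sel], target[sel], rcond=None)
pred = piecewise_predict(X, labels, filters)
```

Because each filter only ever sees its own homogeneous portion of the input/output space, the piecewise reconstruction can fit behavior a single global filter cannot.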
With respect to the other models used on this cursor control task, Table 4-12
demonstrates that again the DC-HMM is not quite as good as the LM-HMM, but both
are still better than the single Wiener filter and TDNN (and dramatically better than the
NMCLM).
Figure 4-19 shows the reconstruction (red) versus the true trajectories (blue) of
the monkey cursor control task. Overall, there is some benefit in using the DC-HMM.
Perhaps the subdued results are due to an independent set of neurons acquired in the
data set. With such data, the DC-HMM would not be of as much benefit since it could not
exploit as much information.
Figure 4-20 shows the Viterbi state transitions for the different neurons chosen
for the DC-HMM. Along the Y-axis are the different neural states (24 total: four states
for each of six neurons). Each subplot represents one of the four different classes. Notice
how different neurons are preferred over others with respect to each class. Some
neurons predominantly transition to themselves rather than to other neurons. These
predominant transitions were also observed in the last section with the food grasping
task. Unfortunately, discerning a repeating pattern is difficult with this dataset since the
cursor task is not repetitive like the food grasping task.
Figure 4-21 further confirms that most of the neurons are coupled to themselves
more so than to the other neurons. The figure shows the four different classes and the
coupling coefficients between neurons.
Table 4-12. Correlation coefficient using DC-HMM on 2D monkey data
Experiment CC(X) CC(Y)
DC-HMM .74±.08 .67±.13
LM-HMM .79±.04 .75±.11
Single Wiener .66±.02 .48±.10
TDNN .65±.03 .51±.08
NMCLM(FIR) .67±.03 .50±.07
NMCLM(Gamma) .67±.02 .47±.07
Figure 4-19. True trajectory and reconstructed trajectory (DC-HMM)
Figure 4-20. Hidden state transitions per class (cursor control monkey experiment)
Figure 4-21. Coupling coefficient between neurons per class (cursor control monkey experiment)
CHAPTER 5
GENERATIVE CLUSTERING
5.1 Generative Clustering
Chapter 4 demonstrated that by using simple angular bins or movement/rest as
class labels to partition the neural input, the trajectory reconstruction is remarkably
improved. Consequently, the improvements provided evidence that the neural activity in
the motor cortex transitions through multiple states during movement [13, 35] and that
the switching behavior is exploitable for BMIs.
Unfortunately, to partition the neural input space, a-priori class labels are needed
for separation. However, under real conditions with paraplegics, there are no kinematic
clues to separate the neural input into class labels (or clusters). Currently, most of the
behaving animals engaged in BMI experiments are not paralyzed, allowing the kinematic
information to be used for training the models. This Achilles heel plagues most BMI
algorithms since they require kinematic training data to find a mapping to the neural
data.
Since kinematic clues are not available from paraplegics, neural data must be
exclusively used to find a separation. Finding neural assemblies or structures may
offer a solution. The hypothesis argued throughout this dissertation is that there are
multiple neural structures corresponding to motion primitives. Initial supervised results
support this hypothesis [16]. Therefore the goal is to find a model that can learn these
temporal-spatial structures or clusters and segment the neural data without kinematic
clues or features (i.e., unsupervised). In this chapter, the LM-HMM and DC-HMM
models are combined with a clustering methodology in order to cluster neural data.
These models are chosen for their ability to operate solely in the input space and their
ability to characterize the temporal-spatial space at a reduced computational cost.
The methodology described in the next section will explain how the models learn the
parameters and structure of the neural data in order to provide a final set of class or
cluster labels for segmentation.
Clustering framework.
This section establishes a model-based method for clustering the spatial-temporal
neural signals using the LM-HMM or DC-HMM. In effect, the clustering method tries
to discover a natural grouping of each exemplar S (i.e., a window of multidimensional
neural data) into K clusters. A discriminant (distance) metric similar to K-means is used,
except that the vector centroids are now probabilistic models (LM-HMMs or DC-HMMs)
representing dynamic temporal data [74].
The bipartite graph view (Figure 5-1) assumes a set of N data objects D (e.g.,
exemplars, i.e., windows of data), represented by S_1, S_2, ..., S_N, and K probabilistic
generative models (e.g., LM-HMMs or DC-HMMs), λ_1, λ_2, ..., λ_K, each corresponding
to a cluster of exemplars [75]. The bipartite graph is formed by connections between the data
and model spaces. The model space usually contains members from a specific family of
probabilistic models. A model λ_y can be viewed as the generalized 'centroid' of cluster y,
though it typically provides a much richer description of the cluster than a centroid in the
data space. A connection between an object S and a model λ_y indicates that the object
S is associated with cluster y, with the connection weight (closeness) between
them given by the log-likelihood log p(S|λ_y).
A straightforward design of a model-based clustering algorithm is to iteratively
retrain models and re-partition data objects. Essentially, clustering is achieved by
applying the EM algorithm to iteratively compute the (hidden) cluster identities of data
exemplars in the E-step and estimate the model parameters in the M-step. Although
the model parameters start out as poor estimates, eventually the parameters converge
to their true values as the iterations progress. The log-likelihoods are a natural way to
provide distances between models as opposed to clustering in the parameter space
(which is unknown). Basically, during each round, each training exemplar is re-labeled
Figure 5-1. Bipartite graph of exemplars (x) and models
by the winning model, with the final outcome being a set of labels that relate to
a particular cluster or neural state structure for which spatial dependencies have
also been learned. The dependency structure is learned during the inner loop of the
LM-HMM or DC-HMM training.
Setting the parameters is daunting since the experimenter must choose the number
of states, the length of the exemplar (window size), and the distance metric (e.g.,
log-likelihood). To alleviate some of these model initialization problems, parameter
settings found during earlier work are used for these experiments [16].
Specifically, an a-priori assumption is made that the neural channels share the same
window size and the same number of hidden states. The clustering framework is outlined
below:
Let the data set D consist of N sequences for J neural channels, D = {S_1^1, ..., S_N^J},
where S_n^j = (O_1^j, ..., O_T^j) is a sequence of observables of length T, and let
Λ = (λ_1, ..., λ_K) be a set of models. The multiple sequences in a window of time (size T)
are referred to as an exemplar. The goal is to locally maximize the log-likelihood function

\log P(D \mid \Lambda) = \sum_{S_n^j \in D} \log P(S_n^j \mid \lambda_{y(S_n^j)}) \qquad (5–1)
1. Randomly assign K labels (with K < N), one for each windowed exemplar S_n,
1 ≤ n ≤ N. The LM-HMM parameters are initialized randomly.
2. Train each assigned model with its respective exemplars using the LM-HMM
or DC-HMM procedure discussed earlier. During this step the model learns the
dependency structure for the current cluster of exemplars.
3. For each model, evaluate the log-likelihood of each of the N exemplars given model λ_i,
i.e., calculate L_in = log L(S_n|λ_i) for 1 ≤ n ≤ N and 1 ≤ i ≤ K. The cluster identity of
the exemplar is y(S_n) = argmax_i log L(S_n|λ_i). Then re-label all the exemplars based
on cluster identity to maximize Equation 5–1.
4. Repeat steps 2 and 3 until convergence occurs or until the percentage of exemplars
changing labels falls below a threshold. More advanced metrics for deciding when to
stop clustering could be used (e.g., KL divergence).
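The four steps above can be sketched compactly. A diagonal Gaussian stands in for the LM-HMM/DC-HMM "centroid" of each cluster (an assumption made only to keep the sketch short); the loop structure of random labels, retraining, log-likelihood relabeling, and convergence is the same.

```python
import numpy as np

def model_based_cluster(exemplars, K, n_iter=20, seed=0):
    """K-means-style clustering where each 'centroid' is a generative
    model and closeness is the log-likelihood log p(S | lambda_y).

    exemplars: (N, d) array, one flattened window of data per row.
    """
    rng = np.random.default_rng(seed)
    N, d = exemplars.shape
    labels = rng.integers(0, K, N)               # step 1: random labels
    for _ in range(n_iter):
        mu, var = np.zeros((K, d)), np.ones((K, d))
        for k in range(K):                       # step 2: retrain models
            sel = exemplars[labels == k]
            if len(sel):
                mu[k] = sel.mean(axis=0)
                var[k] = sel.var(axis=0) + 1e-6
        # step 3: log-likelihood of every exemplar under every model
        ll = -0.5 * (np.log(2 * np.pi * var[:, None, :])
                     + (exemplars[None] - mu[:, None, :]) ** 2
                     / var[:, None, :]).sum(axis=2)
        new_labels = ll.argmax(axis=0)           # re-label by winning model
        if np.array_equal(new_labels, labels):   # step 4: convergence
            break
        labels = new_labels
    return labels

# two well-separated synthetic clusters of flattened exemplars
rng = np.random.default_rng(5)
a = rng.normal(0.0, 0.5, size=(100, 4))
b = rng.normal(3.0, 0.5, size=(100, 4))
labels = model_based_cluster(np.vstack([a, b]), K=2)
```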
5.2 Simulations
5.2.1 Simulated Data Generation
Since there are no known ground truths to label real BMI neural data, simulations
on plausible artificial data will help support the results found by the clustering framework
on real data.
The first series of simulations will consist of independent neurons generated from
a realistic neural model. Although many neural models have been proposed [76, 77],
the Linear-Nonlinear-Poisson (LNP) model is selected for the following simulations since
different tuning properties are selectable, allowing it to generate more realistic
neural data. The LNP model consists of three stages. The first is a linear transformation
that is then fed into a static non-linearity to provide the conditional firing rate for a
Poisson spike generating model at the third stage [76, 77].
Two simulated independent neural data sets are generated in the following
experiments. One data set contains four neurons tuned to two classes (Figure 5-2)
and a second data set contains eight neurons tuned to four classes (one pair of neurons
per class). To create these data sets, a velocity time series is first generated with a 100
Hz sampling frequency and 16 min duration (1,000,000 samples in total). Specifically,
a simple 2.5 kHz cosine and sine function is used to emulate the kinematics (X-Y
velocities) for the simulation experiments. Then the entire velocity time series (for both
data sets) is passed through an LNP model with the assumed nonlinear tuning function
in Equation 5–2.
\lambda_t = \exp(\mu + \beta\, \vec{v}_t \cdot \vec{D}_{\text{prefer}}) \qquad (5–2)
where λ_t is the instantaneous firing probability and µ is the background firing rate (set to
.00001). The variable β represents the modulation factor for a preferred direction, which
is set monotonically from 1 to 4 for the four neurons in the two-class simulation and to a
value of 3 for the eight neurons in the four-class simulation. The unit vector \vec{D}_{\text{prefer}} is the
preferred angular direction of the kinematics, which is set to π/4 and 5π/4 for the two-class
simulation and π/4, 3π/4, 5π/4, 7π/4 for the four-class simulation. The spike train is generated
by an inhomogeneous Poisson spike generator using a Bernoulli random variable with
probability λ(t)∆t within each 1ms time window. Once the spike trains are generated,
they are binned into 100ms bins while the velocity ~vt data is down-sampled accordingly.
For each data set, an additional 100 channels of uniformly distributed fake spike
trains (also 16 min each) are combined with the two data sets to create artificial
neural data sets with a total of 104 and 108 neurons, respectively. Essentially, adding the
extra fake channels allows less than an 8% chance for the true channels to be randomly
selected by the generative models.
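An LNP neuron per Equation 5-2 can be sketched as below. The velocities, duration, and preferred direction are illustrative assumptions rather than the exact simulation settings above.

```python
import numpy as np

def lnp_spikes(vel, d_pref, beta=3.0, mu=1e-5, dt=1e-3, seed=0):
    """Linear-Nonlinear-Poisson neuron per Equation 5-2: the firing
    rate is lambda_t = exp(mu + beta * v_t . D_prefer), and a spike
    is a Bernoulli draw with probability lambda_t * dt per 1 ms bin.

    vel: (n, 2) velocity samples; d_pref: preferred direction (radians).
    """
    rng = np.random.default_rng(seed)
    d = np.array([np.cos(d_pref), np.sin(d_pref)])
    lam = np.exp(mu + beta * (vel @ d))          # instantaneous rate
    p = np.clip(lam * dt, 0.0, 1.0)
    return (rng.random(len(vel)) < p).astype(int)

# sinusoidal X-Y velocities standing in for the simulated kinematics
t = np.arange(0, 60.0, 1e-3)                     # 60 s at 1 kHz
vel = np.column_stack([np.cos(2 * np.pi * t), np.sin(2 * np.pi * t)])
spikes = lnp_spikes(vel, d_pref=np.pi / 4)       # neuron tuned to pi/4
```

Binning such spike trains into 100 ms windows (and down-sampling the velocities accordingly) reproduces the data format used in the clustering experiments.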
The second series of simulations consists of dependent neurons in which the
temporal state transitions and dependencies are pre-defined. To produce samples, a
graphical model similar to the DC-HMM is initialized for each desired class/cluster.
The parameters are selected so that overlap between the models of each class/cluster
is reduced. A Gibbs sampler is then employed on each respective model to generate
exemplars of data. Although the Gibbs sampler is not the only sampler available, it is
very easy to implement, since sampling is achieved with the conditional distributions
between the nodes rather than by integrating over the joint (since the conditional
distributions are pre-defined) [79]. After all the samples are generated for each
respective class, the exemplars from each class are artificially placed in an alternating
pattern. In the following dependent neural simulations, four neurons were created, with
each class producing 5000 samples. The data sets also contain 100 channels of fake
neurons in order to assess the robustness of the models (as explained in prior sections).
Figure 5-2. Neural tuning depth of four simulated neurons
5.2.2 Independent Neural Simulation Results
Clustering with the LM-HMM.
Figure 5-3 demonstrates the clustering results using the LM-HMM on the two-class
simulated data set consisting of independent neurons. For this particular experiment,
the model parameters are set (during training) for two classes (k = 2), which is equal to
the true number of classes in the simulated data. Additionally, the class labels alternate
since they represent the alternating kinematics (shown at the bottom of the figure). As
seen from the figure, the model is able to correctly cluster the data in a relatively small
number of iterations (three to four). For the first iteration, each exemplar in the full data
set is randomly assigned to one of the clusters (indicated by green and blue colors). For
the remaining iterations, a pattern starts to emerge that looks similar to the alternating
kinematics. Although the kinematics (cosine and sine wave) are shown below the class
labels, the clustering results were acquired solely from the input space.
Figure 5-3. LM-HMM cluster iterations (two classes, k=2)
Figure 5-4 shows the class tuning preference when the model is initialized with
random data. The term 'tuning preference' refers to the angular preference of a
particular class (or cluster) label. The quantity is calculated the same way as neural
tuning, except that the data are collected from the samples labeled with a particular
class (i.e., circular statistics are computed on the kinematics assigned to class 3 rather
than on a neuron's firing). The figure clearly shows that, before modeling, the random
class labeling has not introduced a preference for any particular angle.
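The tuning-preference calculation can be sketched with circular statistics; the uniform angles and random labels below are illustrative stand-ins for the kinematic data, not the simulation itself.

```python
import numpy as np

def tuning_preference(angles, labels, k):
    """Preferred angle and tuning depth per class: the same circular
    statistics used for neural tuning, computed on the kinematics that
    fall under each cluster label instead of under each neuron."""
    prefs = {}
    for c in range(k):
        z = np.exp(1j * angles[labels == c]).mean()  # mean resultant vector
        prefs[c] = (np.angle(z), np.abs(z))          # (preferred angle, depth)
    return prefs

# Before clustering: random labels over uniform angles give depth near zero
rng = np.random.default_rng(1)
angles = rng.uniform(-np.pi, np.pi, 5000)
random_labels = rng.integers(0, 2, 5000)
prefs = tuning_preference(angles, random_labels, k=2)
```

A resultant length near zero for every class is exactly the "no preferred angle" condition shown for the random initialization.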
Figure 5-4. Tuning preference for two classes (initialized)
Figure 5-5 shows the angular preference of the classes after clustering. Overlaid
in blue are the original angular tunings of some of the neurons. Clearly the model is able
to successfully find the separation in neural firings, since each class is represented
by a different angular tuning (similar to respective neurons). Although angular velocity
shows itself to be a useful kinematic feature to separate the clusters, the models are not
solely restricted to this type of feature. Only in this experiment did the most prevalent
feature appear to be velocity. In later experiments with simulated and real neural data,
this kinematic feature will not be the most obvious.
For the simulation in Figure 5-6, the LM-HMM is used to cluster a four-class
simulated data set (k = 4). Again, the correct number of clusters k = 4 is set during
training to match the true number of classes in the input data (another oscillating
Figure 5-5. Tuned classes after clustering (two classes)
pattern). The figure shows a few issues with shrinkage and expansion of the class
labels; in other words, some labels extend beyond, or fall short of, the actual segment
boundaries. These effects are due to the temporal-spatial nature of the data (this is not
static classification) and the model's propensity to stay in a particular state. Overall the
final result demonstrates that the clustering model with the LM-HMM is able to discover
the underlying clusters present in the simulated independent neural data.
Figure 5-7 shows the clustering on the initial random labels while Figure 5-8 shows
the tuned preference of the four classes after clustering. Remarkably, the model is able
to determine the separation of the four classes using only the neural input. Next, the
clustering model is tested for robustness when the number of clusters is unknown or
increased noise is added to the neural data.
As with all clustering algorithms, choosing the correct number of underlying clusters
is difficult. Choosing the number of clusters for BMI data is even more difficult since
there are no known or established ground truths (with respect to motion primitives).
Figure 5-9 illustrates the case when the clustering model is initialized with four classes
(k = 4) despite the simulation containing only two underlying classes (or clusters) in the
input space. Again, the results are generated within a relatively small number of iterations.
Notice from the figure that the extra two class labels are absorbed into the two classes
Figure 5-6. LM-HMM cluster iterations (four classes, k=4)
shown in the previous Figure 5-3 (also shown below the four class labels). Interestingly,
a repeated pattern of consistent switching occurs with the class labels (as indicated by
the pattern of color blocks). Specifically, Figure 5-9 shows that class 1 precedes class 3
and class 2; when combined, these correspond to class 1 in Figure 5-3, while class 4 in
Figure 5-9 corresponds to class 2 in Figure 5-3. Remarkably, even the neural data from
such a simple simulation is complicated, yet the clustering method finds the consistent
pattern of switching (perhaps indicating that the simple classes are further divisible).
To further test robustness, random spikes are added to the unbinned spike trains
of the earlier tuned neurons (Figure 5-2). Specifically, uniformly random spikes are
generated with a fixed probability of spiking in every 1 ms slot. Figure 5-10 shows the
classification performance as the probability of firing is increased from a 1% to a 16%
chance of spiking per 1 ms. Interestingly, performance does not decrease significantly.
The robustness is due to the tuned neurons still maintaining their underlying temporal
structure. Figure 5-11 shows the tuning polar plots of the four original neurons with the
Figure 5-7. Tuning preference for four classes (initialized)
added random spikes. Although the figure shows that tuning broadens across many
angular bins, the random spikes have no temporal structure; therefore, they do not
significantly displace the temporal structure of the tuned neurons (as indicated by only
a small change in performance). Note that increasing the probability of random spikes
to 16% every 1 ms pushes the spiking beyond the realistic firing rate of real neurons.
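The noise injection can be sketched as follows; the baseline train below is a hypothetical stand-in for the tuned neurons' 1 ms spike trains.

```python
import numpy as np

rng = np.random.default_rng(2)

def add_random_spikes(spike_train, p_spike):
    """OR uniformly random spikes into a binary 1 ms spike train,
    each 1 ms slot spiking independently with probability p_spike."""
    noise = rng.random(spike_train.shape) < p_spike
    return (spike_train.astype(bool) | noise).astype(int)

# Hypothetical sparse 10 s train (10000 one-ms slots)
train = (rng.random(10000) < 0.01).astype(int)
noisy = add_random_spikes(train, p_spike=0.16)  # the 16% extreme case
```

Because the noise is OR-ed in, the original tuned spikes (and hence their temporal structure) are preserved, which is consistent with the small performance change reported above.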
As explained, the different artificial neurons are modulated so that their tuning depth
monotonically increases (i.e., β set from 1 to 4). The LM-HMM clustering successfully
selects the neurons in the correct order (respective to tuning depth) from the 100
random neural channels. The result is the same when the tuned neurons are corrupted
with random spike noise.
Figure 5-8. Tuned classes after clustering (four classes)
Given that the clustering model still achieves good performance under noise, it
is important to ensure that the simulation is not too simplistic. Therefore, two surrogate
data sets are generated from the simulation data. The first surrogate is generated by
randomizing the spatial relationships between the neurons. Specifically, at each time bin
the bin counts for each neuron are randomly switched with another channel. This process
is repeated through the length of the data set, so that the spatial arrangement in bin N
differs from that in bin N − 1.
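A minimal sketch of this spatial surrogate, assuming the data are stored as a (time bins × neurons) count matrix:

```python
import numpy as np

rng = np.random.default_rng(3)

def spatial_surrogate(counts):
    """Destroy spatial structure: inside every time bin, randomly
    reassign the bin counts to different channels."""
    out = np.empty_like(counts)
    for t in range(counts.shape[0]):
        out[t] = counts[t, rng.permutation(counts.shape[1])]
    return out

counts = rng.integers(0, 5, size=(1000, 4))  # (time bins, neurons)
surr = spatial_surrogate(counts)
```

Each bin keeps the same multiset of counts (only their channel assignment changes), so the overall firing statistics survive while the spatial relationships do not.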
Figure 5-12 shows the clustering results on such a surrogate dataset. The
clustering model correctly fails to cluster the data since the spatial information is ruined.
Figure 5-9. LM-HMM cluster iterations (two classes, k=4)
Figure 5-10. Classification degradation with increased random firings
Figure 5-11. Neural tuning depth with high random firing rate
Figure 5-13 shows the tuned preference of the clusters after clustering. This figure
correctly shows that there is no tuned preference since the model failed (as hoped).
The second surrogate is generated by randomizing the temporal relationships
between the neurons. Specifically, at each time bin the bin counts for all of the neurons
are randomly switched with another bin in time (keeping the same channel). This
process is repeated through the length of the data set where the temporal relationship
is destroyed but the spatial relationship is kept intact. Figure 5-14 shows the clustering
results on such a surrogate dataset. The model correctly fails to cluster the data since
the temporal structure is ruined. Figure 5-15 shows the tuned preference of the
Figure 5-12. Surrogate data set destroying spatial information
Figure 5-13. Tuned preference after clustering (spatial surrogate)
clusters after clustering. This figure correctly shows that there is no tuned preference
since the model failed (as hoped).
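A minimal sketch of the temporal surrogate, again assuming a (time bins × neurons) count matrix: whole time bins are permuted so each bin's channel vector moves together.

```python
import numpy as np

rng = np.random.default_rng(4)

def temporal_surrogate(counts):
    """Destroy temporal structure: permute whole time bins so each
    bin's full vector of channel counts moves together, keeping the
    within-bin (spatial) relationships intact."""
    return counts[rng.permutation(counts.shape[0])]

counts = rng.integers(0, 5, size=(1000, 4))  # (time bins, neurons)
surr = temporal_surrogate(counts)
```

Per-channel totals (and all within-bin relationships) are preserved; only the ordering in time is scrambled, which is the property this surrogate is designed to test.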
Figure 5-14. Surrogate data set destroying temporal information
Figure 5-15. Tuned preference after clustering (temporal surrogate)
Clustering with the DC-HMM.
Figure 5-16 demonstrates the clustering results using the DC-HMM on the two-class
simulated data set. For this particular experiment the number of clusters k = 2. The
figure shows that the model is able to correctly cluster the data in a relatively small
number of iterations (three to four). The figure also shows that the class labels,
randomly assigned to the two classes on the first iteration (indicated by green and
brown colors), start to converge to a pattern similar to the kinematics from which the
input was derived (the cosine and sine wave at the bottom of the figure). Although the kinematics are
shown below the class labels, the clustering results are acquired solely from the input
space.
Figure 5-16. DC-HMM clustering results (class=2, K=2)
Figure 5-17 illustrates the hidden state transitions for the DC-HMM on the simulated
data set. Interestingly a repeating pattern matches the corresponding kinematics.
Additionally the figure shows that pairs of neurons are actively involved in the hidden
state space. This pairing is expected since the simulated neurons are in pairs for the
two classes with the extra three neurons being random noise. The result is further
confirmed with Figure 5-18 where only four neurons have active couplings. Furthermore,
the correct coupling is observed between neurons.
Figure 5-17. DC-HMM clustering hidden state transitions (class=2, K=2)
Figure 5-18. DC-HMM clustering coupling coefficient (class=2, K=2)
Figure 5-19 presents the likelihoods generated at each iteration of the clustering
round; the two likelihood plots correspond to the two classes. The figure clearly shows a
converging likelihood for this particular data set. Overall, the DC-HMM captures the
underlying classes/clusters in the simulated neural data.
Figure 5-19. DC-HMM clustering log-likelihood reduction during each round (class=2, K=2)
For the simulation in Figure 5-20, the correct number of clusters k = 4 matches the
underlying number of classes in the input data. The model correctly recovers the
underlying clusters present in the neural data. Unfortunately, the DC-HMM clustering
produces a few more errors than the LM-HMM clustering of the same data.
Figure 5-21 shows the hidden state transitions for the DC-HMM on the simulated
data set with four classes. Interestingly, repeating patterns are observed that correspond
to the kinematics, though not as obviously as in the two-class version. Additionally, the
figure shows that pairs of neurons are actively involved in the hidden state space. This
pairing is expected since the simulated neurons are paired within the classes (excluding
the extra noise neurons). This result is further confirmed with Figure 5-22 where the
correct couplings are observed between neurons.
Figure 5-20. DC-HMM clustering simulated neurons (class=4, K=4)
Figure 5-21. DC-HMM clustering hidden state space transitions between neurons
Figure 5-22. DC-HMM clustering coupling coefficient between neurons (per Class)
Self-Organizing Maps.
To provide a fair comparison of the clustering methods described above, the
simulated data sets are clustered with one of the most common clustering techniques
available, Self-Organizing Maps (SOM). A SOM, sometimes called a Kohonen map
[73], is a type of unsupervised neural network whose goal is to learn a representation
of the input space. SOMs also differ from other neural networks in that they use a
neighborhood function to preserve the topological properties of the input space. For
more details on the SOM please see
Appendix C.
For the experiments described below, various numbers of processing elements
(PEs) were tested empirically. The best compromise between computational complexity
and performance was 25 PEs. The initial step size for the ordering phase was 0.9,
while the converging (mapping) phase started with a step size of 0.02. Although the
static version of the SOM was tested (and failed on this type of data), time embedding
was added to the SOM (Appendix C) in order to provide the best comparison.
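A minimal sketch of a time-embedded 1-D SOM in this spirit (illustrative sizes; for brevity it uses a single decaying-step-size schedule rather than the separate 0.9 ordering and 0.02 mapping phases described above):

```python
import numpy as np

rng = np.random.default_rng(5)

def time_embed(x, depth):
    """Stack `depth` consecutive time bins into one vector so the
    otherwise-static SOM sees short temporal context."""
    return np.stack([x[t - depth + 1:t + 1].ravel()
                     for t in range(depth - 1, len(x))])

def train_som(data, n_pe=25, epochs=20, lr0=0.9):
    """1-D SOM: winner-take-all with a Gaussian neighborhood that
    shrinks, and a step size that decays, over the epochs."""
    w = rng.standard_normal((n_pe, data.shape[1]))
    idx = np.arange(n_pe)
    for e in range(epochs):
        lr = lr0 * (1.0 - e / epochs)                    # decaying step size
        sigma = max(n_pe / 2.0 * (1.0 - e / epochs), 0.5)
        for x in data:
            win = np.argmin(((w - x) ** 2).sum(axis=1))  # best-matching PE
            h = np.exp(-((idx - win) ** 2) / (2.0 * sigma ** 2))
            w += lr * h[:, None] * (x - w)               # pull toward input
    return w

data = time_embed(rng.standard_normal((200, 4)), depth=5)  # toy binned rates
weights = train_som(data)
```

The neighborhood function is what preserves the topology of the input space; after training, each input can be assigned to its winning PE to produce cluster labels.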
Figure 5-23 shows the SOM results on the two class simulation with independent
neurons. The figure shows a successful clustering of the oscillating classes (as
discussed earlier). Classification performance was 96.4%. Overall, this clustering
model is able to cluster the simplistic simulation.
In order to test the robustness of the SOM, noise was added to the simulation.
Figure 5-24 shows the results when the probability of random firing is increased to 16%
per time bin (which is beyond real neural firing rates). Clearly the figure and classification results
(92.3%) demonstrate that the SOM is successful and robust enough to cluster the
simulation with noisy independent neurons.
Next, spatial and temporal surrogates are used to further test the robustness of
the SOM. Figure 5-25 presents the clustering results with the spatial surrogate data.
The classification result of 85.58% is misleading, since all of the spatial information
has been destroyed. The SOM is incorrectly capturing the large peaks in the firing
rates of some of the neurons (Figure 5-26). By construction, the spatial surrogate
randomly reassigns the firing rates of the different neurons to other neurons. This
is interesting because the SOM is failing by artificially capturing structure that does
not exist, simply selecting the bursting activity of some of the neurons. This incorrect
result may help to explain why some of the previous BMI results with non-linear filters
worked better on the ballistic food grasping task, since those types of neurons would be
more modulated (i.e., bursting) at points coinciding with the movement. In contrast, in
the cursor control task the neurons are always firing and modulating, thereby decreasing
the performance of the linear/non-linear models (while the graphical models do not
suffer as much).
Finally, temporal surrogate data is clustered with the SOM. In this instance the SOM
fails (as expected) at clustering this dataset (Figure 5-27). The clustering results were
Figure 5-23. SOM clustering on independent neural data (2 classes)
Figure 5-24. SOM clustering on independent neural data with noise (2 classes)
Figure 5-25. SOM clustering on independent neural data spatial surrogate (2 classes)
Figure 5-26. Neural selection by SOM on spatial surrogate data (2 classes)
just above random (53.04%), which is expected for a data set whose temporal structure
has been destroyed.
Figure 5-27. SOM clustering on independent neural data temporal surrogate (2 classes)
5.2.3 Dependent Neural Simulation Results
Figure 5-28 shows the spike output (100ms bins) from the dependent neurons. In
the figure, four dependently generated neurons and 100 fake neurons are shown with
darker colors indicating a higher firing rate. The figure does not reveal discernible
patterns, even though an alternating pattern underlies the data (as with the original
independent neurons). Figure 5-29 provides further evidence that simplistic patterns do
not exist, since the neurons are not specifically tuned to any particular angle.
Figure 5-28. Output from four simulated dependent neurons with 100 noise channels (Class=2)
Figure 5-30 shows the clustering results using the LM-HMM with the clustering
methodology. Despite no oscillating pattern being visible in the neural data, the
model is able to cluster correctly (as seen in the recovered pattern). The figure also
demonstrates that only a small number of clustering iterations are needed to discern
the pattern. Out of the 104 neurons, the model correctly identified the four that
were pertinent to the clusters.
Figure 5-29. Neural tuning for dependent neuron simulation
Interestingly, Figure 5-31 shows that the DC-HMM is also able to correctly cluster
the neural data with dependencies.
The most interesting results are shown in Figure 5-32. The figure shows the
clustering results from the time-embedded SOM. Clearly the SOM is not successful in
clustering the oscillating classes. The classification results are slightly above random
(53.98%). The SOM is not capturing the pattern since individual neurons are not
bursting or modulating with significant increases in firing rate. Although a state-based
model generated the data, this simulation provides more evidence that the graphical
models have the ability to capture the communication between neurons through time.
Figure 5-30. LM-HMM clustering simulated dependent neurons (class=2, K=2)
Figure 5-31. DC-HMM clustering simulated dependent neurons (class=2, K=2)
Figure 5-32. SOM clustering on dependent neural data (2 classes)
5.3 Experimental Animal Data
5.3.1 Rat Experiments
For the first experiment, 5000 data points of the single lever press experiment are
used for clustering. For the initial clustering round, all of the data points are randomly
labeled for the two classes (lever press and non-lever press). These experiments call for
many parameters to be initialized. These include
1. Observation length (window size)
2. Number of states
3. Number of clustering rounds
The observation length was varied from 5 to 15, corresponding to 0.5 to
1.5 seconds. The number of hidden states was varied from 3 to 5. Finally, the number
of rounds was varied from 4 to 10. After exhausting the different combinations, an
observation length (time window) of 10 was selected along with 3 hidden states and 6
rounds. These parameters were kept the same for all neural channels.
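The exhaustive combination search over these three parameters could be sketched as follows (illustrative only; the per-combination scoring of the clustering results is omitted):

```python
from itertools import product

# The three free parameters and the ranges searched in the text
window_sizes = range(5, 16)   # observation length: 0.5 s to 1.5 s (100 ms bins)
n_states = range(3, 6)        # number of hidden states
n_rounds = range(4, 11)       # number of clustering rounds

grid = list(product(window_sizes, n_states, n_rounds))

# Each combination would be trained and scored (scoring omitted here);
# the setting selected in the text was window 10, 3 states, 6 rounds:
chosen = (10, 3, 6)
```

In practice, each tuple in `grid` corresponds to one full clustering run, with the same parameters shared across all neural channels.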
Figure 5-33 illustrates that the model provides reasonable clustering of the two
classes in the single lever press. With respect to classification performance, the
model is able to correctly classify each class around 66% of the time. Remarkably,
despite 66% being a low number, the unsupervised clustering model achieves a
classification performance that rivals the supervised classification results. Of course,
to compare classifications the class labels must first be assigned, since they are unknown
after clustering. The labels are assigned based on kinematic features, like a lever press,
and the different priors (lever presses have far fewer samples than non-lever presses).
Under normal clustering (without classification comparisons), the class labels are
unknown. Most likely the experimental setup will need to focus on simple tasks by
which the patient expresses their desired goals (move arm left, etc.). This patient-based
direction would allow the different class labels to be appropriately assigned.
Figure 5-33. Rat clustering experiment, one lever, two classes
Figure 5-34 presents a slightly expanded view of Figure 5-33. Notice how the
clustering fails in certain locations. These failures may be due to the experimental setup,
since the rat moves around the cage without its movements being recorded.
In the next experiment, 10000 data points of the two-lever rat experiment are used
for clustering. For the initial clustering round, all of the data points are randomly labeled
for the two classes. The same parameters selected in the previous experiment are used
for this experiment. Unlike the previous experiment, this experiment includes the time
location for cue-signals and rewards. Figure 5-35 shows where the cue signals and
rewards are located as well as the lever presses (red is left, green is right). For example,
on the fifth cue-signal the rat was supposed to press left but instead pressed right (as
indicated by colors on the plot).
Figure 5-35 shows consistent and repeating clustering patterns. Unfortunately, the
results are not as good as the experiment with a single lever press. The difference may
Figure 5-34. Rat clustering experiment zoomed, one lever, two classes
Figure 5-35. Rat clustering experiment, two lever, two classes
be attributed to the type of experiment, since more primitives are being employed in this
double-lever experiment (as well as the lack of data discussed earlier).
Figure 5-36. Rat clustering experiment, two lever, three classes
Figure 5-36 shows the results when clustering the data for the two-lever press into
three classes. Interestingly, there are some consistent results, but visually they are
difficult to interpret. The consistencies involve the transitions from one class to another
(like red to green or green to blue). These transitions coincide with the occurrence of a
kinematic event (i.e., a lever press). Additionally, these transitions are similar to what
was observed in the simulations.
Figure 5-37 again shows some interesting behavior with the clustering results when
using four classes. There are consistencies when looking at the transitions from one
class to another before and after the lever presses. Unfortunately, there are not even
three or four distinct kinematic events applicable for classification, so only qualitative
interpretation is appropriate for this dataset. This lack of quantifiable metrics is why the
monkey data and simulations are more pertinent for testing the discussed methodologies.
Figure 5-37. Rat clustering experiment, two lever, four classes
Overall, the clustering results for the rat data are mixed. In future work, perhaps
isolating the time data around the lever presses when the rat is more focused on the
task could improve modeling. Although not presented here, preliminary results indicate
that this is the case.
5.3.2 Monkey Experiments
LM-HMM.
Two important questions must be answered with the following experiments.
First, what type of clustering results are obtained, i.e., are there repeating patterns
corresponding to the kinematics? Second, how does the trajectory reconstruction
from the unsupervised clustering compare against the trajectory reconstruction from
supervised BMI algorithms? Classification performance is not considered in these
experiments since there are no known classes to test against. Although angular bins
might serve the purpose for single neurons, it is known that some neurons do not simply
modulate for angular bins. Therefore, the correlation coefficient will serve as the metric
and allow a consistent comparison between the results in this work and previous work.
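As a sketch, the correlation coefficient metric for one kinematic coordinate can be computed as below; the sinusoidal trajectory is purely illustrative, not the monkey data.

```python
import numpy as np

def cc(actual, predicted):
    """Pearson correlation coefficient between one coordinate of the
    real trajectory and its reconstruction."""
    return np.corrcoef(actual, predicted)[0, 1]

# Purely illustrative trajectory: a sinusoid plus mild noise
t = np.linspace(0.0, 10.0, 500)
actual = np.sin(t)
predicted = actual + 0.1 * np.random.default_rng(6).standard_normal(500)
score = cc(actual, predicted)
```

In the experiments below, this metric is computed separately for each coordinate (e.g., CC(X) and CC(Y)) over the test trajectory.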
Multiple Monte Carlo simulations were computed to eliminate spurious effects from
initial random conditions (class labels, parameters, etc). The parameters that require
initialization include:
1. Observation length (window size)
2. Number of states
3. Number of clustering rounds
4. Number of classes
To determine the parameters, the observation length (window size) was varied
from 5 to 15 time bins (corresponding to 0.5 to 1.5 seconds), the number of hidden
states from 3 to 5, and the number of clustering iterations from 4 to 10. After exhausting
the different combinations, the parameters were set to an observation length of 5 time
bins, 3 hidden states, and 6 clustering iterations (since less than 5% of the labels
changed). These parameters are the same for each neural channel.
The model was initialized with four classes after an empirical search (using different
parameter sets). Since ground truths are unknown, trajectory reconstruction serves
as the basis for how many clusters to select. Specifically, the number of clusters is
adjusted based on reconstruction performance. Qualitatively, Figure 5-38 shows the
labeling results from the LM-HMM clustering. The y-axis represents the number of
iterations, from the initial random labels (top) to the final clustering iteration (bottom).
Each color in the clustering results corresponds to a different class (four in all). The
kinematics (x and y velocities) are overlaid at the bottom of the image for this cursor
control experiment. Figure 5-38 shows repeating patterns for similar kinematic profiles.
These repetitive class transitions were also observed in the simulated data. Figure 5-39
shows that the trajectory reconstruction matches the original trajectory very closely,
thereby indirectly (qualitatively) validating the segmentation produced by the clustering
method.
Figure 5-38. LM-HMM cluster iterations (Ivy 2D dataset, k=4)
For a quantitative understanding, the correlation coefficient (CC) is a way to show
if the clustering results have merit. Interestingly, the CC results for this unsupervised
clustering are slightly better than the supervised non-linear TDNN and linear Wiener
filter, as shown in Table 5-1. As expected, random labeling of the classes produces
Figure 5-39. Reconstruction using unsupervised LM-HMM clusters (blue) vs. real trajectory (red)
poor results compared to actual clustering. Additionally the random labeling results
are similar to other supervised BMI models. As discussed earlier, the similar results
are due to the random clusters providing generalization of the full space for each filter
(thereby becoming equivalent to a single Wiener filter). Remarkably, Table 5-1 shows
that the correlation coefficient produced by the unsupervised LM-HMM clustering is only
slightly less than the correlation coefficient produced with the supervised version of the
LM-HMM. This result is understandable since the supervised version of the LM-HMM
consistently isolates the neural data based on kinematic features that were imposed by
the user in labeling the classes.
Table 5-1. Correlation coefficient using LM-HMM on 2D monkey data
Experiment CC(X) CC(Y)
LM-HMM (unsupervised) .77±.07 .66±.13
SOM (unsupervised) .65±.04 .59±.12
LM-HMM .79±.04 .75±.11
Single Wiener .66±.02 .48±.10
NMCLM(FIR) .67±.03 .50±.07
NMCLM(Gamma) .67±.02 .47±.07
NLMS .68±.03 .50±.08
Gamma .70±.02 .53±.09
Subspace .70±.03 .58±.08
Weight Decay .71±.03 .57±.08
Kalman .71±.03 .58±.10
TDNN .65±.03 .51±.08
DC-HMM.
In the following experiments, the DC-HMM is used to cluster the food grasping
task and the cursor control task. For each experiment, Wiener filters are applied to
the clustered labels for each class in order to see how the trajectory reconstruction
compares to the supervised reconstruction. An observation length of five and three
hidden states are used to keep the computational load low (these settings empirically
showed adequate results). Additionally, k was set to four classes based on empirical results.
Table 5-2 shows that the correlation coefficient for the unsupervised DC-HMM
clustering reconstruction is better than that obtained using the SOM, random labeling,
or a single Wiener filter. Unfortunately, the results are not as good as the supervised
version of the DC-HMM or LM-HMM. Supervised results are expected to be better since
the kinematic features can serve as classes that demarcate partitionable regions of the
input space, allowing the model to specialize (as opposed to globally learning the class labels).
Table 5-2. Correlation coefficient using DC-HMM on 3D monkey data
Experiment CC
DC-HMM (unsupervised) .72±.19
SOM (unsupervised) .70±.18
Single Wiener .68±.19
NLMS .68±.20
Figure 5-40 demonstrates that the clustering pattern corresponds to the
kinematics. Unfortunately, there are areas of error, or at least what could be perceived
as error, since there are no known ground truths to test against. Nevertheless, the CC
is slightly degraded relative to a supervised version. The degraded results are
attributable to a lack of information between channels. Figure 5-41 shows that there are
again more independent neurons than dependent. Perhaps the LM-HMM exploits the
independent neurons better since it builds a consensus among the channels rather than
modeling an explicit joint distribution.
Figure 5-40. DC-HMM clustering on monkey food grasping task (2 classes)
Figure 5-41. Coupling coefficient from DC-HMM clustering on monkey food grasping task (2 classes)
For the cursor control monkey data, Table 5-3 illustrates that the DC-HMM produces
CC results similar to those of the LM-HMM. Again, both of these clustering results are
better than the supervised versions of the non-linear TDNN and the single linear Wiener
filter. One reason for the improvement on the cursor control experiment over the food
grasping task is that the movements tend to be very velocity based (quick) as opposed
to position based (holding in the air). Another interesting difference is that the cursor
control task has more dependent neurons, as shown in Figure 5-42. This would help to
explain why the Wiener filters produce better results on the food grasping task than on
cursor control (i.e., independent neurons simply modulate with the task). Additionally,
the simulations with dependent neurons showed that the state-space models were
better than the SOM at clustering this type of data.
Table 5-3. Correlation coefficient using DC-HMM on cursor control data
Experiment CC(X) CC(Y)
DC-HMM clustering .71±.09 .65±.13
LM-HMM clustering .77±.07 .66±.13
SOM clustering .65±.04 .59±.12
Single Wiener .66±.02 .48±.10
NLMS .68±.03 .50±.08
TDNN .65±.03 .51±.08
Figure 5-43 shows the effect of averaging the firing rates across a time window
equal to the observation length (five time bins). Each class produces a different firing
pattern across the six selected channels. Interestingly, some of the channels fire more
in the beginning portion of the window, while others fire more towards the end.
Additionally, Figure 5-44 shows the corresponding average kinematics of the four
classes. As mentioned earlier, the monkey makes mostly diagonal movements, and this
is clearly observed in the figure (as hoped).
Figure 5-42. Coupling coefficient from DC-HMM clustering on monkey cursor control task (4 classes)
Figure 5-43. Average firing rate per class (4 classes, 6 neurons)
Figure 5-44. Average velocity per class (4 classes, 6 neurons)
5.3.3 Discussion
The clustering model discussed in this chapter demonstrated the ability to discover
useful clusters while operating solely in the neural input space. The results were first
justified with realistic neural simulations that also included noisy and fake neurons.
Despite the added noise, the clustering method is able to successfully determine the
underlying separation. The division of neural input space was based on the hypothesis
that animals transition between neural state structures during goal seeking analogous to
the motion primitives exhibited during the kinematics [35]. Then the clustering method
was compared to conventional BMI signal processing algorithms on real neural data.
Although trajectory reconstruction is used to show the validity of the clusters, the model
could also serve as the front end for a co-adaptive algorithm or for goal-oriented tasks
(simple classifications that paraplegic patients could select, i.e., move forward).
Despite these encouraging results, improvements in performance are achievable
for the hierarchical clustering. For example, the generative models in the hierarchical
clustering framework may not be taking full advantage of the dynamic spatial relationships.
Although the hierarchical training methodology does create dependencies between the
HMM experts, perhaps there are better ways to exploit the dependencies or aggregate
the local information. As shown by the coupling coefficients and Viterbi state paths
from the DC-HMM results, there may be important dependencies: different neural
processes interact with one another in an asynchronous fashion, and that underlying
structure could provide insight into the intrinsic communication occurring between
neurons.
As a final point, an interesting effect was observed in the experiments (with both
simulated and real neural data). Looking closely at some of the results, consistent
transitions occur from certain classes to others. For example, there may be a consistent
transition from class 1 to class 3 and from class 2 to class 1. Investigating this
phenomenon further would be worthwhile. Perhaps there is a switching behavior
between stationary points in the input space that could be exploited. This exploration
could also give rise to neurophysiologic understanding of the underlying communication
between neurons. Chapter 6 explores this possibility by looking within the model's state
transitions.
CHAPTER 6
CONCLUSION AND FUTURE WORK
Brain machine interfaces have the potential to restore movement to patients
experiencing paralysis. Although great progress has been made towards BMIs, there
is still much work to be done. This dissertation addressed some of the problems
associated with the signal processing side of BMIs.
Since there is a lack of understanding of how neurons communicate within the
brain, probabilistic modeling was argued to be the best approach for BMI experiments.
Specifically, generative models were proposed with hidden variables to help model the
multiple interacting processes (both hidden and observable). This led to a paradigm
shift similar to divide-and-conquer, but with generative models. The generative
models also demonstrated that they can operate solely in the neural input space. This
is appropriate since the desired kinematic data is not available from paralyzed patients.
Most BMI algorithms ignore this very important point.
Three major limitations were also addressed with generative models in this
dissertation. These include training the parameters, defining the neural state structure,
and segmenting the data (or clustering). Chapter 3 dealt with training the parameters
and with making a-priori assumptions about the neural state structures. A simple
independence assumption was made first; the chapter then explored implicit and
explicit dependencies between neural channels. Hierarchical modeling frameworks
were presented, and spatial relationships were demonstrated to be important since
they can improve or maintain results with less data. Additionally, competitive training
methodologies were introduced that reduced the input space by having neurons
compete for inclusion in the final models. Finally, a fully coupled structure
was discussed to see if adding more dependencies to the model improved performance.
This fully coupled structure exposed the limits of increasing model complexity under
data limitations. With respect to modeling, the fully coupled structure provides insights
into the underlying biological relationships between neurons.
A-priori knowledge of the graphical structure and class labels were the final
problems with BMI data that the dissertation addressed. Chapter five presented a
clustering approach in which the data space was partitioned based on a likelihood
criterion. This treated the model as a centroid for each cluster and the parameters
were found accordingly. In the next section discussing future work, we will show the
inverse methodology and cluster in the model space using an optimal state-path finding
algorithm to define chain-like models across time and neural channels. Although the
results are similar to those of the likelihood-based method, this approach provides a way
to understand structure and dependencies between channels in time. Understandably,
some results are similar since both methods are approximations to unknown joint
probabilities between many variables (for which we do not have enough data to support
a full network).
Some of the advantages of the generative models over the conventional BMI
signal processing algorithms were also confirmed. This included modeling neurons
that decrease their firing during movement and the ability to separate the neural input
space. The partitioning of the neural input space was based on the hypothesis that
animals transition between neural state structures during goal seeking. These neural
structures are analogous to the motion primitives exhibited in the kinematics [35].
The results were justified by the improvement in trajectory reconstruction. Specifically,
the correlation coefficient on the trajectory reconstruction served as a metric to compare
against other BMI methods. Additionally, simulations were used to show the model’s
ability to cluster unknown data with underlying dependencies. This is necessary since
there are no ground truths in real neural data.
Overall, the work described in this dissertation demonstrated improved approaches
to modeling BMIs. This work also addressed the most important problem for patients
without limbs (i.e., the unavailability of desired kinematic data for training). We did this
by presenting clustering results on simulated and real data that outperform supervised
results. Ultimately, the modeling methodologies within this dissertation can be used as a
front end for either forward modeling (e.g., a linear filter) or for a BMI goal-oriented
system.
6.1 Future Work
Chapter 5 described a clustering methodology that partitioned the data space with
the LM-HMM and DC-HMM models. These probabilistic models represent dynamic
temporal data and act as centroids, while the computed likelihood of each model
determines the appropriate cluster label (similar to K-means). In effect, this clustering
methodology produces a natural grouping of the spatial-temporal exemplars into K
clusters. Unfortunately, using only the data space limits the model's ability to
discriminate clusters. Perhaps the richness of the model space could serve in clustering
the data and allow insight into the neurophysiological interactions between neurons.
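As an illustrative sketch (not the implementation used in this work), the likelihood-based assignment step can be written in Python for simple discrete-observation HMM centroids. The toy one-state models, the `refit` hook, and all parameter values below are hypothetical:

```python
import numpy as np

def log_likelihood(pi, A, B, obs):
    """log P(O | lambda) via the scaled forward recursion."""
    alpha = pi * B[:, obs[0]]
    ll = np.log(alpha.sum()); alpha = alpha / alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        ll += np.log(alpha.sum()); alpha = alpha / alpha.sum()
    return ll

def likelihood_cluster(models, sequences, rounds=5, refit=None):
    """K-means-style clustering: each HMM acts as a centroid and each
    sequence joins the model under which it is most likely."""
    labels = None
    for _ in range(rounds):
        labels = [int(np.argmax([log_likelihood(*m, s) for m in models]))
                  for s in sequences]
        if refit is not None:  # e.g. one EM pass per cluster (omitted here)
            models = [refit(models[k],
                            [s for s, l in zip(sequences, labels) if l == k])
                      for k in range(len(models))]
    return labels, models

# Hypothetical one-state models over two symbols: model 0 favors symbol 0,
# model 1 favors symbol 1.
pi = np.array([1.0]); A = np.array([[1.0]])
B0 = np.array([[0.8, 0.2]]); B1 = np.array([[0.2, 0.8]])
models = [(pi, A, B0), (pi, A, B1)]
seqs = [[0, 0, 0], [1, 1, 1], [0, 1, 0]]
labels, _ = likelihood_cluster(models, seqs, rounds=1)
```

Here the three toy sequences are assigned to whichever centroid model explains them best, mirroring the assign-then-refit loop described above.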
Therefore, this chapter explores a clustering methodology that is based on the
model space or the actual model structure (i.e. neural structure) to discriminate the
neural data for particular motion primitives. Specifically, the model structure directly
relates to the dependencies of the hidden states representing neural dependencies (at
least within the data rather than biologically). In the previous chapters, the goal was
to find the global (inter-model) dependencies between channels while now the local
(intra-model) dependencies are sought through time and across the neural channels.
Figure 6-1 shows the bipartite graph in which the models are now sequestering the data
points.
Clustering the model structures or parameters requires a structure learning
algorithm. These types of algorithms employ searching and scoring functions in
order to build the structure of the model. Since the number of model structures is
large (exponential), a search method is needed in order to decide which structures
to score.

Figure 6-1. Bipartite graph of exemplars (x) and models

Even a graphical model with a small number of nodes contains too many
networks to score exhaustively. A greedy search could be done by starting with an initial
network (with or without connectivity) and iteratively adding or deleting an edge, measuring
the accuracy of the resulting network at each stage, until a local maximum is found.
Alternatively, a method such as simulated annealing could guide the search toward the
global maximum. Iterating through all of the structures is computationally expensive. The
problem is further complicated when multiple structures must be determined for multiple
clusters or classes. Unfortunately, when the classes are undefined (as in the experiments
in this dissertation), finding these structures is intractable. Therefore, a simple iterative
method must be employed to approximate the structures for each respective class.
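To make the search-and-score idea concrete, here is a minimal greedy hill-climbing sketch in Python. The scoring function is a hypothetical stand-in for a real network score such as BIC, and the "true" graph exists only for this toy example:

```python
import itertools

def greedy_structure_search(n_nodes, score, max_iter=100):
    """Greedy hill-climbing over directed edges: start with the empty graph,
    then repeatedly flip (add or delete) the single edge that most improves
    the score, stopping at a local maximum."""
    edges = set()
    best = score(edges)
    for _ in range(max_iter):
        candidates = []
        for e in itertools.permutations(range(n_nodes), 2):
            trial = edges ^ {e}                    # flip one edge
            candidates.append((score(trial), trial))
        top_score, top_edges = max(candidates, key=lambda c: c[0])
        if top_score <= best:
            return edges, best                     # local maximum reached
        edges, best = top_edges, top_score
    return edges, best

# Toy score (hypothetical): reward edges of a hidden "true" graph, penalize extras.
true_edges = {(0, 1), (1, 2)}
score = lambda g: len(g & true_edges) - len(g - true_edges)
g, s = greedy_structure_search(3, score)
```

With this toy score, the climb recovers the planted two-edge graph; with a real likelihood-based score, the same loop would face the exponential landscape described above, which is why the dissertation resorts to a simpler iterative approximation.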
Rather than use a conventional search method, a single generative model will be
trained over the full data set. Then a state-path finding algorithm similar to Viterbi will
find the most likely paths for each data exemplar. These paths represent the plausible
dependency structure between channels through time. Once these structures are
found, the corresponding neural data will be grouped together. Obviously, the number of
structures will be large; therefore, these structures need to be merged into fewer cluster
centers, otherwise the computational complexity could become significantly worse.
The next section will explore this method for finding plausible structures. First,
an optimal state path finding algorithm similar to Viterbi will be discussed. Then a
simple histogram methodology will be outlined for clustering the structures. Finally, a
simple way to trim the number of structures will be discussed. After the methodology is
explained, experiments with the simulation data as well as the real neural data will be
discussed.
6.1.1 Towards Clustering Model Structures
There are several ways to find the optimal state sequence associated with given
observation sequences. The difficulty lies with the definition of the optimal state
sequence, i.e., there are several optimality criteria. We define

ζ_t^{(c)}(i) = P(q_t^{(c)} = Q_i^{(c)} | O, λ)       (6–1)

i.e., the probability of being in state Q_i^{(c)} at time t, given the observation sequence
O and the model λ in chain c. Equation 6–1 can be expressed simply in terms of the
forward variables, i.e.,

ζ_t^{(c)}(i) = α_t^{(c)}(i) / P(O|λ)
             = [ b_j^{(c)}(O_t) Σ_{c'=1}^{C} Θ_{c'c} Σ_{i=1}^{N^{(c')}} α_{t-1}^{(c')}(i) a_{ij}^{(c',c)} ] / [ ∏_{c=1}^{C} Σ_{j=1}^{N^{(c)}} α_T^{(c)}(j) ]       (6–2)
since α_t^{(c)}(i) accounts for the partial observation sequence O_1^{(c)} O_2^{(c)} ... O_t^{(c)} and
the contributions from each channel's previous states. The normalization factor
∏_{c=1}^{C} Σ_{j=1}^{N^{(c)}} α_T^{(c)}(j) makes ζ_t^{(c)}(i) a probability measure so that

Σ_{c=1}^{C} Σ_{i=1}^{N} ζ_t^{(c)}(i) = 1.       (6–3)
Using ζ_t^{(c)}(i), the most likely state q_t at time t can be solved for individually, as

q_t = argmax_i [ζ_t^{(c)}(i)],   1 ≤ t ≤ T       (6–4)
By selecting the most likely state on a particular neural channel for each time step,
the size of the model structure is greatly reduced. Essentially, a chain structure G_k is
constructed through time and across the channels. Since each model G_k has the same
number of parameters, using BIC to score the models is inappropriate. Therefore a
different approach is necessary to group the models G_k. Finding a different approach
is difficult since there are (CN)^T structures even in the simplified formulation discussed
above.
One clue comes from looking at the empirical results. Empirically, there are far fewer
realizations of the structures, k < (CN)^T. By keeping only those structures observed
more than twice (merging the singleton observations into a single cluster), significantly
fewer models are empirically observed. This led to a simple histogram method
to cluster the models G_k. In order to further reduce the number of models, the two
most frequently observed models are chosen (with the rest of the samples relabeled as a
third class).
Let the data set D consist of N sequences for J neural channels, D = {S_1^1, ..., S_N^J}, where
S_n^j = (O_1^j, ..., O_T^j) is a sequence of observables of length T.
Specifically, the full data set is used and all exemplars S train a single model.
Initially, the parameters are randomized and trained until they converge to an initial
estimate. Then the most likely state path is found for each training exemplar, and a
histogram is built from the resulting structures. The algorithm is as follows:
1. Train the DC-HMM λ on the full data set D as described in Chapter 3.

2. Compute the most probable path q_t = argmax_i [ζ_t^{(c)}(i)], 1 ≤ t ≤ T, on the state
trellis as described above for each sequence S.

3. Build a histogram over the resulting model structures.

4. Re-label the data set D based on the K class assignments and train new models.

5. Repeat from step 2 until a threshold is met or convergence.
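The steps above can be sketched in Python for a single discrete-observation chain. The forward recursion and the per-step argmax follow the style of Eq. 6–4; the multi-chain DC-HMM machinery, the retraining in steps 4–5, and the toy parameters are omitted or hypothetical:

```python
import numpy as np
from collections import Counter

def forward(pi, A, B, obs):
    """Forward variables alpha_t(i) for one discrete-observation chain,
    normalized per step to avoid underflow."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = (alpha[t-1] @ A) * B[:, obs[t]]
        alpha[t] /= alpha[t].sum()
    return alpha

def state_path(pi, A, B, obs):
    """Most likely state per time step (per-step argmax of the forward
    variable, in the spirit of Eq. 6-4; not a full Viterbi decode)."""
    return tuple(forward(pi, A, B, obs).argmax(axis=1))

def cluster_by_structure(pi, A, B, sequences, top_k=2):
    """Steps 2-4: histogram the state paths and keep the top_k structures;
    everything else is merged into a remainder class (label top_k)."""
    paths = [state_path(pi, A, B, s) for s in sequences]
    counts = Counter(paths)
    top = [p for p, _ in counts.most_common(top_k)]
    return [top.index(p) if p in top else top_k for p in paths]

# Hypothetical single-chain model: 2 states, 3 symbols.
pi = np.array([0.6, 0.4])
A  = np.array([[0.9, 0.1], [0.2, 0.8]])
B  = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
seqs = [[0, 0, 1], [2, 2, 1], [0, 0, 1], [0, 1, 0]]
labels = cluster_by_structure(pi, A, B, seqs, top_k=2)
```

Each sequence is reduced to its chain of most-likely states, and sequences sharing a frequent chain are grouped, which is the histogram-based merging described in step 3.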
6.1.2 Preliminary Results
Independent neural simulation data.
For this particular experiment, the number of clusters is k = 2. Empirically, the number
of hidden states was varied from 2 to 5, with the final selection being 3 hidden states for
the model. Based on prior results, a sequence length of 5 was chosen along with 2 to 6
rounds of clustering.
Figure 6-2 demonstrates the clustering results using the DC-HMM on the two-class
simulated data set. The figure clearly shows that the model is able to correctly cluster
the data. The classification result (based on prior labeling) is 82%, which is comparable
to the likelihood-based clustering in Chapter 5. Although the kinematics are shown below
the class labels, the clustering results are acquired solely from the input space.
Figure 6-3 shows the Viterbi paths for the DC-HMM on the simulated data set.
Interestingly, there are repeating patterns that correspond to the kinematics. Additionally,
the figure shows that pairs of neurons are actively involved in the state transitions in the
hidden state space. This is expected since the simulated neurons are paired for the two
classes.
Food grasping task.
Two results are to be observed from the following experiments. First, what type
of clustering result is obtained, i.e., are there repeating patterns corresponding to
the kinematics? Second, how does the trajectory reconstruction compare against the
supervised version? Classification is not computed since there are no known ground
truths for the input space (everything like angular bins is imposed by the user) and
unknown kinematic features may be represented in the neural data.

Figure 6-2. Hidden state transitions DC-HMM (simulation data, 2 classes)
To guard against spurious results, multiple Monte Carlo simulations were executed
to confirm the empirical findings. Additionally, there are many parameters to be initialized.
These include:
1. Observation sequence length (window size)
2. Number of states
3. Number of clustering rounds
4. Number of classes
The observation length was varied from 5 to 15 time bins (100 ms each), which
corresponds to 0.5 to 1.5 seconds. The number of hidden states was varied from 3 to 5.
Finally, the number of clustering iterations was varied from 4 to 10. After exhausting the
different combinations, the parameters were set as follows: an observation length of 5,
3 hidden states, and 6 clustering rounds (since less than 5% of the labels changed).

Figure 6-3. Hidden state transitions DC-HMM (simulation data, 2 classes)
Table 6-1 shows that the correlation coefficient on the unsupervised DC-HMM
model-based clustering reconstruction is better than using random labeling or a single
Wiener filter. Unfortunately, the result is only as good as the likelihood-based clustering.
One reason for this is that both clustering techniques rely on approximations. Figure 6-4
shows the histogram for the state models. Clearly, there are a reasonable number of
models found empirically, with one particular model observed most often (corresponding
to non-movement).

Table 6-1. Correlation coefficient using DC-HMM on 3D monkey data
Experiment                       CC
DC-HMM (model clustering)        .70±.21
DC-HMM (likelihood clustering)   .72±.19
Single Wiener                    .68±.19
NLMS                             .68±.20

Figure 6-5 shows some of the actual Viterbi paths taken by the model.
The y-axis represents the different models while the x-axis is the time bins (of length
five). Each color represents a particular state observed by a corresponding neuron
(similar colors represent the same neuron in the same state).
Figure 6-4. Histogram of state models for the DC-HMM (food grasping task)
Although the underlying approximations lead the final results to be similar,
the computed α's in Figure 6-6 show that the underlying structure and dependencies
can be discerned. This information may prove useful in a biological setting or in future
front-end analysis for different algorithms, and needs further exploration.
Cursor control.
For the cursor control monkey data, Table 6-2 shows that the model-based
clustering produces slightly worse results than the likelihood-based clustering. Both of
these clustering results are again better than the supervised versions of the non-linear
TDNN and the single linear Wiener filter.
Figure 6-5. State models for the DC-HMM (food grasping task)
Figure 6-6. Alphas computed per state per channel DC-HMM (food grasping task)
Another example is in Figure 6-7, which shows the α's computed across channels
and states, with the kinematic variable (the velocity difference between the two dimensions)
also plotted on the figure. Obviously, there are difficulties in characterizing the
underlying neural structure with respect to a kinematic structure. This is due to the
complexity of arm kinematics, which is executed with millions of muscle fibers.
Figure 6-7. Alphas across state-space of the DC-HMM (cursor control task)
Table 6-2. Correlation coefficient using DC-HMM on 2D monkey data
Experiment CC(X) CC(Y)
DC-HMM model clustering .65±.09 .61±.15
DC-HMM likelihood clustering .71±.09 .65±.13
LM-HMM clustering .77±.07 .66±.13
Single Wiener .66±.02 .48±.10
NLMS .68±.03 .50±.08
TDNN .65±.03 .51±.08
Although clustering the model parameters yielded results similar to clustering
in the data space, there is an added benefit to observing dependent structures. This
may prove beneficial from a biological perspective and yield better modeling in the
future as a front-end system.
Additionally, there is much to be improved over the simple naive k-means clustering
of the models. More advanced techniques could also be used to find probable structure
paths in the path optimization stage of the algorithm. The single path requirement could
also be relaxed to provide richer models.
Interestingly, the difference between this model-based clustering and the likelihood
version is that clustering occurs dynamically within the model as it is trained with each
new sequence. With the likelihood version, clustering instead occurs after each round,
since the models are updated afterwards. This may lead to a dynamic adaptation
of the structures as new sequences are acquired after an initial clustering on a training
set.
6.2 Contributions
The following list summarizes the contributions of this dissertation:

• Showed that changing the BMI paradigm by partitioning the input is beneficial, through the use of a bi-model structure and kinematic reconstruction (as well as simulations)

• Developed a method to determine the importance of individual neurons in model performance with respect to a graphical modeling framework (BM-HMM neural selections)

• Developed a simple yet powerful method based on boosting and competitive training that creates weak dependencies between neural channels and allows for reduction of the input space

• Demonstrated the graphical model's ability to capture neurons that decrease firing during movement as just a particular state

• Developed a hierarchical modeling methodology that modeled stronger dependencies between neural channels without dramatically increasing computational complexity on a reduced neural subset

• Demonstrated powerful simulations that could isolate the strengths and weaknesses of the different models

• Developed clustering methods that segmented the input space using log-likelihood as a distance metric, using the model as a centroid

• Developed a simple clustering method based on the model space (i.e., hidden state transitions) and a simple search-and-score algorithm
APPENDIX A
WIENER FILTER
The Wiener-Hopf solution is used to estimate the weight matrix for the Wiener filter
W_Wiener = R^{-1} P       (A–1)

where R is the correlation matrix of the neural spike inputs with dimension (L·M) x (L·M),
R = [ r11  r12  ..  r1M
      r21  r22  ..  r2M
      ..   ..   ..  ..
      rM1  rM2  ..  rMM ]
and rij is the LxL cross-correlation matrix between neurons i and j (i ≠ j), and rii
is the LxL autocorrelation matrix of neuron i. P is the (L·M)xC cross-correlation matrix
between the neuronal bin count and hand position as
P = [ p11  ..  p1C
      p21  ..  p2C
      ..   ..  ..
      pM1  ..  pMC ]
where pic is the cross-correlation vector between neuron i and the c-coordinate
of hand position. Given the assumption that the error is white Gaussian and the data is
stationary, the estimated weights W_Wiener are found to be optimal.
Essentially, y = W_Wiener^T x minimizes the mean square error (MSE) cost function

J = E[||e||²],   e = d − y       (A–2)
Each sub-block matrix rij can be decomposed as

rij = [ rij(0)    rij(1)    ..  rij(L−1)
        rij(−1)   rij(0)    ..  rij(L−2)
        ..        ..        ..  ..
        rij(1−L)  rij(2−L)  ..  rij(0) ]
where rij(τ) represents the correlation between neurons i and j at time lag
τ. These correlations, which are the second-order moments of the discrete-time random
processes xi(m) and xj(k), are functions of the time difference (m − k) under
the assumption of wide-sense stationarity (m and k denote discrete time instances for
each process) [59, 19]. In this case, the estimate of the correlation between two neurons,
rij(m − k), can be obtained by
rij(m− k), can be obtained by
rij(m− k) = E[xi(m)xj(k) ≈ 1
N − 1
N∑n=1
xi(n−m)xj(n− k),∀i, j ∈ (1, ..., M) (A–3)
The cross-correlation vector pic can be decomposed and estimated in the same
way. rij(τ) is estimated using equation A–3 from the neuronal bin count data, with xi(n)
and xj(n) being the bin counts of neurons i and j respectively. From equation A–3, it can
be seen that rij(t) is equal to rji(−t).
Figure A-1. Topology of the linear filter for three output variables
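As a minimal numerical sketch of the Wiener-Hopf estimate above (Eqs. A–1 and A–3), the following Python snippet builds a lagged design matrix from binned counts X, estimates R and P, and solves for W; the toy data, the delay relationship, and the function name are hypothetical:

```python
import numpy as np

def wiener_weights(X, D, L):
    """Wiener-Hopf solution W = R^{-1} P estimated from data.
    X: (T, M) binned neural counts, D: (T, C) hand positions, L: number of lags.
    The lagged design matrix stacks L delayed copies of each channel; the
    scaling of R and P by the sample count cancels in the solve."""
    Z = np.hstack([np.roll(X, lag, axis=0) for lag in range(L)])[L - 1:]
    Dt = D[L - 1:]
    R = Z.T @ Z            # (L*M, L*M) correlation matrix estimate
    P = Z.T @ Dt           # (L*M, C) cross-correlation estimate
    return np.linalg.solve(R, P)

# Hypothetical toy data: 2 channels; the output is channel 0 delayed by 1 bin,
# so an exact linear solution exists.
rng = np.random.default_rng(0)
X = rng.random((200, 2))
D = np.roll(X[:, [0]], 1, axis=0)
W = wiener_weights(X, D, L=3)
Z = np.hstack([np.roll(X, lag, axis=0) for lag in range(3)])[2:]
err = float(np.max(np.abs(Z @ W - D[2:])))   # near zero for this toy case
```

Because the toy output is an exact lagged copy of one input channel, the recovered weights reproduce it almost perfectly, illustrating why the Wiener filter serves as the standard baseline in the reconstruction tables.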
APPENDIX B
PARTIAL DERIVATIVES
To simplify notation, let

δ_{x,y} = 1 if x = y, 0 if x ≠ y,   and   z_{ijt}^{(c',c)} = a_{ij}^{(c',c)} b_j^{(c)}(o_t^{(c)}).

Let w = (π, A, B, Θ) be a parameter vector. Then

α_t^{(c)}(j) = π_j^{(c)} b_j^{(c)}(o_1^{(c)}),   t = 1
α_t^{(c)}(j) = Σ_{c'} Σ_i z_{ijt}^{(c',c)} α_{t-1}^{(c')}(i),   2 ≤ t ≤ T       (B–1)

∂P/∂w = Σ_c (P/P^{(c)}) (∂P^{(c)}/∂w) = Σ_c (P/P^{(c)}) Σ_{j=1}^{N} ∂α_T^{(c)}(j)/∂w       (B–2)

Using

∂α_t^{(c)}(j)/∂w = Σ_{c'} Σ_i ( (∂z_{ijt}^{(c',c)}/∂w) α_{t-1}^{(c')}(i) + z_{ijt}^{(c',c)} (∂α_{t-1}^{(c')}(i)/∂w) ),   2 ≤ t ≤ T       (B–3)

produces the first-order derivatives of α_t^{(c)}(j) with respect to each type of parameter
as follows:

∂α_t^{(c)}(j)/∂π_i^{(c1)} = δ_{i,j} δ_{c,c1} b_j^{(c1)}(o_1^{(c1)}),   t = 1
∂α_t^{(c)}(j)/∂π_i^{(c1)} = Σ_{c'=1}^{C} Σ_{k=1}^{N^{(c')}} z_{kjt}^{(c',c)} (∂α_{t-1}^{(c')}(k)/∂π_i^{(c1)}),   2 ≤ t ≤ T       (B–4)

∂α_t^{(c)}(j)/∂a_{i1,j1}^{(c1,c2)} = 0,   t = 1
∂α_t^{(c)}(j)/∂a_{i1,j1}^{(c1,c2)} = δ_{c,c2} δ_{j,j1} Θ_{c1,c2} b_{j1}^{(c2)}(o_t^{(c2)}) α_{t-1}^{(c1)}(i1)
    + Σ_{c'} Σ_i z_{ijt}^{(c',c)} (∂α_{t-1}^{(c')}(i)/∂a_{i1,j1}^{(c1,c2)}),   2 ≤ t ≤ T       (B–5)

∂α_t^{(c)}(j)/∂b_{j1}^{(c1)}(k) = δ_{o_1^{(c)},k} δ_{c,c1} δ_{j,j1} π_{j1}^{(c1)},   t = 1
∂α_t^{(c)}(j)/∂b_{j1}^{(c1)}(k) = Σ_{c'} Σ_i ( δ_{o_t^{(c)},k} δ_{c,c1} δ_{j,j1} Θ_{c',c1} a_{i,j1}^{(c',c1)} α_{t-1}^{(c')}(i)
    + z_{ijt}^{(c',c)} (∂α_{t-1}^{(c')}(i)/∂b_{j1}^{(c1)}(k)) ),   2 ≤ t ≤ T       (B–6)

∂α_t^{(c)}(j)/∂Θ_{c1,c2} = 0,   t = 1
∂α_t^{(c)}(j)/∂Θ_{c1,c2} = δ_{c,c2} Σ_i a_{ij}^{(c1,c2)} b_j^{(c2)}(o_t^{(c2)}) α_{t-1}^{(c1)}(i)
    + Σ_{c'} Σ_i z_{ijt}^{(c',c)} (∂α_{t-1}^{(c')}(i)/∂Θ_{c1,c2}),   2 ≤ t ≤ T       (B–7)
APPENDIX C
SELF ORGANIZING MAP
The SOM is most related to the idea of soft competition. The weights and inputs of
the model form a mapping between one another. Essentially, the weight vector of the
PE that is closest to the present input wins the competition. But with the SOM, the
neighbors of the winning PE also have their weights updated according to the competitive
rule

w_ij(n + 1) = w_ij(n) + η y_i(n) (x_j(n) − w_ij(n))       (C–1)
The lateral inhibition network is assumed to produce a Gaussian distribution centered
at the winning PE. This allows the algorithm to just find the winning PE and assume the
other PEs have an activity proportional to the Gaussian function evaluated at each PE’s
distance from the winner. The SOM competitive rule becomes
w_i(n + 1) = w_i(n) + Δ_{i,i*}(n) η(n) (x(n) − w_i(n))       (C–2)
where the ∆ function is a neighborhood function centered at the winning PE. During
each iteration the neighborhood function and step size change. The neighborhood
function ∆ is a Gaussian for the experiments described in the dissertation:
Δ_{i,i*}(n) = exp( −d²_{i,i*} / (2σ²(n)) )       (C–3)
with a variance that decreases with each iteration. At first, the full map is almost
covered; then at each iteration the variance shrinks toward a neighborhood of zero, finally
allowing only the winning PE to be updated. A linear decrease in neighborhood radius is
specified by
σ(n) = σ0(1− n/N0) (C–4)
Note that for the winning PE the adaptation rule defaults to the competitive update
wi∗(n + 1) = wi∗(n) + η(x(n)− wi∗(n)) (C–5)
The updates for the neighbors are reduced exponentially by the distance to the
winning PE. The network moves from a soft competition to hard competition as the
neighborhood shrinks.
Since the winning PE and its neighbors are updated at each step, the winner and
all of its neighbors move toward the same position, although the neighbors move more
slowly as their distance from the winning PE increases. Over time, this organizes the
PEs so that neighboring PEs (in the SOM output space) share the representation of the
same area in the input space (are neighbors in the input space), regardless of their initial
locations.
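A compact Python sketch of the SOM updates described above (Eqs. C–2 through C–4 and C–6); the 3x3 map size, the random 2-D inputs, and the schedule constants below are hypothetical:

```python
import numpy as np

def som_step(W, grid, x, eta, sigma):
    """One SOM update: find the winning PE, then pull every PE toward x,
    weighted by a Gaussian neighborhood around the winner (Eqs. C-2, C-3)."""
    winner = np.argmin(np.linalg.norm(W - x, axis=1))
    d2 = np.sum((grid - grid[winner]) ** 2, axis=1)  # output-space distances
    h = np.exp(-d2 / (2 * sigma ** 2))               # neighborhood function
    return W + eta * h[:, None] * (x - W)

# Hypothetical 3x3 map with 2-D inputs in [0, 1).
grid = np.array([[i, j] for i in range(3) for j in range(3)], dtype=float)
rng = np.random.default_rng(1)
W = rng.random((9, 2))
sigma0, eta0, N0 = 2.0, 0.3, 100
for n in range(N0):
    x = rng.random(2)
    sigma = max(sigma0 * (1 - n / N0), 1e-3)   # Eq. C-4 linear radius decay
    eta = eta0 * (1 - n / (N0 + 50))           # Eq. C-6 style rate schedule
    W = som_step(W, grid, x, eta, sigma)
```

As the radius shrinks, the update collapses to the hard-competition rule of Eq. C–5, matching the soft-to-hard transition described in the text.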
Figure C-1. Self-Organizing-Map architecture with 2D output
There are two phases in SOM learning. The first phase deals with the initial
ordering of the weights. During this phase the neighborhood function starts large,
covering the full output space to allow PEs that respond to similar inputs to be brought
together. The learning rate is also set to a large value (greater than 0.1) to allow the
network to self-organize. The scheduling of η is normally also linear
η(n) = η0(1 − n/(N + K))       (C–6)
where η0 is the initial learning rate and K helps specify the final learning rate.
The second phase of learning is called the convergence phase. In this longer phase
of the SOM, the learning rate is set to a smaller value (0.01) while using the smallest
neighborhood (just the PE or its nearest neighbors). This achieves a fine-tuning
of the weights. As with determining the number of clusters in most clustering
models, choosing the number of PEs is done empirically. The amount of training
time and the accuracy are balanced against how many PEs are chosen for the SOM.
REFERENCES
[1] Kamil, A. C. (2004). Sociality and the evolution of intelligence. Trends in CognitiveScience, 8, 195-197.
[2] David, M. (2002). The sociological critique of evolutionary psychology: Beyondmass modularity. New Genetics and Society, 21, 303-313.
[3] Walter J. Freeman, Mass Action in the Nervous System, University of California,Berkeley, USA
[4] J. Wessberg, C. R. Stambaugh, J. D. Kralik, P. D. Beck, M. Laubach, J. K. Chapin,J. Kim, S. J. Biggs, M. A. Srinivasan, and M. Nicolelis et al., “Real-time predictionof hand trajectory by ensembles of cortical neurons in primates,” Nature, Vol. 408,pp. 361-365, 2000.
[5] Nobunga, A.I., Go, B.K., Karunas, R.B. (1999) Recent demographic and injurytrends in people served by the model spinal cord injury care systems. Arch. Phys.Med. Rehabil., 80, pp. 1372-1382.
[6] A. B. Schwartz, D. M. Taylor, and S. I. H. Tillery, “Extraction algorithms for corticalcontrol of arm prosthetics,” Current Opinion in Neurobiology, Vol. 11, pp. 701-708,2001.
[7] E. C. Leuthardt, G. Schalk, D. Moran, and J. G. Ojemann, ”The emerging world ofmotor neuroprosthetics: A neurosurgical perspective,” Neurosurgery, vol. 59, pp.1-13, Jul 2006.
[8] M. A. L. Nicolelis, D. F. Dimitrov, J. M. Carmena, R. E. Crist, G. Lehew, J. D.Kralik, and S. P. Wise, “Chronic, multisite, multielectrode recordings in macaquemonkeys,” PNAS, Vol. 100, No. 19, pp. 11041 - 11046, 2003.
[9] A. P. Georgopoulos, J. T. Lurito, M. Petrides, A. B. Schwartz, and J. T. MasseyMental rotation of the neuronal population vector, Science 13 January 1989: Vol.243. no. 4888, pp. 234 - 236
[10] M.A. Lebedev, and M. A. L. Nicolelis, “Brain-machine interfaces: past, present andfuture,” Trends Neurosci 29, Vol 18, pp. 536-546, 2006
[11] Donoghue, J.P. (2002) Connecting cortex to machines: recent advances in braininterfaces. Nature Neurosci. Suppl., 5, pp. 1085-1088.
[12] M. A. L. Nicolelis, D.F. Dimitrov, J.M. Carmena, R.E. Crist, G Lehew, J. D.Kralik, and S.P. Wise, ”Chronic, multisite, multielectrode recordings in macaquemonkeys,” PNAS, vol. 100, no. 19, pp. 11041 - 11046, 2003.
[13] E. Todorov, On the role of primary motor cortex in arm movement control,InProgress in Motor Control III, ch 6, pp 125-166, Latash and Levin (eds), HumanKinetics .
153
[14] A. Georgopoulos, J. Kalaska, R. Caminiti, and J. Massey, ”On the relationsbetween the direction of two-dimensional arm movements and cell discharge inprimate motor cortex.,” Journal of Neuroscience, vol. 2, pp. 1527-1537, 1982.
[15] F. Wood, Prabhat, J. P. Donoghue, and M. J. Black. Inferring attentional state andkinematics from motor cortical firing rates . In Proceedings of the 27th Conferenceon IEEE Engineering Medicine Biologicial System
[16] S. Darmanjian, S. P. Kim, M. C. Nechyba, S. Morrison, J. Principe, J. Wessberg,and M. A. L. Nicolelis, “Bimodel Brain-Machine Interface for Motor Control ofRobotic Prosthetic,” IEEE Int. Conf. on Intelligent Robots and Systems, pp.112-116, 2003.
[17] J. C. Sanchez, J. C. Principe, and P. R. Carney, ”Is Neuron DiscriminationPreprocessing Necessary for Linear and Nonlinear Brain Machine InterfaceModels?,” accepted to 11th International Conference on Human-ComputerInteraction, vol. 5, pp. 1-5, 2005.
[18] J. DiGiovanna, J. C. Sanchez, and J. C. Principe, ”Improved Linear BMI Systemsvia Population Averaging,” presented at IEEE International Conference of theEngineering in Medicine and Biology Society, New York, pp. 1608-1611, 2006.
[19] S. P. Kim, J. C. Sanchez, D. Erdogmus, Y. N. Rao, J. C. Principe, and M. A.L.Nicolelis, “Divide-and-conquer Approach for Brain-Machine Interfaces: NonlinearMixture of Competitive Linear Models,” Neural Networks, Vol. 16, pp. 865-871,2003.
[20] J.C. Sanchez, S.-P. Kim, D. Erdogmus, Y.N. Rao, J.C. Principe, J. Wessberg,and M. Nicolelis, “Input-Output Mapping Performance of Linear and NonlinearModels for Estimating Hand Trajectories from Cortical Neuronal Firing Patterns,”International Workshop for Neural Network Signal Processing, pp. 139-148, 2002.
[21] Wu, W., Black, M. J., Gao, Y., Bienenstock, E., Serruya, M., Shaikhouni, A., andDonoghue, J. P. (2003). Neural decoding of cursor motion using a Kalman filter.Advances in Neural Information Processing Systems 15 (pp. 133-140). MIT Press.
[22] L. R. Hochberg, M. D. Serruya, G. M. Friehs, J. A. Mukand, M. Saleh, A. H.Caplan, A. Branner, D. Chen, R. D. Penn, and J. P. Donoghue, ”Neuronalensemble control of prosthetic devices by a human with tetraplegia,” Nature,vol. 442, pp. 164-171, 2006.
[23] J. DiGiovanna, B. Mahmoudi, J. Fortes, J. C. Principe, and J. C. Sanchez,”Co-adaptive Brain-Machine Interface via Reinforcement Learning,” IEEETransactions on Biomedical Engineering, in press, 2008.
[24] Jing Hu, Jennie Si, Byron Olson, Jiping He. ”A support vector brain-machineinterface for cortical control of directions.” The First IEEE/RAS-EMBS International
154
Conference on Biomedical Robotics and Biomechatronics. Pisa, Italy. February20-22, 2006, pp. 893- 898.
[25] S. Darmanjian, S. P. Kim, M. C. Nechyba, J. Principe, J. Wessberg, and M.A. L. Nicolelis, “Bimodel Brain-Machine Interface for Motor Control of RoboticProsthetic,” IEEE Machine Learning For Signal Processing, pp. 379-384, 2006.
[26] B. M. Yu, C. Kemere, G. Santhanam, A. Afshar, S. I. Ryu, T. H. Meng, M.Sahani, K. V. Shenoy, (2007) Mixture of trajectory models for neural decodingof goal-directed movements. Journal of Neurophysiology. 97:3763-3780.
[27] S. Darmanjian and J. Principe, “Boosted and Linked Mixtures of HMMs forBrain-Machine Interfaces,” EURASIP Journal on Advances in Signal Processing,vol. 2008, Article ID 216453, 12 pages doi:10.1155/2008/216453
[28] S. Darmanjian, A. R. C. Paiva, J. C. Principe, M. C. Nechyba, J. Wessberg, M. A.L. Nicolelis, and J. C. Sanchez, “Hierarchal decomposition of neural data usingboosted mixtures of independently coupled hidden markov chains,” InternationalJoint Conference on Neural Networks, pp. 89-93, 2007.
[29] G. Radons, J. D. Becker, B. Dulfer, J. Kruger. Analysis, classification, andcoding of multielectrode spike trains with hidden Markov models. Biol Cybern71: 359-373, 1994.
[30] I. Gat. Unsupervised learning of cell activities in the associative cortex of behavingmonkeys, using hidden Markov models. Master thesis, Hebrew Univ. Jerusalem(1994).
[31] C. Kemere, G. Santhanam, B. M. Yu, A. Afshar, S. I. Ryu, T. H. Meng, K. V.Shenoy (2008) Detecting neural state transitions using hidden Markov models formotor cortical prostheses. Journal of Neurophysiology. 100:2441-2452
[32] B. H. Juang, and L. R. Rabiner, “Issues in using Hidden Markov models for speechrecognition,” Advances in speech signal processing, edited by S. Furui and M.M.Sondhi, Marcel Dekker, inc., pp. 509-553, 1992
[33] Krogh, A. (1997) Two methods for improving performance of a HMM and theirapplication for gene finding In Gaasterland, T., Karp, P., Karplus, K., Ouzounis, C.,Sander, C., and Valencia, A. (Eds.), Proc. of Fifth Int. Conf. on Intelligent Systemsfor Molecular Biology pp. 179186 Menlo Park, CA. AAAI Press
[34] I. Cadez and P. Smyth, “Probabilistic Clustering using Hierarchical Models,”Technical Report No. 99-16 Department of Information and Computer ScienceUniversity of California, Irvine
[35] L. Goncalves, E. D. Bernardo and P. Perona, Movemes for Modeling BiologicalMotion Perception Book Series Theory and Decision Library Volume Volume 38Book Seeing, Thinking and Knowing Publisher Springer Netherlands
[36] M. I. Jordan, Learning in Graphical Models, MIT Press, 1999.
[37] N. G. Hatsopoulos, Q. Xu, and Y. Amit, "Encoding of Movement Fragments in the Motor Cortex," J. Neurosci., vol. 27, no. 19, pp. 5105-5114, 2007.
[38] X. Huang, A. Acero, H. W. Hon, and R. Reddy, Spoken Language Processing: A Guide to Theory, Algorithm and System Development, Prentice Hall Inc., Englewood Cliffs, NJ, 2001.
[39] R. B. Northrop, Introduction to Dynamic Modeling of Neurosensory Systems, CRC Press, Boca Raton, FL, 2001.
[40] R. P. Rao, "Bayesian computation in recurrent neural circuits," Neural Computation, vol. 16, no. 1, pp. 1-38, 2004.
[41] M. I. Jordan, Z. Ghahramani, and L. K. Saul, "Hidden Markov decision trees," in M. C. Mozer, M. I. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9, MIT Press, vol. 9, 1997.
[42] L. E. Baum, T. Petrie, G. Soules, and N. Weiss, "A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains," Ann. Mathematical Statistics, vol. 41, no. 1, pp. 164-171, 1970.
[43] S. Zhong and J. Ghosh, "HMMs and coupled HMMs for multi-channel EEG classification," in Proc. IEEE Int. Joint Conf. Neural Networks, pp. 1154-1159, May 2002.
[44] M. Brand, "Coupled hidden Markov models for modeling interacting processes," Technical Report 405, MIT Media Lab Perceptual Computing, 1997.
[45] J. Yang, Y. Xu, and C. S. Chen, "Human Action Learning Via Hidden Markov Model," IEEE Trans. Systems, Man and Cybernetics, Part A, vol. 27, no. 1, pp. 34-44, 1997.
[46] J. Kwon and K. Murphy, "Modeling freeway traffic with coupled HMMs," Technical report, University of California at Berkeley, May 2000.
[47] T. T. Kristjansson, B. J. Frey, and T. Huang, "Event-coupled hidden Markov models," in Proc. IEEE Int. Conf. on Multimedia and Exposition, vol. 1, pp. 385-388, 2000.
[48] Z. Ghahramani and M. I. Jordan, "Factorial Hidden Markov Models," Machine Learning, vol. 29, pp. 245-275, 1997.
[49] Y. Bengio and P. Frasconi, "Input-Output HMMs for sequence processing," IEEE Trans. Neural Networks, vol. 7, no. 5, pp. 1231-1249, September 1996.
[50] Y. Linde, A. Buzo, and R. M. Gray, "An Algorithm for Vector Quantizer Design," IEEE Trans. Communications, vol. COM-28, no. 1, pp. 84-95, 1980.
[51] W. T. Thach, "Correlation of neural discharge with pattern and force of muscular activity, joint position, and direction of intended next movement in motor cortex and cerebellum," Journal of Neurophysiology, vol. 41, pp. 654-676, 1978.
[52] R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee, "Boosting the margin: a new explanation for the effectiveness of voting methods," in Proc. 14th International Conference on Machine Learning, pp. 322-330, 1997.
[53] R. E. Schapire, "The strength of weak learnability," Machine Learning, vol. 5, pp. 197-227, 1990.
[54] Y. Freund and R. E. Schapire, "Experiments with a new boosting algorithm," in Machine Learning: Proceedings of the Thirteenth International Conference, pp. 148-156, 1996.
[55] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, "Adaptive mixtures of local experts," Neural Computation, vol. 3, pp. 79-87, 1991.
[56] R. Avnimelech and N. Intrator, "Boosted mixture of experts: an ensemble learning scheme," Neural Computation, vol. 11, no. 2, pp. 483-497, 1999.
[57] D. C. Knill and A. Pouget, "The Bayesian brain: the role of uncertainty in neural coding and computation," Trends in Neurosciences, vol. 27, no. 12, December 2004.
[58] G. Dornhege, Increasing Information Transfer Rates for Brain-Computer Interfacing, Ph.D. thesis, University of Potsdam, Germany.
[59] S. Haykin, Adaptive Filter Theory, Prentice Hall, Upper Saddle River, NJ, 1996.
[60] S. Haykin, Neural Networks: A Comprehensive Foundation, Macmillan, New York, NY, 1996.
[61] M. Meila and M. I. Jordan, "Learning fine motion by Markov mixtures of experts," in D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, editors, Advances in Neural Information Processing Systems 8, MIT Press, vol. 8, 1996.
[62] J. P. Donoghue and S. P. Wise, "The motor cortex of the rat: cytoarchitecture and microstimulation mapping," J. Comp. Neurol., vol. 212, pp. 76-88, 1982.
[63] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees, Wadsworth International Group, Belmont, CA, 1984.
[64] Y. Sun and J. Li, "Iterative RELIEF for feature weighting," in Proc. 23rd International Conference on Machine Learning, ACM Press, pp. 913-920, 2006.
[65] M. A. Lebedev, J. M. Carmena, J. E. O'Doherty, M. Zacksenhouse, C. S. Henriquez, J. C. Principe, and M. A. L. Nicolelis, "Cortical ensemble adaptation to represent actuators controlled by a brain machine interface," J. Neurosci., vol. 25, pp. 4681-4693, 2005.
[66] L. E. Baum and G. R. Sell, "Growth transformations for functions on manifolds," Pacific Journal of Mathematics, pp. 211-227, 1968.
[67] J. C. Sanchez, J. C. Principe, and P. R. Carney, "Is Neuron Discrimination Preprocessing Necessary for Linear and Nonlinear Brain Machine Interface Models," 11th International Conference on Human-Computer Interaction, 2005.
[68] J. Kittler, M. Hatef, R. Duin, and J. Matas, "On combining classifiers," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 20, no. 3, pp. 226-239, 1998.
[69] M. Martinez-Ramon, V. Koltchinskii, G. L. Heileman, and S. Posse, "MRI Pattern Classification Using Neuroanatomically Constrained Boosting," NeuroImage, vol. 31, no. 3, pp. 1129-1141, 2006.
[70] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," CVPR, vol. 1, pp. 511-518, 2001.
[71] A. Bastian, G. Schoner, and A. Riehle, "Preshaping and continuous evolution of motor cortical representations during movement preparation," European Journal of Neuroscience, vol. 18, no. 7, pp. 2047-2058, 2003.
[72] M. C. Nechyba, Learning and Validation of Human Control Strategies, CMU-RI-TR-98-06, Ph.D. thesis, The Robotics Institute, Carnegie Mellon University, 1998.
[73] T. Kohonen, "Self-Organized Formation of Topologically Correct Feature Maps," Biological Cybernetics, vol. 43, pp. 59-69, 1982.
[74] A. Y. Ng, M. I. Jordan, and Y. Weiss, "On spectral clustering: analysis and an algorithm," in T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, pp. 849-856, MIT Press, 2002.
[75] S. Zhong and J. Ghosh, "A Unified Framework for Model-based Clustering," Journal of Machine Learning Research, vol. 4, pp. 1001-1037, November 2003.
[76] E. P. Simoncelli, L. Paninski, J. Pillow, and O. Schwartz, "Characterization of neural responses with stochastic stimuli," in The New Cognitive Neurosciences, 3rd edition, MIT Press, 2004.
[77] Y. Wang, J. Sanchez, and J. C. Principe, "Information Theoretical Estimators of Tuning Depth and Time Delay for Motor Cortex Neurons," in Proc. 3rd International IEEE/EMBS Conference on Neural Engineering, pp. 502-505, 2007.
[78] S. Darmanjian, "Generative Neural Structure Clustering for Brain Machine Interface," Ph.D. proposal, University of Florida, Gainesville, FL, 2008.
[79] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul, "An introduction to variational methods for graphical models," in Learning in Graphical Models, MIT Press, 1998.
[80] J. Alon, S. Sclaroff, G. Kollios, and V. Pavlovic, "Discovering clusters in motion time-series data," in Proc. 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. I-375-I-381, 2003.
[81] A. Ypma and T. Heskes, "Categorization of web pages and user clustering with mixtures of hidden Markov models," in Proceedings of the International Workshop on Web Knowledge Discovery and Data Mining, Edmonton, Canada, pp. 31-43, 2002.
[82] G. S. Fishman, Monte Carlo: Concepts, Algorithms, and Applications, Springer-Verlag, 1995.
BIOGRAPHICAL SKETCH
Shalom Darmanjian graduated from the University of Florida with a Bachelor of
Science in Computer Engineering in December 2003. After completing his master's degree in
May 2005, Shalom continued the pursuit of knowledge and moved to the CNEL lab for
the Ph.D. program during the fall of 2005. He received his Ph.D. from the University of
Florida in the fall of 2009. Shalom hopes to continue doing his small part in improving
the world.