1 R. Rao, 599E: Lecture 4
CSE 599E
Lecture 4:
Signal Processing and
Machine Learning for BCI
2 R. Rao, 599E: Lecture 4
What’s on the menu?
• Signal processing
  - Spike sorting
  - Frequency analysis
  - Wavelets
  - Time-domain analysis
  - Bayesian filtering: Kalman and particle filtering
• Machine learning
  - Regression: linear regression, neural networks
  - Classification: LDA, SVM
  - Cross-validation
3 R. Rao, 599E: Lecture 4
Components of Brain-Computer Interfacing
[Figure: block diagram of the components of a BCI, with the "Classification or Regression" stage highlighted.]
4 R. Rao, 599E: Lecture 4
Feature Extraction and Signal Processing
5 R. Rao, 599E: Lecture 4
Invasive Recording: Spike Sorting
[Figure: two simple spike-sorting approaches: sorting by amplitude, and sorting using window discriminators.]
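As a concrete illustration of the amplitude approach, here is a minimal threshold-crossing detector in Python (NumPy); the sampling rate, noise level, and threshold are made up, and real spike sorting would add waveform windows and clustering.

```python
import numpy as np

# A minimal sketch of amplitude-based spike detection: flag the samples
# where the voltage first crosses a chosen threshold (all values made up).
def detect_spikes(trace, threshold):
    above = trace > threshold
    # Rising edges: below threshold at t-1, above at t
    return np.flatnonzero(~above[:-1] & above[1:]) + 1

fs = 30000                                  # hypothetical 30 kHz recording
trace = np.random.randn(fs) * 5.0           # one second of background noise
trace[[5000, 12000, 20000]] += 60.0         # three injected "spikes"
print(detect_spikes(trace, threshold=30.0) / fs)  # spike times in seconds
```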
6 R. Rao, 599E: Lecture 4
Signal Processing for Semi-Invasive and Non-Invasive Recordings
7 R. Rao, 599E: Lecture 4
Example: ECoG Signal
[Figure: ECoG traces during a rest period and during hand movement. With movement, the slow oscillations have disappeared and faster fluctuations are apparent.]
8 R. Rao, 599E: Lecture 4
Decomposing a signal into components:
Frequency domain analysis
[Figure: a recorded signal decomposed into a sum of sinusoidal components of increasing frequency.]
9 R. Rao, 599E: Lecture 4
Example of Frequency Domain (Fourier) Analysis
[Figure: the same signal plotted as a function of time and as a function of frequency.]
10 R. Rao, 599E: Lecture 4
Fourier Analysis
• Express any periodic signal as an infinite sum of sines and cosines
• Obtain each "Fourier coefficient" via correlation with the signal:

$s(t) = \frac{a_0}{2} + a_1\cos(\omega t) + b_1\sin(\omega t) + a_2\cos(2\omega t) + b_2\sin(2\omega t) + \cdots = \frac{a_0}{2} + \sum_{n=1}^{\infty}\left[a_n\cos(n\omega t) + b_n\sin(n\omega t)\right]$

$a_n = \frac{2}{T}\int_{-T/2}^{T/2} s(t)\cos(n\omega t)\,dt \qquad b_n = \frac{2}{T}\int_{-T/2}^{T/2} s(t)\sin(n\omega t)\,dt$
11 R. Rao, 599E: Lecture 4
Amplitude and Power Spectra
• Amplitude spectrum: for $s(t) = \frac{a_0}{2} + a_1\cos(\omega t) + b_1\sin(\omega t) + a_2\cos(2\omega t) + b_2\sin(2\omega t) + \cdots$

$A(n) = \frac{1}{2}\sqrt{a_n^2 + b_n^2}$

• Power spectrum:

$P(n) = A(n)^2 = \frac{1}{4}\left(a_n^2 + b_n^2\right)$

• Fast Fourier Transform (FFT): an efficient way of computing the coefficients and the power spectrum. For a signal with T time points, it runs in time approximately T log T.
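As a concrete illustration, here is a minimal sketch of FFT-based power-spectrum estimation in Python (NumPy); the sampling rate, test signal, and scaling convention are made up for the example.

```python
import numpy as np

# A minimal power-spectrum sketch, assuming a 1 kHz sampling rate
# and a made-up two-component test signal.
fs = 1000.0
t = np.arange(0, 2.0, 1 / fs)
sig = np.sin(2 * np.pi * 10 * t) + 0.5 * np.sin(2 * np.pi * 80 * t)

coeffs = np.fft.rfft(sig)                  # complex Fourier coefficients
freqs = np.fft.rfftfreq(len(sig), 1 / fs)  # frequency of each coefficient
power = np.abs(coeffs) ** 2 / len(sig)     # power spectrum (one common scaling)

# power peaks near 10 Hz and 80 Hz, the two components of the signal
```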
12 R. Rao, 599E: Lecture 4
ECoG Signal Power Spectrum
Low- and High-Frequency Band
(LFB/HFB) “features” can be used to
detect hand movement
13 R. Rao, 599E: Lecture 4
Problem with Fourier Analysis
• Sines and cosines occupy an infinite temporal extent
• The Fourier transform therefore does a poor job of representing finite and non-periodic signals, and signals with sharp peaks and discontinuities
14 R. Rao, 599E: Lecture 4
Wavelets
• Represent signals as linear combinations of finite basis functions called wavelets
• Each wavelet is a scaled and translated copy of a single "mother wavelet"
• Allows multi-resolution analysis

[Figure: example mother wavelets.]
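As an illustration, here is a minimal multi-resolution decomposition sketch using PyWavelets (assumed installed); the wavelet choice ('db4'), the number of levels, and the signal are made up for the example.

```python
import numpy as np
import pywt  # PyWavelets; assumed available

# A minimal sketch of multi-resolution wavelet decomposition on a
# made-up one-second "EEG" signal sampled at 256 Hz.
fs = 256
sig = np.random.randn(fs)  # stand-in for one EEG channel

# wavedec returns [approximation, detail_level4, ..., detail_level1]:
# coarse (slow) structure plus progressively finer detail coefficients.
coeffs = pywt.wavedec(sig, 'db4', level=4)
for i, c in enumerate(coeffs):
    print(f"band {i}: {len(c)} coefficients")
```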
15 R. Rao, 599E: Lecture 4
Example: Wavelet Decomposition of EEG
(Hinterberger et al., 2003)
16 R. Rao, 599E: Lecture 4
Autoregressive (AR) Models
• Natural signals tend to be correlated over time
• Predict the next measurement based on the past few values:

$x_t = \sum_{i=1}^{p} a_i x_{t-i} + \text{noise}$

• An Adaptive Autoregressive (AAR) model uses a time-varying set of coefficients $a_{i,t}$:

$x_t = \sum_{i=1}^{p} a_{i,t}\, x_{t-i} + \text{noise}$

• The $a_{i,t}$ capture the local statistics of the signal and can be used as features for classification in a BCI
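A minimal least-squares AR-fitting sketch in Python (NumPy); the order p and the synthetic signal are made up for the example.

```python
import numpy as np

def fit_ar(x, p):
    """Least-squares fit of AR(p) coefficients a_1..a_p for
    x_t = sum_i a_i * x_{t-i} + noise. A minimal sketch."""
    # Each row of X holds the p past values preceding one target sample
    X = np.column_stack([x[p - i:len(x) - i] for i in range(1, p + 1)])
    y = x[p:]
    a, *_ = np.linalg.lstsq(X, y, rcond=None)
    return a

# Hypothetical usage on a synthetic AR(2) process
rng = np.random.default_rng(0)
x = np.zeros(1000)
for t in range(2, 1000):
    x[t] = 1.2 * x[t - 1] - 0.5 * x[t - 2] + rng.normal()
print(fit_ar(x, p=2))  # close to the true coefficients [1.2, -0.5]
```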
17 R. Rao, 599E: Lecture 4
Bayesian Filtering:
Kalman filters and Particle filters
18 R. Rao, 599E: Lecture 4
Bayes’ Rule
$P(x, y) = P(x \mid y)\,P(y) = P(y \mid x)\,P(x)$

$\Rightarrow \quad P(x \mid y) = \frac{P(y \mid x)\,P(x)}{P(y)}$

posterior = (likelihood × prior) / normalizer
19 R. Rao, 599E: Lecture 4
Example: Estimating brain state from neural data
• Suppose we measure an ECoG signal z
• What is P(HandMovement | z)?
20 R. Rao, 599E: Lecture 4
Causal vs. Diagnostic Reasoning
• P(HandMovement | z) is diagnostic.
• P(z | HandMovement) is causal.
• Often causal knowledge is easier to obtain or learn (count frequencies in past data!).
• Bayes' rule allows us to convert causal knowledge to diagnose events:

$P(HM \mid z) = \frac{P(z \mid HM)\,P(HM)}{P(z)}$
21 R. Rao, 599E: Lecture 4
Hand Movement Example
• From past data: P(z | HM) = 0.6 and P(z | ¬HM) = 0.3
• Prior: P(HM) = P(¬HM) = 0.5

$P(HM \mid z) = \frac{P(z \mid HM)\,P(HM)}{P(z \mid HM)\,P(HM) + P(z \mid \neg HM)\,P(\neg HM)} = \frac{0.6 \cdot 0.5}{0.6 \cdot 0.5 + 0.3 \cdot 0.5} = \frac{0.3}{0.45} = \frac{2}{3} \approx 0.67$

If z is measured, it raises the probability of hand movement from 0.5 to 0.67.
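A one-screen numeric check of this calculation in Python:

```python
# Minimal numeric check of the hand-movement example above.
p_z_given_hm, p_z_given_not_hm = 0.6, 0.3   # likelihoods from past data
p_hm = p_not_hm = 0.5                       # uniform prior

evidence = p_z_given_hm * p_hm + p_z_given_not_hm * p_not_hm
posterior = p_z_given_hm * p_hm / evidence
print(posterior)  # 0.666..., i.e. the prior 0.5 rises to about 0.67
```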
22 R. Rao, 599E: Lecture 4
Combining Measurements over Time
• Suppose we make another measurement z_2
• How can we integrate this new measurement?
• More generally, how can we estimate P(x | z_1, ..., z_n)?
23 R. Rao, 599E: Lecture 4
Bayesian Filtering
By Bayes' rule:

$P(x \mid z_1, \ldots, z_n) = \frac{P(z_n \mid x, z_1, \ldots, z_{n-1})\; P(x \mid z_1, \ldots, z_{n-1})}{P(z_n \mid z_1, \ldots, z_{n-1})}$

Markov assumption: z_n is independent of z_1, ..., z_{n-1} if we know x. Therefore:

$P(x \mid z_1, \ldots, z_n) = \frac{P(z_n \mid x)\; P(x \mid z_1, \ldots, z_{n-1})}{P(z_n \mid z_1, \ldots, z_{n-1})} \;\propto\; P(z_n \mid x)\; P(x \mid z_1, \ldots, z_{n-1})$

where the second factor is the posterior from the previous time step! Unrolling the recursion:

$P(x \mid z_1, \ldots, z_n) \;\propto\; \left[\prod_{i=1}^{n} P(z_i \mid x)\right] P(x)$
Bayes’ Rule
24 R. Rao, 599E: Lecture 4
Kalman Filtering
• The state evolves linearly according to: $\mathbf{x}_t = A\mathbf{x}_{t-1} + \mathbf{n}_t$
• The measurement is linearly related to the state: $\mathbf{z}_t = B\mathbf{x}_t + \mathbf{m}_t$
• n_t and m_t are zero-mean Gaussian noise with covariance matrices Q and R respectively
• Bayesian filtering:

$P(\mathbf{x}_t \mid z_1, \ldots, z_t) \;\propto\; P(\mathbf{z}_t \mid \mathbf{x}_t)\; P(\mathbf{x}_t \mid z_1, \ldots, z_{t-1}) = P(\mathbf{z}_t \mid \mathbf{x}_t) \int P(\mathbf{x}_t \mid \mathbf{x}_{t-1})\; P(\mathbf{x}_{t-1} \mid z_1, \ldots, z_{t-1})\; d\mathbf{x}_{t-1}$

• Because everything is Gaussian and linear, the posterior is also Gaussian:

$P(\mathbf{x}_t \mid z_1, \ldots, z_t) = N(\hat{\mathbf{x}}_t, S_t)$
25 R. Rao, 599E: Lecture 4
Kalman Filter Equations
• Prediction from one time step to the next:

Mean: $\bar{\mathbf{x}}_t = A\hat{\mathbf{x}}_{t-1}$
Covariance: $M_t = A S_{t-1} A^T + Q$

• Correction upon measuring z_t:

Mean: $\hat{\mathbf{x}}_t = \bar{\mathbf{x}}_t + K_t\left(\mathbf{z}_t - B\bar{\mathbf{x}}_t\right)$
Covariance: $S_t = \left(B^T R^{-1} B + M_t^{-1}\right)^{-1}$

• The "Kalman gain" $K_t = S_t B^T R^{-1}$ dictates how much weight to give to the prediction error $(\mathbf{z}_t - B\bar{\mathbf{x}}_t)$ versus the prediction.
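Below is a minimal sketch of one predict/correct cycle in Python (NumPy); the model matrices are assumed given, variable names mirror the slide, and explicit inverses are used only for readability.

```python
import numpy as np

def kalman_step(x_hat, S, z, A, B, Q, R):
    """One predict/correct cycle of the equations above. A minimal sketch:
    x_hat, S are the previous posterior mean and covariance."""
    # Prediction
    x_bar = A @ x_hat                     # predicted mean
    M = A @ S @ A.T + Q                   # predicted covariance
    # Correction (covariance written in information form, as on the slide)
    S_new = np.linalg.inv(B.T @ np.linalg.inv(R) @ B + np.linalg.inv(M))
    K = S_new @ B.T @ np.linalg.inv(R)    # Kalman gain
    x_new = x_bar + K @ (z - B @ x_bar)   # weight the prediction error
    return x_new, S_new
```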
26 R. Rao, 599E: Lecture 4
Kalman Filter: Prediction and Correction Cycle
27 R. Rao, 599E: Lecture 4
The Kalman filter assumes a linear, Gaussian model.

How do we handle non-linearity and multi-modal distributions?
28 R. Rao, 599E: Lecture 4
Enter…Particle Filtering
• Samples approximate the posterior probability distribution
• Each new measurement provides importance weights (importance sampling)
• The prediction step propagates the samples and draws new ones
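A minimal bootstrap particle-filter sketch in Python (NumPy); the 1D dynamics, noise levels, and measurement are made up, and Gaussian densities are used only for illustration (the point of the method is that any non-linear, non-Gaussian model could be substituted).

```python
import numpy as np

rng = np.random.default_rng(0)
n_particles, a, q_std, r_std = 1000, 0.95, 0.5, 1.0
particles = rng.normal(0.0, 1.0, n_particles)  # samples from the prior

def particle_filter_step(particles, z):
    # Prediction: push every sample through the dynamics (plus noise)
    particles = a * particles + rng.normal(0.0, q_std, len(particles))
    # Importance weights from the measurement likelihood P(z | x)
    w = np.exp(-0.5 * ((z - particles) / r_std) ** 2)
    w /= w.sum()
    # Resampling: draw new samples in proportion to their weights
    idx = rng.choice(len(particles), len(particles), p=w)
    return particles[idx]

particles = particle_filter_step(particles, z=1.3)
print(particles.mean())  # posterior mean estimate after one measurement
```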
29 R. Rao, 599E: Lecture 4
Spatial Filtering (across electrodes) and
Feature Extraction
30 R. Rao, 599E: Lecture 4
Spatial filtering: Simple Techniques
• Bipolar filtering: $\tilde{s}_i = s_i - s_j$
• Laplacian filtering: $\tilde{s}_i = s_i - \frac{1}{4}\sum_{i' \in \mathrm{neighbors}(i)} s_{i'}$
• Common average re-referencing (CAR): $\tilde{s}_i = s_i - \frac{1}{N}\sum_{i'=1}^{N} s_{i'}$
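A minimal sketch of two of these filters in Python (NumPy); the channel count and data are made up.

```python
import numpy as np

def common_average_reference(S):
    """CAR: subtract the mean across all channels from every channel.
    S has shape (channels, samples). A minimal sketch."""
    return S - S.mean(axis=0, keepdims=True)

def bipolar(S, i, j):
    """Bipolar filtering: difference between two chosen channels."""
    return S[i] - S[j]

# Hypothetical usage on fake 32-channel data
S = np.random.randn(32, 1000)
S_car = common_average_reference(S)
```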
31 R. Rao, 599E: Lecture 4
What if you are faced with this…
http://people.brandeis.edu/~sekuler/imgs/rwsEEG.jpg
32 R. Rao, 599E: Lecture 4
Reducing Dimensionality of Input Data
High Density EEG Recording
Can we extract and use a small set of features rather than
the large number of input channels?
http://people.brandeis.edu/~sekuler/imgs/rwsEEG.jpg
33 R. Rao, 599E: Lecture 4
Principal Component Analysis (PCA)
• Convert high-dimensional input into a small set of linear "features" (e.g., 256 EEG channels → 10 "features")
• The features are determined by the directions of maximal variance in the data
• The data can be reconstructed (with some error) by a linear combination of the features

[Figure: example of projecting 2D data onto its first principal component (2D → 1D).]
34 R. Rao, 599E: Lecture 4
Principal Component Analysis (PCA)
• Suppose each of the N data points is L-dimensional. Form the L × L data covariance matrix:

$A = \frac{1}{N}\sum_{i=1}^{N} (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T$

• Compute the eigenvectors of A; these are the "feature" vectors or spatial filters
• Pick the K largest eigenvalues and use their eigenvectors as your features/filters
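A minimal sketch of this procedure in Python (NumPy); the data, dimensions, and choice of K are made up.

```python
import numpy as np

def pca_filters(X, k):
    """Return the top-k principal-component directions of X.
    X has shape (N points, L dims). A minimal sketch."""
    Xc = X - X.mean(axis=0)
    A = Xc.T @ Xc / len(X)              # L x L covariance matrix
    evals, evecs = np.linalg.eigh(A)    # eigenvalues in ascending order
    return evecs[:, ::-1][:, :k]        # columns = top-k eigenvectors

# Hypothetical usage: reduce 256 "channels" to 10 features
X = np.random.randn(5000, 256)
V = pca_filters(X, k=10)
features = (X - X.mean(axis=0)) @ V     # projection onto the filters
```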
35 R. Rao, 599E: Lecture 4
Example: PCA for images of faces
• PCA extracts the eigenvectors v_1, v_2, v_3, ..., v_K of the covariance matrix A
• Each one of these vectors is a direction in image space. What do these look like?
36 R. Rao, 599E: Lecture 4
Low Dimensional Representation and Reconstruction
Filtering: the input is converted to eigenvector coordinates using dot products ("projection"), yielding a compressed representation (K is usually much smaller than the size of the input); the image can then be reconstructed from these K coefficients.

[Figure: input image → K projection coefficients → reconstructed image.]
37 R. Rao, 599E: Lecture 4
PCA applied to EEG
(Jung et al., 1998)
[Figure: 5 of the 22 eigenvectors (spatial filters for the EEG channels) and the corresponding projection coefficients a_1, a_2.]
38 R. Rao, 599E: Lecture 4
Independent Component Analysis (ICA)
• PCA produces a matrix V of eigenvectors (columns of V) such that the output a is uncorrelated:

$\mathbf{a} = V^T(\mathbf{x} - \bar{\mathbf{x}})$

• ICA tries to find a matrix W of filters (columns of W) such that the output a is statistically independent:

$\mathbf{a} = W^T\mathbf{x} \quad \text{such that} \quad P(\mathbf{a}) = \prod_{i=1}^{M} P(a_i)$

• ICA assumes the sources are linearly mixed to produce x
• ICA can be used to recover the "independent sources" of mixed brain signals such as EEG
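As an illustration, here is a minimal unmixing sketch using scikit-learn's FastICA (one common ICA implementation, assumed installed); the sources and mixing matrix are made up.

```python
import numpy as np
from sklearn.decomposition import FastICA  # one common ICA implementation

# A minimal sketch: unmix two linearly mixed sources, the same setup
# ICA assumes for EEG (observed channels = mixtures of sources).
rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
sources = np.column_stack([np.sin(2 * t), np.sign(np.sin(3 * t))])
mixing = np.array([[1.0, 0.5], [0.4, 1.0]])
x = sources @ mixing.T                  # what the "electrodes" record

ica = FastICA(n_components=2, random_state=0)
a = ica.fit_transform(x)                # recovered independent components
```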
39 R. Rao, 599E: Lecture 4
ICA for unmixing EEG signals
http://sccn.ucsd.edu/~scott/icons/TimeCourseFig_web.jpg
40 R. Rao, 599E: Lecture 4
ICA for Artifact Removal in EEG
(Jung et al., 1998)
41 R. Rao, 599E: Lecture 4
Common Spatial Patterns (CSP)
• The data is labeled with the class to which each data vector belongs (e.g., EEG obtained for right- versus left-hand imagery)
• CSP finds a matrix W of spatial filters (columns of W), where

$\mathbf{a} = W^T\mathbf{x}$

such that the variance of the filtered data (components of the vector a) is maximized for one class while minimized for the other
• CSP filters can significantly enhance the ability to discriminate between the two classes
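A minimal CSP sketch in Python (NumPy/SciPy), using the standard generalized-eigendecomposition formulation; the trial shapes and data are made up, and this is one of several equivalent ways to compute the filters.

```python
import numpy as np
from scipy.linalg import eigh  # solves the generalized eigenproblem

def csp_filters(X1, X2):
    """Common Spatial Patterns via a generalized eigendecomposition.
    X1, X2: (trials, channels, samples) arrays for the two classes.
    A minimal sketch; filters are sorted by class-1 variance ratio."""
    C1 = np.mean([x @ x.T / x.shape[1] for x in X1], axis=0)
    C2 = np.mean([x @ x.T / x.shape[1] for x in X2], axis=0)
    # Eigenvectors w satisfy C1 w = lambda (C1 + C2) w; large lambda means
    # high class-1 variance and low class-2 variance, and vice versa.
    evals, W = eigh(C1, C1 + C2)
    return W[:, np.argsort(evals)[::-1]]

# Hypothetical usage with fake 8-channel trials
X1 = np.random.randn(30, 8, 500)
X2 = np.random.randn(30, 8, 500)
W = csp_filters(X1, X2)
a = W.T @ X1[0]   # one spatially filtered trial
```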
42 R. Rao, 599E: Lecture 4
CSP applied to EEG for Right/Left Hand Imagery
http://www.sciencedirect.com/science/article/pii/S0165027007004657
[Figure: CSP-filtered EEG for right- versus left-hand imagery; the filtered signal shows less variance for one class than for the other.]
43 R. Rao, 599E: Lecture 4
Components of Brain-Computer Interfacing
[Figure: block diagram of the components of a BCI, with the "Classification or Regression" stage highlighted.]
44 R. Rao, 599E: Lecture 4
Outline
• Regression
  - Linear regression
  - Backpropagation neural networks
• Classification
  - Linear classifiers
  - Support vector machines
• Cross-validation
  - Model selection, preventing overfitting
45 R. Rao, 599E: Lecture 4
Linear Regression
x (input)   y (output)
1           3.1
2           6.4
3.1         8.9
0.9         2

Assumption: the output is a linear function of the input, i.e.,

$y_i = w x_i + \text{noise}$

where the noise is independent and Gaussian with unknown, fixed variance.
46 R. Rao, 599E: Lecture 4
Linear Regression
Given: data (y_i, x_i) where the y_i are drawn from $N(w x_i, \sigma^2)$.

The likelihood of the data (y_i, x_i) for a given w is:

$\prod_i p(y_i \mid w, x_i) \;\propto\; \prod_i \exp\!\left(-\frac{(y_i - w x_i)^2}{2\sigma^2}\right)$ (ignoring constants)

Goal: maximize the likelihood of the data given w,
i.e., maximize $\sum_i -\,(y_i - w x_i)^2 / 2\sigma^2$,
i.e., minimize $\sum_i (y_i - w x_i)^2$.

One can show that $w = \frac{\sum_i x_i y_i}{\sum_i x_i^2}$.
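A quick numeric check of this closed form in Python (NumPy), using the small data table from the earlier slide:

```python
import numpy as np

# Maximum-likelihood slope for y = w*x + Gaussian noise, via the
# closed form w = sum(x*y) / sum(x^2) derived above.
x = np.array([1.0, 2.0, 3.1, 0.9])
y = np.array([3.1, 6.4, 8.9, 2.0])   # the data table from the slide
w = np.sum(x * y) / np.sum(x ** 2)
print(w)  # roughly 3: each unit of input adds about 3 to the output
```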
47 R. Rao, 599E: Lecture 4
But… typically, inputs in BCIs are vectors: multiple neurons' activities, multiple EEG measurements, etc.

We need multivariate regression: an input vector mapped to an output (e.g., hand position).
48 R. Rao, 599E: Lecture 4
Multivariate Linear Regression
Suppose the inputs x_i are n-element vectors: $y_i = \mathbf{w}^T\mathbf{x}_i + \text{noise}$.

Write the m inputs as the rows of a matrix X and the outputs y_i as a vector Y:

$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & & & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{mn} \end{bmatrix} \qquad Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}$

Then Y = Xw + noise. Find the maximum-likelihood w by minimizing $\|Y - X\mathbf{w}\|^2$ over all the data:

$\mathbf{w} = (X^T X)^{-1} X^T Y$
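A minimal sketch of the normal-equations solution in Python (NumPy), on made-up data with a known weight vector; np.linalg.solve is used instead of an explicit matrix inverse.

```python
import numpy as np

# Recover w = (X^T X)^{-1} X^T Y on synthetic data with known weights.
rng = np.random.default_rng(0)
w_true = np.array([0.5, -1.0, 2.0])
X = rng.normal(size=(200, 3))                # m=200 inputs, n=3 features
Y = X @ w_true + 0.1 * rng.normal(size=200)  # linear outputs plus noise

w = np.linalg.solve(X.T @ X, X.T @ Y)        # solves the normal equations
print(w)  # close to [0.5, -1.0, 2.0]
```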
49 R. Rao, 599E: Lecture 4
Nonlinear Regression: Neural Networks
Want to learn a mapping from inputs to outputs, given training data (u^m, d^m).

A single unit computes, for input u = (u_1, u_2, u_3)^T and weights w:

$v = g(\mathbf{w}^T\mathbf{u}) = g(w_1 u_1 + w_2 u_2 + w_3 u_3)$

The most common activation function is the sigmoid:

$g(a) = \frac{1}{1 + e^{-\beta a}}$

which squashes w^T u to be between 0 and 1; the parameter β controls the slope.

[Figure: three input nodes u_1, u_2, u_3 feeding weights w into output unit v, and a plot of the sigmoid g(a).]

How is w learned?
50 R. Rao, 599E: Lecture 4
Multilayer Networks
[Figure: a two-layer network. Input u = (u_1, u_2, ..., u_K)^T feeds a hidden layer through weights w; the hidden layer feeds outputs v = (v_1, v_2, ..., v_J)^T through weights W. Desired output = d.]

How do we learn these weights?
51 R. Rao, 599E: Lecture 4
Idea: “Backpropagation” Learning Rule
Start with random weights {W, w}. Given input u, the network produces output

$v_i = g\!\left(\sum_j W_{ij}\, g\!\left(\sum_k w_{jk} u_k\right)\right)$

Find W and w that minimize the total squared output error over all output units (labeled i):

$E(W, \mathbf{w}) = \frac{1}{2}\sum_i (d_i - v_i)^2$
52 R. Rao, 599E: Lecture 4
Backpropagation:
Output Weights
Learning rule for the hidden-to-output weights W {gradient descent}:

$W_{ij} \leftarrow W_{ij} - \epsilon\,\frac{dE}{dW_{ij}} \qquad \frac{dE}{dW_{ij}} = -(d_i - v_i)\; g'\!\left(\sum_j W_{ij} x_j\right) x_j$

where $x_j = g\!\left(\sum_k w_{jk} u_k\right)$ is the output of hidden unit j, so that $v_i = g\!\left(\sum_j W_{ij} x_j\right)$ and $E(W, \mathbf{w}) = \frac{1}{2}\sum_i (d_i - v_i)^2$ as before.
53 R. Rao, 599E: Lecture 4
Backpropagation: Hidden Weights

Learning rule for the input-to-hidden weights w {gradient descent}:

$w_{jk} \leftarrow w_{jk} - \epsilon\,\frac{dE}{dw_{jk}} \qquad \frac{dE}{dw_{jk}} = \frac{dE}{dx_j}\cdot\frac{dx_j}{dw_{jk}}$ {chain rule}

But:

$\frac{dE}{dw_{jk}} = -\sum_m \sum_i \left(d_i^m - v_i^m\right) g'\!\left(\sum_j W_{ij} x_j^m\right) W_{ij}\; g'\!\left(\sum_k w_{jk} u_k^m\right) u_k^m$

where $x_j^m = g\!\left(\sum_k w_{jk} u_k^m\right)$ and $v_i^m = g\!\left(\sum_j W_{ij} x_j^m\right)$ for training example m.
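A minimal sketch of these updates for one training example, in Python (NumPy); the layer sizes, data, and step size are made up, and the sigmoid derivative g'(a) = g(a)(1 - g(a)) is used.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
K, J, I, eps = 3, 4, 2, 0.1          # input, hidden, output sizes; step size
w = rng.normal(size=(J, K))          # input -> hidden weights
W = rng.normal(size=(I, J))          # hidden -> output weights

u = rng.normal(size=K)               # one training input
d = np.array([0.0, 1.0])             # desired output

# Forward pass
x = sigmoid(w @ u)                   # hidden activities x_j
v = sigmoid(W @ x)                   # outputs v_i

# Backward pass (gradient descent on E = 0.5 * sum((d - v)^2))
delta_out = (d - v) * v * (1 - v)            # output-layer error terms
delta_hid = (W.T @ delta_out) * x * (1 - x)  # backpropagated hidden terms
W += eps * np.outer(delta_out, x)    # hidden -> output update
w += eps * np.outer(delta_hid, u)    # input -> hidden update
```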
54 R. Rao, 599E: Lecture 4
Example Application in BCI
(Nicolelis, 2001)
55 R. Rao, 599E: Lecture 4
Motivation: Why classification?
• Many BCI applications require selecting 1 out of N commands or menu choices, e.g.:
  - Word spellers: select 1 out of 26 letters
  - Menu with icons: select 1 out of N icons
  - Semi-autonomous robot: select 1 out of N high-level commands
• Classification techniques can be used to classify a given brain signal into 1 of several classes. Simplest case: 2 classes (binary classification)
56 R. Rao, 599E: Lecture 4
Binary Classification: Example
Suppose a BCI is used to move a 1D cursor left or right, using brain signals for imagined left-hand versus imagined right-hand movement.

[Figure: scatter plot of data points for left-hand movements (class C1) and for right-hand movements (class C2).]

How do we classify new data points?
57 R. Rao, 599E: Lecture 4
Binary Classification: Linear Classifiers
Find a line (in general, a hyperplane) (w, b) separating the two sets of data points:

$g(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b = 0, \quad \text{i.e.,} \quad w_1 x_1 + w_2 x_2 + b = 0$

For any new point x, choose class C1 if g(x) > 0 and class C2 otherwise.

[Figure: the separating line g(x) = 0 in the (x_1, x_2) plane, with g(x) > 0 on the C1 side and g(x) < 0 on the C2 side.]
58 R. Rao, 599E: Lecture 4
Linear Discriminant Analysis (LDA)
Assume each class is a Gaussian cloud:

$P(\mathbf{x} \mid C_i) = N(\boldsymbol{\mu}_i, \Sigma), \quad i = 1, 2$

with a shared covariance Σ. Then w and b are computed as follows:

$\mathbf{w} = \Sigma^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) \qquad b = -\mathbf{w}^T(\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2)/2$

One can show that this is equivalent to a log-likelihood ratio test for class 1 versus class 2.

[Figure: two Gaussian clouds C1 and C2 separated by the hyperplane w^T x + b = 0.]
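A minimal sketch of these formulas in Python (NumPy), assuming equal class priors; the pooled-covariance estimate used here is one common choice.

```python
import numpy as np

def lda_fit(X1, X2):
    """LDA with a shared covariance, as on the slide:
    w = Sigma^{-1}(mu1 - mu2), b = -w^T (mu1 + mu2) / 2.
    X1, X2: (points, dims) arrays for the two classes. A minimal sketch."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    Xc = np.vstack([X1 - mu1, X2 - mu2])
    Sigma = Xc.T @ Xc / len(Xc)          # pooled covariance estimate
    w = np.linalg.solve(Sigma, mu1 - mu2)
    b = -w @ (mu1 + mu2) / 2
    return w, b

# Classify a new point x as C1 if w @ x + b > 0, else C2
```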
59 R. Rao, 599E: Lecture 4
Actually, many possible separating lines…
Which one is best?
60 R. Rao, 599E: Lecture 4
Support Vector Machines (SVMs)
Choose the hyperplane with the largest "margin": this facilitates generalization to new data points.

It can be shown that the margin equals $2/\sqrt{\mathbf{w}^T\mathbf{w}}$.

Maximize the margin subject to the following constraints:
$\mathbf{w}^T\mathbf{x}_i + b \geq +1$ for class-1 points, $\mathbf{w}^T\mathbf{x}_i + b \leq -1$ for class-2 points.

Solved with quadratic programming.
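For illustration, here is a minimal linear-SVM sketch using scikit-learn (assumed installed); the data are made up, and SVC solves the quadratic program internally.

```python
import numpy as np
from sklearn.svm import SVC  # one widely used SVM implementation

# A minimal sketch: a linear SVM on two made-up Gaussian classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([1] * 50 + [2] * 50)          # class labels C1, C2

clf = SVC(kernel='linear', C=1.0)          # C trades margin vs. violations
clf.fit(X, y)
print(clf.predict([[1.5, 2.0]]))           # classify a new point
```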
61 R. Rao, 599E: Lecture 4
Overfitting and Generalization
Question: which line is the best fit? The squiggly line overfits the data and won't generalize to new data.

Solution: hold out some of the points and test regression or classification performance on the held-out points. But what if the held-out points are bad (uncharacteristic of new data)?

[Figure: linear, quadratic, and "squiggly" regression fits to the same data, with one holdout point.]
62 R. Rao, 599E: Lecture 4
Cross-Validation and Model Selection
• Repeat for different subsets of held-out points. For each choice of held-out points:
  - Fit the model on the remaining points
  - Evaluate the error on the held-out points
  - Add it to the total error score (this is the generalization error)
• Choose the model with the least generalization error
63 R. Rao, 599E: Lecture 4
Types of Cross-Validation
• K-fold cross-validation: split the data into K blocks; each block is used once as the hold-out set.
• Leave-one-out cross-validation: each point is held out once, the model is trained on the remaining points and tested on the held-out point, and the cycle is repeated for all points.
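A minimal generic K-fold sketch in Python (NumPy); the fit and error callables are placeholders for whatever model and loss are being compared.

```python
import numpy as np

def kfold_error(X, y, fit, error, k=5, seed=0):
    """Generic K-fold cross-validation, as described above.
    fit(X, y) -> model; error(model, X, y) -> scalar. A minimal sketch."""
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, k)
    total = 0.0
    for held_out in folds:
        train = np.setdiff1d(idx, held_out)
        model = fit(X[train], y[train])                  # fit on the rest
        total += error(model, X[held_out], y[held_out])  # test on hold-out
    return total / k

# E.g., compare linear vs. quadratic regression by their K-fold error
# and keep the model with the smaller score.
```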
64 R. Rao, 599E: Lecture 4
Next Class:
Invasive BCIs – Early Studies
Presenters:
Shin Kira
&
Miah Wander