Appendix: Guaranteed Scalable Learning of Latent Tree Models
A RELATED WORKS
There has been widespread interest in developing distributed learning techniques, e.g., the recent works of Smola and Narayanamurthy (2010) and Wei et al. (2013). These works consider parameter estimation via likelihood-based optimization such as Gibbs sampling, while our method tackles the more challenging task of estimating both the structure and the parameters. Simple methods such as local neighborhood selection through ℓ1-regularization (Meinshausen and Bühlmann, 2006) or local conditional independence testing (Anandkumar et al., 2012c) can be parallelized, but they do not incorporate hidden variables. Finally, note that latent tree models provide a statistical description in addition to revealing the hierarchy. In contrast, hierarchical clustering techniques are not based on a statistical model (Krishnamurthy et al., 2012) and cannot provide valuable information such as the level of correlation between observed and hidden variables.
Both Chang and Hartigan (1991) and Chang (1996) provide detailed identifiability analyses of latent trees, and the former (Chang and Hartigan, 1991) introduces a maximum likelihood estimation method, which differs from our guaranteed parallel spectral method based on tensor decomposition.
Parikh et al. (2011) and Song et al. (2011, 2014) are closely related to our paper. Parikh et al. (2011) do not address structure learning, and their sample complexity is polynomial in k, whereas ours is logarithmic in k. Song et al. (2014) provide guarantees for structure learning, but not for parameter estimation. Song et al. (2011) provide neither a sample complexity analysis nor provable guarantees for recovering the latent tree structure or parameters.
B SYNTHETIC EXPERIMENTS
B.1 Machine Setup
Experiments are conducted on a server running Red Hat Enterprise Linux 6.6 with 64 AMD Opteron processors and 265 GB RAM. The program is written in C++ and uses the multi-threading capabilities of the OpenMP environment (Dagum and Menon, 1998) (version 1.8.1). We use the Eigen toolkit¹, which incorporates BLAS operations. For SVDs of large matrices, we use randomized projection methods (Gittens and Mahoney, 2013b) as described in Appendix L.
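As a concrete illustration of the randomized-projection SVD idea used for large matrices, here is a minimal sketch in the spirit of Gittens and Mahoney (2013b). This is not the paper's C++/Eigen implementation (described in Appendix L); the function and parameter names are our own.

```python
import numpy as np

def randomized_svd(M, k, oversample=10, n_iter=2, seed=0):
    """Sketch of a randomized SVD: project M onto a random low-dimensional
    subspace capturing its range, then take an exact SVD of the small
    projected matrix. Returns the top-k factors."""
    rng = np.random.default_rng(seed)
    n = M.shape[1]
    # Random test matrix; oversampling improves the range approximation.
    Omega = rng.standard_normal((n, k + oversample))
    Y = M @ Omega
    # Power iterations sharpen accuracy when singular values decay slowly.
    for _ in range(n_iter):
        Y = M @ (M.T @ Y)
    Q, _ = np.linalg.qr(Y)        # orthonormal basis for (approx.) range(M)
    B = Q.T @ M                   # small (k + oversample) x n matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]
```

The expensive step is the matrix product with the thin random matrix, which parallelizes well, matching the multi-threaded setting described above.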
B.2 Synthetic Results
We compare our method with the implementation of Choi et al. (2011), which carries out a serial learning procedure for binary variables (d = 2) with parameter learning via EM. We consider a latent tree model over nine observed variables and four hidden nodes. We restart EM 20 times and select the best result. The results are shown in Figure 3.
We measure the structure recovery error via
$$\mathcal{E}_{\mathrm{structure}} = \frac{1}{|\widehat{G}|} \sum_{i=1}^{|\widehat{G}|} \min_{j} \frac{|\widehat{G}_i \setminus G_j|}{|\widehat{G}_i|},$$
where $G$ and $\widehat{G}$ are the ground-truth and recovered categories. We measure the parameter recovery error via $\mathcal{E} = \|A - \widehat{A}\Pi\|_F$, where $A$ is the true parameter, $\widehat{A}$ is the estimated parameter, and $\Pi$ is a suitable permutation matrix that aligns the columns of $\widehat{A}$ with those of $A$ so that they have minimum distance; $\Pi$ is computed greedily. As the number of samples increases, both methods recover the structure correctly, as predicted by the theory. However, EM gets stuck in local optima and fails to recover the true parameters, while the tensor decomposition recovers them correctly.
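The two error measures can be sketched as follows. This is an illustrative Python reading of the definitions, not the paper's code; the greedy column alignment and all names here are our own.

```python
import numpy as np

def structure_error(G_true, G_est):
    """Average, over recovered groups, of the fraction of members falling
    outside the best-matching ground-truth group (our reading of the metric)."""
    errs = [min(len(g_est - g) / len(g_est) for g in G_true) for g_est in G_est]
    return sum(errs) / len(G_est)

def greedy_align(A_true, A_est):
    """Greedily match each column of A_est to the closest unused column
    of A_true, returning the target position of each estimated column."""
    k = A_true.shape[1]
    perm, used = np.empty(k, dtype=int), set()
    for j in range(k):
        dists = [np.inf if i in used else
                 np.linalg.norm(A_true[:, i] - A_est[:, j]) for i in range(k)]
        perm[j] = int(np.argmin(dists))
        used.add(int(perm[j]))
    return perm

def param_error(A_true, A_est):
    """E = ||A - A_hat * Pi||_F after greedy column alignment."""
    perm = greedy_align(A_true, A_est)
    A_aligned = np.empty_like(A_est)
    A_aligned[:, perm] = A_est   # column j of A_est goes to position perm[j]
    return np.linalg.norm(A_true - A_aligned)
```

Greedy alignment is not guaranteed to find the optimal permutation, but it is cheap and, as in the text, is used only to report errors.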
We then present experimental results focusing on the large-p and even larger-d regime; see Table 2. We achieve efficient running times with good accuracy for both structure and parameter estimation.
We perform experiments on a synthetic tree with observable dimension d = 1,000,000, number of observable nodes p = 3, hidden dimension k = 3, and number of samples N = 1000, which is similar to the community detection setting of
¹ http://eigen.tuxfamily.org/index.php?title=Main_Page
Huang et al. (2013). We note that the recovery of the structure and the parameters completes in 10.6 seconds.
C EXPERIMENTS ON THE NIPS AND NY TIMES DATASETS
We use the entire NIPS dataset from the UCI bag-of-words repository, which consists of 1500 documents and 12419 words in the vocabulary. We estimate the hierarchical structure of the whole corpus in only 15332.4 seconds (about 4 hours). Additionally, we focus on the 5000 most frequent keywords and illustrate the local structures in Figure 5. The running time for this subset is only 1164.4 seconds (about 20 minutes).
We also use the NY Times dataset from the UCI bag-of-words repository, which consists of 300000 documents and 102660 words in the vocabulary. Our algorithm estimates the hierarchical structure of the 3000 most frequent keywords in the NY Times dataset in only 107.8 seconds. Note that d = 2, since we treat the occurrence of a word in a document as a binary variable. A subset of the estimated keyword graph is shown in Figure 6. The relationships among the words in Figure 6 match intuition: for example, govern and secur are grouped together, whereas movi, studio and produc are grouped together. The numbers represent hidden nodes.
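Since d = 2 here, each word is reduced to a per-document occurrence indicator before learning. A minimal sketch of that binarization on a toy term-count matrix (the matrix values are hypothetical, not from the NY Times data):

```python
import numpy as np

# Hypothetical toy term-count matrix: rows are documents, columns are words.
counts = np.array([[3, 0, 1],
                   [0, 2, 0],
                   [1, 1, 4]])

# d = 2: each word becomes a binary occurrence variable per document,
# i.e. 1 if the word appears in the document at all, 0 otherwise.
occurrence = (counts > 0).astype(np.int8)
```

Binarizing discards word frequencies but keeps the co-occurrence pattern, which is what the latent tree over keywords models.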
[Figures 5 and 6: estimated latent tree structures over the NIPS and NY Times vocabularies. The node labels — keywords and numbered hidden nodes — are not reproduced here.]
baxter
investment790
708
localized
vertically
issue
simplified
saliency
191
eyes
221
wileyinter
fault
cmos
transcription
stimulation
technion
received
answer
intrinsic
investigatefiber
insert
220
intrusion
prob
intracortical
176infeasible
guyon
taylor
est885
720
navigation
generalisation
matched
included
profile
moved
existence
629
wolpert
add
bourlard
articulator
plane
186
synchrony
nonparametric240
observing
receptordotted
dynamically
parametrized
recollection
pay
867
contract
873
conjunction
eeckman
physicallyinducing
game
tended
ous
gap
tube
downward
singular
849
recogni
straightcrossing
burst
reach
absence
sch61kopf
nowlan
468
asynchronous
decoding
coarse
growth
receives
fundamental
stereo
703
learnt
turning
severe
judd
tinunable
amplifier
consonant
subscript
omitted
edited
ballard
contraction
238
241
neuroscienceexploiting
uni
intended
displayed
successful
307
generatordivision
mixing
demand 842
capabilities
lippmann 287
established
ranged
455
relaxed
188 tern
tracking
ideal
deltaacceleration
pulse
ary
rotated
kullback
hessian
factorial
spatio
returnhastie
security
cholinergic
preliminary
pool
879
tishby
contact
integrator
prop
book
fish
gilbert
act
experi
replaced
harmony
notes
odor
tag
583
tour
rank
701 ullman
describing
creates 237
body
movestom
updated
tractable
752
executing
echoes
elevator
hamiltonian
interaural
hypothesized
reflected
hoc
sweeping
prefer
vor
graf
union
incoming
700
picture
classifying
biophysicalrequirement
iterative
decreased consistently
driving
euclidean
inverted
recognized
chunking
cur
covariances
584
indication
328
bag
sharp
dfa
queries
beginning
hilbert
comp
disturbances
compo
749maryland
rectified
mfa
meg
modulatory
890
ftp
ltmhubbard
adversary
translated
welch
manual
walk presentation
backgammon
329
production
empty
668
557
compete
switches
751
750
superimposed
thermodynamic
familiarity
unseen
boolean
symmetry
duration
generative
354
height
springer
196
notice
genetic
applicableebnn
want
device
detailed developmentaldiscretized forgeries fraud
159
adam
arg
frequent
tspprocessed
expressed
distractor
fewer
eigenvalue
dataflowdelayed
327
236206
proceed
compartmentsity
slices
323774sen
hyperplaneanatomy
splice arpa
amacrineamerica
812
assistance
impose
discontinuity
dominated
eter
discriminative
dominance monkey
utilize
322
categorization
regulation
priority
563
326496
exceed
intensities
586
equality497
qos
maintained
parzen
execute
distribu
seed
fellowship
car
diagnostic
employingedinburgh
neurophysiological
hyperparameter
diamond
maximized
marker
cpg
aston
isotropic
observablebinding
composed
singh
simplicity
put
traditional
recognizer
jacob
blob
accordance
anna
distri
678
landmark
579
passing
578
guided
ease
frontal
olfactory
discretizationdiscover
labeling
robustness
interconnection
kearn
opper
624
emphasizednames
tune
580fea
atkeson
cmu
empirically
elman
immediately
331
flexible
load
interact
microstructure
mismatch
multinomial
copy
malsburg
lot
selecting
post
526
680
char
apg
679
const
house
cubic
chris
decorrelation
talker
bridle
switched
dense
counterpart
consequences
tension
afosr
arc
category
prioritized
mpc
663
oscillatory
nested
capture
additionally
pos179
earlier
wall
interpolated
81
idea
cycle
nat
90
gerstner
annealed
arrangedrecovered
192
terminalmechanical
australia
usa
false
lapedeslawrence
regarding
introduces
attractive
branches
climbing
weakly
info
607
608
simplify
funded
foundation
typical
dark
ideally
consequence
comes
525
dividing
fired
ocular phy
german
129
depending
hierarchical sec
609summarizesadded
edge
expect
relates
insufficient
54
conventional
oscillator
pruned
pass
industrial
influenced
exactly
firmpca
utilized
ucsd
disorder
episodes
attached
contralateral
august
exhibited
nonlinearities
onr
realize
rubin
poses
preserved
july
finer
561
attempted
concentrated
economic
asked
pub
hopf
hid
street
speedup
speaking
phonemic
outperformed
subsequently
planes
observables
parti
718
527
possibilities
deep
582
ple
581
heavily
handling
94
chauvin
clauses
valued
dimensionality 695
site
modes
miller
ca3
reason
639
clinical
reduce
epsilon
166categories
604pomdp
broadly
radial
australian
isodeterminant
differing
303
ramtermination
neumann
296
640
hancock
summarized
leaming
attribute
commonlyopti
fsa
mathematic
correlogramoptimisation
devices
relevance
helmholtz
labeledwavelet
geometryedward
boston370
638
573
increase
eigenfunction
listed
willshaw
rewritten
attractor
monocularparse
topography
corrupted
181
layercontribution
mean
rissanen
overhead
quasi
permutation
jose
paul
port
hypothesis
worth
surveyprogram
sparse somatic
reflection
society
cycles
metricamplitude
uncorrelated
allen
behave
propagating
con
tangent
address
reduction
modulesdavid
sciences385
invariantfactored
formulated
enhancement 495
interface
gestures
coolen
mostafa
globally
383
alway
causal
imply
equivalently
call
stage
geiger
signature
compensate
entropy
exception
convergent
anatomicalconvex
smoothly
yale
208
infor
representative
resource
jain
fusion
acknowledge
535
deriving
bialek
annal
exponent
normal
125modified
connectivity
optimized379
require
naturally
sure
scaling
636
parameterizationguarantees
bandwidth
examining
163acquisition
limyielding
zemel
567
func
gordon
168
increased
262
30
enables
sign
synchronization
presenting
germanywesley
traces
irregular ture
viewpoint
popular
caltech
differentiable
ral
converter
iiii
achieve
448specifically
376
ensemble
tracesteadyvalid
actor
hybrid
redundancy
avoid
sensitive
261
moody
front
individually
cole
obermayer
cope
separator
676smo
stay
dax
882
vice
565
bounding
terrence
subnetwork
shannon
imagine
consumption
valiant
288
picked
stress
quinlan
tanh
carefully
conclude
378
tell
cooperative
confusion
544
fuzzy
yuille
temporally
390
suppression
encoded
rat
cascade
660
promising
smithdog
adjacent mar
mechanic null
session
stem
directionalowl
excitation
198
corporate
preferred
146
295
inherent
generation
exemplar
circular
clearly
operate
arithmetic
price
frequencies
661
berlinmea
establishes
representa
frames
electricalmolecular
capacitanceformed
scenario
284
expensive
audio
615
rectangle
uncertain
reaction
vec
stroke
insect
macroscopic
substantial
exper
fiat
challenging
demonstrate
diag
commercial
drift
roles
tied
elemental
largely
regularized
564
multiscale
jacobian
616
diffuse
evoked
weak
computationally
integer
driven
perceptual
391
reversal
605
properly
reverse
494
redundant
searches
formal
burges
pyramidal
biology
independence
highly
formula
universal
optimum
perspective124
synthesis
sensor
projected
prototypes
355
implication
perpendicular
tank
outer
mobile
koch
hippocampal
www
381
mind
347
neocortex
planck
mouth
slight
meet
solla
346
ranking
distal
markovian
533
controlledacid
achieving
activities180
380
367
ask
minus
formance
storing
geniculate
regimes
rnn
takesmagnetic
engineering
calculated
nec
trees
ocr
cybernetic
branch
26
805
movellan
understood
incorrect
concerning
leenharmonic
description
generate
college
transmitter
91
193atr
expression
vestibulo
link
hardware
exponential
ith
comparable
377
moss
estimating
room
335
softmax
trade
extended
consistencyhelp
studies
83
velocity43
shape
measures converge
percepthour
84
statement
http
sophisticated
girosi
adapting
experimentally
solving
segmentation
dorsal
missing
event
denver
637
fast
repeat
emphasisauxiliarysetpoint
mediumconflict
uci
producingpatches
coming
curse694
mathematicalsites
covering
nonzero
thinleaving
554
douglas
aij
lookup
sigma
depth
burr
lett measuring
mitchell
reconstruct
729
ble
tibshirani
ward
intensity
subthreshold
store trajectories
generalized
incorporate
oscillation
partition
fairly
sense
entire
sufficient
bar axis
aspect
month
son
child
var
spanned
568 probable
discharge
ent
validity
segregation pomerleau
coded
corrective
semi
describes
japan
teacher
diagonal
shorter
machines
differential
majority
sigmoid
439
hyper
incorporation
318
excellent
ambiguous
monotone
injected
574
ferent
643
alternatives
role
sig
careful
sketch
increasing
surrounding
sentation
555
570
nition
minimizes
principle
improved
mammalian
556
algebrainfinitely
plate
community
645 aaai
ultimately
convert
computa
certainlywake
adequate630
tanaka
weaker
regularization
english
potentially
completion
architectural
tau 733434
acceptable
accounted569
solely
verify
decreasing
accepted
avoidance571
dichotomy
murray
wiesel
pole
simultaneousdichotomies
299
594
addison
algebraic
combines
expense
female
sompolinsky
keeping
connecting
balance
elsevier
performances
aligned
blake
541yes
652
identityincreasingly
ratios184
physiology
essentially
smoother
strict
vapnik
471
serve
material
300
law
449
427
slicedolphin knot
explicit
barrier
intl
freundcomplementary
nucleus99transconductance
resources
stopping
644chordimplicit
tissue
market
writergener
suitable
ure
172arise
647
sheet
interspike
428
ring
631881
189
multiplysung
influencesspaced
raised
exclusive
proved
copies
started
certainty
birmingham
532
336
retain
cochlear
pyloric
grossberg
ball
conceptual
512
occurrences
769
conditionally
443
breaking
restoration
specificity
incomplete
consisted
resampling
star
freeman
zip
hyperplanes
doya
ence
produces
encouraging
768
generalizing
potassiumshadingobviously
perceived
mask
lane
overview
95
divide
overcome
430
bead
702
assessed
kaufman
stephen
386
understand
princeton
investigated
graduate
170
platt
extract
discovering
pin
drawing
lighting
council
subscriber
solves
fom
tradeoff
successfully
indexed
termed
struc
646
grateful
thresholding
rabiner
purkinje
locate
632
misclassification
correlate
bird
orr
varies
salk
omit
onset
achieves
majoractivate
interested513
ill
treating
343
ongoing
sun
345
avoided
species
344
comment
platform
outcome
inf
ideas
palmer
autoregressive
embedded
formation
nil
generarive
743
laser
merit
modifying
mining
ponent
unfortunately
winning
indepen
lisberger
shepard
induces 742
summarize
torque
apparently
precisely
wider
read
functionality
bear
austin
andreou
698worst
fir
anthony
bottleneck
emergent
bootstrap
capital
verb
cumulant
bottou
distinguishing
bracket
763
vis
generaliza
assessment
incorrectly
fin
nuclei
fractal
304659
ear
jabri
uncertainty
equalizer
150 enforce
spiking
boyan
696
massively
bienenstock
mcmc
fox
encode
884
played
inaccurate
883
investigator
hole
297
stages
dataset
physical
refined
585
publication
originally
adjust
simard
martin
recog
posteriori
tij
insensitive
lagrangian
learnable
enter
674
noun
satisfaction
esti
loading
excess
127
shavlik
softassign
simplifying
641
matlab
replacement
evolution
indexing
neglect
rochester
psom
910
preferences
ini
rejected
1 inform
744
predictable
auto
marder
indi
vocal
responsibility
748 andersen745
architec
september
friedman
london
126
preprintplaces
606
projecting flexibility
parallelism
ritter
binocular
691
alvinn
spite
692
utilizes
benchmark
722
soon
discriminant
showingvariant magnitudes
813
art2
metamackey
200
positioning
letting
340
brown
complicated
collectedgrowing
sian
penalized
pairedequally
nip
246june
plausible
rao
academic
611
pentland
physiologically
laplacian
handbook
disturbance
substitute
255transmit
electron
tokyo
256
modal
gram
pearlmutter
253
ground
msedifferentiation
lobe
france
481
740
concrete
compositional
tech
characterizes
ization
harder
muscles
collision
gauss
capacitor
alternate
merely
houk
west
gence
automaton
lecture
id3
codeword
families
rewrite
remarkable
ically
cover
hyperpolarization
exploitation
iiiii
optimizing
consuming
solvable
calculating
classify
lazzaro
cooperation
dirichlet
math
ijcnn
holding
anderson
distortion
hint
extremely
successive
792
254
amari
correlates
wij
talk
sided
classified
177
observer
recording
johnson
strongly
chemical
circumstances
264
131
branching
approximating
vertex
697
structural
precise
stereogram
132
spaces
manipulation
widrow
hughes
rithm
bandpass
discusses
mellon
examination
failures
percentage
demonstrating
precision178absolute
quantities
251
corn
propulsion
quantizer
learnability
817
spring
591
host
log2
821
linking
ionic
jitter
jersey
implying
write
sees
723
sensorimotor
generalize
sin
providing
rest
incremental
parametric
wave
automatically
updating
lowpass
820
illusory
822
poor
imation
visible
parent
spiral
usual
alan
satisfies
selectivity
dual
mcclelland
variational
oriented
maximization
orthogonal
root
axon powerful
213
difficulty
sampled
young
223
momentum
475
bifurcation
unlike
rtrl
attributed
746
scanned
sake
721
764
subsystem
saturated
762
scalp
agrees
snapshot
supplied
761
instability
push
medicine
isolation
raammarcus
nontrivial
760
opinion
peginferring
selectively
responding
subtask
synchronized
tendency
simplification
612
generafive
260
819
transient
offer
neighboring
explored
office
possibility
lin
combine
panel
224
issues
pursuit
214
818
nearly
219
waveform
gaus
devised
denominator
deter
bergen
815removed
difficulties
inequality
lagrangereaching
minutes
404
linearity
simplest
familiar
linguisticmedian
alarm
impulse
impossible
detected
guarantee
ignores
horiuchi
horizontally
hrr
explanation
sinusoidal
interestingly
attended
comer
rhythmic
jolla
ate
139
758
evenly
property
holmdel
unique
edu
111handled
suited
ory
cognitive
incorporates
expand
xly
optical
scales
hypercolumn
369
inefficient
unbiased
week
642
reversible
ruppin
practically
shaded
successor
unpublished
ruderman
charles
cochleabreak
substantially
circle
330
cavity
natlcentre
distantplasma
arrivaloutline
pittsburgh
ples
889
henderson
368759
cancel
730
tonic
747
impulses
calibration
fat
localization
212
211
involved
divided
cues
workshop
arises
posed
automation
341
iri investigating
carnegie
option
gender
217
210dudadisadvantage
misclassified
disparity
periodic
history
218
baird
involve
american
conductances
markedintroducing
treated
modality
mead
lgn
nal
stationary
interacting
viola
virtually
proposal
phrasesadult
nonetheless
descending
implied
dsp
sta
getting
648
knudsen
formulae
thesis
pyramid
researcher
behavioral
387extensive
ran
mai
lewicki
keyword
888
fix
restrict
recognizing
tional
region
lines
rameter
playing
allocated
library
saddle
relating
simplex
illinois
laboratories
spreading
legendre
sentenceinside
robert
deal
formalism
file
serves
807
king
neurosci
increases
732
shifting
planner
ping
435
resppami
nmda schulten
coupled
887
abbott
mst
pineda
445
sebastian
carry
biological
436
illustrated
syllable
634
topic
caused
msec
319
straightforward
strategy
candidate
remainder
occurring
angles
verified
fire
topological
formulate
calculate ical
retinal
shiftedconjecture
filtering
modeled
jack
886
denotes
scopereliably
103
optimally
obtaining
jth
warmuth
inhibitory
mori
hope
content
intuitive
randomized
corollary
satisfactory
dietterich
win
437
backpropagation
eigen
april
newton
interactive
simd
drf
amplified
initially
dtw
407
energies
coefficient
east
fulldrug
exploration
manner
referencestrong
corporation
558
differ
generality
acousticallang
resembles
accumulate
subspaces
emphasizesynthetic
reinforce
light
688
438
662
fortunatelypendulum
display
theories
tributionresting
smolensky
autocorrelation
visited
resistor
saccade
backprop
appropriately
635
338
slopes
coordinates
tight
representational
labelled
773
krogh
operates
320
440
brightness
confirmed
autoencodercare
baum
compact
ordering
barlow
reflecting
convey compatible
seattle
trend
favor
denoting
bill
psychophysical
creatingdarpa
cleanreplicated
christopher
bump
canada
laboratory
ekf short
mutual
occurred
handwriting
rohwer
590
camera
running
waves
446
elimination
450
321
601
summerseung
horn
syllables
closer
spatiotemporal
exchange
667
603
schwartz
reducing
yielded
148
printed
opposedsimulating
datapoint
increment
strength
intersection
hypercube
constituent
segmented
cosine
play
khz
unchanged
endpoint
card
importantly
musical
hassibi
mediated
integrating
hough
comparatordom
focal
canal
fastest
compensated
clarity
cun
feasibility
chick
removal
games
monitoring
viewed
775
hillsdale
447gammon
upright
ibm
cerebellarefficacy
denmark
suggested
dramaticfij
continuity
march
sharply
disappear
772
mathematically602
jutten
master
abu
schematic
209cor
reaches
open
mountain
frey
involves
559
par
personal
560
winner
changing
382
symbolsymbolsymbolshuntingshuntingshunting
givebases
feedforwardfeedforward
centeredcenteredcentered
leibler
matches
temperaturetemperaturetemperature
committeecommitteecommittee
kind
proportionalproportionalproportional
158
85
776
transfer
sequentialsequentialsequential
strokes
834
bilinear
violatedviolatedviolatedviolated
worse
validatedvalidatedvalidated
123
cambridgecambridgecambridgecambridgecambridge
genericgenericgenericgeneric force
com
mozerpiecewise
partitioned
Figure 4: Estimated global hierarchical structure over the top 5000 NIPS keywords. Red nodes are introduced latent nodes. The blue dashed circle in the global structure marks the neighborhood of the word "training", which is shown zoomed in as the recovered latent tree.
Figure 5: Extended neighborhoods of some of the words estimated from the NIPS dataset. We run our algorithm on the top 5000 keywords to obtain the global and local hierarchical structures.
Figure 6: A local view of the recovered NY Times keyword structure.
D EXPERIMENTS ON HEALTHCARE DATASETS
Quantitative Analysis We evaluate our resulting hierarchy against a ground truth tree based on medical knowledge2. We compare our results with a baseline: agglomerative clustering. Evaluating the performance of tree structure recovery is nontrivial, as the number and locations of the hidden variables vary, as do the depths: two trees may be similar yet look different because their hierarchical clusterings are refined to different degrees. The standard Robinson-Foulds (RF) metric (Robinson and Foulds, 1981) between our estimated latent tree and the ground truth tree is computed to evaluate structure recovery in Table 3. RF is a well-defined metric for evaluating the difference between two tree structures; as with other distance metrics, a smaller RF indicates that the recovered structure is closer to the ground truth. The proposed method is slightly better than the baseline, and the advantage increases with more nodes. Moreover, our proposed method provides an efficient probabilistic graphical model that supports general inference, which is not feasible with the baseline.
Data     p    RF(agglo.)  RF(proposed)
MIMIC2   163  0.0061      0.0061
CMS      168  0.0060      0.0059
MIMIC2   952  0.0060      0.0011
Table 3: Robinson-Foulds (RF) metric against the "ground-truth" tree for both the MIMIC2 and CMS datasets. Our proposed results improve as the number of nodes increases.
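The RF computation can be sketched as follows. This is a minimal illustration that assumes each tree has already been summarized by its set of nontrivial splits (one frozenset of leaf labels per internal edge) and uses the common normalization by the total number of splits; the exact convention behind Table 3 is not restated here.

```python
def rf_distance(splits_a, splits_b):
    """Normalized Robinson-Foulds distance between two trees, each given
    as a set of splits (frozensets of leaf labels, one per internal edge)."""
    diff = len(splits_a ^ splits_b)          # splits present in only one tree
    total = len(splits_a) + len(splits_b)
    return diff / total if total else 0.0

# toy example: the two unrooted topologies on four leaves
t1 = {frozenset({"a", "b"})}   # the split ab|cd
t2 = {frozenset({"a", "c"})}   # the split ac|bd
print(rf_distance(t1, t1))     # identical trees -> 0.0
print(rf_distance(t1, t2))     # no shared splits -> 1.0
```

Extracting the splits from an actual tree object (e.g. by traversing each internal edge and collecting the leaves on one side) is omitted for brevity.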
Qualitative analysis Here we report the results for the 2-dimensional case (i.e., each observed variable is binary). In Figure 7 in appendix D.2, we show a portion of the tree learned from the MIMIC2 healthcare data. The yellow nodes are latent nodes from the learned subtrees, while the blue nodes represent observed nodes (diagnosis codes) in the original dataset. Similar diagnoses were generally grouped together; for example, many neoplastic diseases were grouped under the same latent node (node 1135). Where dissimilar diseases were grouped together, there usually exists a known or plausible association between the diseases in the clinical setting. For example, in Figure 7 in appendix D.2, clotting-related diseases and altered mental status were grouped under the same latent node as several neoplasms. This may reflect the fact that altered mental status and clotting conditions such as thrombophlebitis can occur as complications of neoplastic diseases (Falanga et al., 2003). The association of malignant neoplasms of the prostate and colon polyps, two common cancers in males, is captured under latent node 1136 (Group et al., 2014).
For both the MIMIC2 and CMS datasets, we performed a qualitative comparison of the resulting trees while varying the hidden dimension k for the algorithm. The resulting trees for different values of k did not exhibit significant differences, which implies that our algorithm is robust to the choice of hidden dimension. The estimated model parameters are likewise robust across different values of k.
Scalability Our algorithm is scalable w.r.t. varying characteristics of the input data. First, it can handle a large number of patients efficiently, as shown in Figure 9(a). It scales linearly as we vary the number of observed nodes, as shown in Figure 9(b). Furthermore, even when the number of observed variables is large, our method maintains an almost linear speed-up as we vary the available computational power, as shown in Figure 9(c). Thus, given the respective resources, our algorithm is practical under any variation of the input data characteristics.
D.1 Data Description
(1) MIMIC2: The MIMIC2 dataset records the disease history of 29,862 patients, with a total of 314,647 diagnostic events over time representing 5,675 diseases. We consider patients as samples and groups of diseases as variables. We analyze and compare the results by varying the group size (thereby varying d and p).
(2) CMS: The CMS dataset includes 1.6 million patients, for whom 15.8 million medical encounter events are logged. Across all events, 11,434 distinct diseases (represented by ICD codes) are logged. We consider patients as samples and groups of diseases as variables, with the specific diseases within each group as dimensions. We analyze and compare the results by varying the group size (thereby varying d and p). While the MIMIC2 and CMS datasets both contain logged diagnostic events, the larger volume of data in CMS provides an opportunity for testing
2The ground truth tree is the PheWAS hierarchy provided in the clinical study (Denny et al., 2010).
the algorithm’s scalability. We qualitatively evaluate the biological implications on MIMIC2, and quantitatively evaluate algorithm performance and scalability on CMS.
To learn the disease hierarchy from data, we also leverage some existing domain knowledge about diseases. In particular, we use an existing mapping between ICD codes and higher-level Phenome-wide Association Study (PheWAS) codes (Denny et al., 2010). We use (about 200) PheWAS codes as observed nodes; the observed node dimension is set to be binary (d = 2) or the maximum number of ICD codes within a PheWAS code (d = 31).
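As an illustration of this grouping, the sketch below builds the observed variables from a toy ICD-to-PheWAS mapping. The codes and group names are hypothetical, and the encoding conventions (0 for absent, otherwise the index of the matched ICD code within its group) are our assumption for illustration, not necessarily the paper's exact scheme.

```python
from collections import defaultdict

# hypothetical mapping from ICD codes to PheWAS groups
icd_to_phewas = {
    "185":   "prostate_cancer",
    "153.0": "colon_cancer",
    "211.3": "colon_cancer",
}

# collect the ICD codes under each PheWAS code, fixing an order
groups = defaultdict(list)
for icd, phe in sorted(icd_to_phewas.items()):
    groups[phe].append(icd)

def encode(patient_icds, binary=True):
    """One observed variable per PheWAS code.
    binary=True  -> d = 2 (present / absent)
    binary=False -> categorical: 0 = absent, i = 1-based index
                    of the first matching ICD code in the group."""
    obs = {}
    for phe, icds in groups.items():
        hits = [i + 1 for i, c in enumerate(icds) if c in patient_icds]
        obs[phe] = (1 if hits else 0) if binary else (hits[0] if hits else 0)
    return obs

print(encode({"185", "211.3"}))                 # binary encoding (d = 2)
print(encode({"185", "211.3"}, binary=False))   # categorical encoding
```

With about 200 PheWAS groups, this yields the p ≈ 200 observed nodes used above, with d ranging from 2 to the largest group size.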
D.2 Results
Case d = 31: We learn a tree from the MIMIC2 dataset, in which we grouped diseases into 163 PheWAS codes with up to 31 dimensions per variable. Figure 8 in appendix D.2 shows a portion of the learned tree: four subtrees which all reflect similar diseases relating to trauma. A majority of the learned subtrees reflected clinically meaningful concepts, in that related and commonly co-occurring diseases tended to group together in the same subtrees or in nearby subtrees.
We also learn the disease tree from the larger CMS dataset, in which we group diseases into 168 variables with up to 31 dimensions per variable. As with the MIMIC2 dataset, a majority of the learned subtrees reflected clinically meaningful concepts.
Figure 7: An example of two subtrees representing groups of similar diseases that may commonly co-occur. Nodes colored yellow are latent nodes from the learned subtrees.
E ADDITIVITY OF THE MULTIVARIATE INFORMATION DISTANCE
Recall that the additive information distance between two categorical variables xi and xj was defined in Choi et al. (2011). We extend the notion of information distance to high-dimensional variables via Definition 4.1 and present the proof of its additivity (Lemma 4.2) here.
Figure 8: An example of four subtrees representing groups of similar diseases that may commonly co-occur. Most variables in these subtrees are related to trauma.
[Figure 9 panels: (a) running time (seconds) vs. number of samples; (b) running time (seconds) vs. number of observed nodes; (c) speed-up factor vs. number of available threads, with the method's speed-up plotted against the ideal speed-up.]
Figure 9: (a) CMS dataset, sub-sampling w.r.t. a varying number of samples. (b) MIMIC2 dataset, sub-sampling w.r.t. a varying number of observed nodes; each observed node is binary (d = 2). (c) MIMIC2 dataset: scaling w.r.t. varying computational power, establishing the scalability of our method even in the large-p regime. The number of observed nodes is 1083 and each of them is binary (p = 1083, d = 2).
Proof. Consider three nodes $a, b, c$ such that there are edges between $a$ and $b$, and between $b$ and $c$. Let $A$ and $B$ be the matrices such that $\mathbb{E}[x_a \mid x_b] = A x_b$ and $\mathbb{E}[x_c \mid x_b] = B x_b$. Then

$$\mathbb{E}[x_a x_c^\top] = \mathbb{E}\big[\mathbb{E}[x_a x_c^\top \mid x_b]\big] = A\,\mathbb{E}[x_b x_b^\top]\,B^\top.$$

From Definition 4.1, assuming that $\mathbb{E}[x_a x_a^\top]$, $\mathbb{E}[x_b x_b^\top]$ and $\mathbb{E}[x_c x_c^\top]$ are full rank, we have

$$\mathrm{dist}(v_a, v_c) = -\log \frac{\prod_{i=1}^{k} \sigma_i\big(\mathbb{E}[x_a x_c^\top]\big)}{\sqrt{\det\big(\mathbb{E}[x_a x_a^\top]\big)\,\det\big(\mathbb{E}[x_c x_c^\top]\big)}},$$

so that

$$e^{-\mathrm{dist}(v_a, v_c)} = \det\Big(\mathbb{E}[x_a x_a^\top]^{-1/2}\, U^\top\, \mathbb{E}[x_a x_c^\top]\, V\, \mathbb{E}[x_c x_c^\top]^{-1/2}\Big),$$

where $k\text{-SVD}\big(\mathbb{E}[x_a x_c^\top]\big) = U \Sigma V^\top$. Similarly,

$$e^{-\mathrm{dist}(v_a, v_b)} = \det\Big(\mathbb{E}[x_a x_a^\top]^{-1/2}\, U^\top\, \mathbb{E}[x_a x_b^\top]\, W\, \mathbb{E}[x_b x_b^\top]^{-1/2}\Big),$$

$$e^{-\mathrm{dist}(v_b, v_c)} = \det\Big(\mathbb{E}[x_b x_b^\top]^{-1/2}\, W^\top\, \mathbb{E}[x_b x_c^\top]\, V\, \mathbb{E}[x_c x_c^\top]^{-1/2}\Big),$$

where $k\text{-SVD}\big(\mathbb{E}[x_a x_b^\top]\big) = U \Sigma W^\top$ and $k\text{-SVD}\big(\mathbb{E}[x_b x_c^\top]\big) = W \Sigma V^\top$. Therefore,

$$e^{-(\mathrm{dist}(v_a, v_b) + \mathrm{dist}(v_b, v_c))} = \det\Big(\mathbb{E}[x_a x_a^\top]^{-1/2}\, U^\top\, \mathbb{E}[x_a x_b^\top]\, \mathbb{E}[x_b x_b^\top]^{-1/2}\,\mathbb{E}[x_b x_b^\top]^{-1/2}\, \mathbb{E}[x_b x_c^\top]\, V\, \mathbb{E}[x_c x_c^\top]^{-1/2}\Big)$$

$$= \det\Big(\mathbb{E}[x_a x_a^\top]^{-1/2}\, U^\top\, A\,\mathbb{E}[x_b x_b^\top]\,B^\top\, V\, \mathbb{E}[x_c x_c^\top]^{-1/2}\Big) = e^{-\mathrm{dist}(v_a, v_c)}.$$

We conclude that the multivariate information distance is additive. Note that $\mathbb{E}[x_a x_b^\top] = \mathbb{E}\big[\mathbb{E}[x_a x_b^\top \mid x_b]\big] = \mathbb{E}[A x_b x_b^\top] = A\,\mathbb{E}[x_b x_b^\top]$, and similarly $\mathbb{E}[x_b x_c^\top] = \mathbb{E}[x_b x_b^\top]\,B^\top$.

We note that when the second moments are not full rank, the above distance can be extended as follows:

$$\mathrm{dist}(v_a, v_c) = -\log \frac{\prod_{i=1}^{k} \sigma_i\big(\mathbb{E}[x_a x_c^\top]\big)}{\sqrt{\prod_{i=1}^{k} \sigma_i\big(\mathbb{E}[x_a x_a^\top]\big)\, \prod_{i=1}^{k} \sigma_i\big(\mathbb{E}[x_c x_c^\top]\big)}}.$$
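The additivity in Lemma 4.2 can be checked numerically. The sketch below builds a toy chain $x_a \leftarrow x_b \rightarrow x_c$ over $k = 3$ states with one-hot encoded variables, forms the exact second-moment matrices, and verifies $\mathrm{dist}(v_a, v_c) = \mathrm{dist}(v_a, v_b) + \mathrm{dist}(v_b, v_c)$. The transition matrices are arbitrary full-rank choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 3

def col_stochastic(k):
    M = rng.random((k, k)) + 0.1
    return M / M.sum(axis=0, keepdims=True)

pi_b = np.full(k, 1.0 / k)       # marginal of the middle node x_b (one-hot coded)
T_ab = col_stochastic(k)         # P(x_a | x_b)
T_cb = col_stochastic(k)         # P(x_c | x_b)

E_bb = np.diag(pi_b)             # E[x_b x_b^T]
E_ab = T_ab @ E_bb               # E[x_a x_b^T] = A E[x_b x_b^T]
E_bc = E_bb @ T_cb.T             # E[x_b x_c^T] = E[x_b x_b^T] B^T
E_ac = T_ab @ E_bb @ T_cb.T      # E[x_a x_c^T] = A E[x_b x_b^T] B^T
E_aa = np.diag(T_ab @ pi_b)      # marginals of x_a and x_c
E_cc = np.diag(T_cb @ pi_b)

def dist(E_xy, E_xx, E_yy):
    """Multivariate information distance of Definition 4.1 (full-rank case)."""
    s = np.linalg.svd(E_xy, compute_uv=False)
    return -np.log(np.prod(s) / np.sqrt(np.linalg.det(E_xx) * np.linalg.det(E_yy)))

lhs = dist(E_ac, E_aa, E_cc)
rhs = dist(E_ab, E_aa, E_bb) + dist(E_bc, E_bb, E_cc)
print(abs(lhs - rhs) < 1e-10)    # additivity along the path a - b - c
```

Since the variables are conditionally independent given $x_b$, the exact moments factor as in the proof, and the two sides agree up to floating-point error.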
F LOCAL RECURSIVE GROUPING
The Local Recursive Grouping (LRG) algorithm is a local divide-and-conquer procedure for learning the structure and parameters of the latent tree (Algorithm 1). We perform recursive grouping simultaneously on the sub-trees of the MST, each of which consists of an internal node and its neighboring nodes. We keep track of the internal nodes of the MST and their neighbors. The resulting latent sub-trees after LRG can be merged easily to recover the final latent tree. Consider a pair of neighboring sub-trees in the MST: they have two common nodes, namely the internal nodes, which are neighbors on the MST. First, we identify the path from one internal node to the other in the trees to be merged and compute the multivariate information distances between the internal nodes and the introduced hidden nodes; we recover the path between the two internal nodes in the merged tree by inserting the hidden nodes close to their surrogate nodes. Second, we merge all the leaves which are not on this path by attaching them to their parents. Hence, the recursive grouping can be done in parallel, and we recover the latent tree structure via this merging method.
Lemma F.1. If an observable node vj is the surrogate node of a hidden node hi, then the hidden node hi can be
discovered using vj and the neighbors of vj in the MST.
This follows from the additive property of the multivariate information distance on the tree and the definition of a surrogate node. This observation is crucial for a completely local and parallel structure and parameter estimation. It is also easy to see that all internal nodes in the MST are surrogate nodes.
After the parallel construction of the MST, we consider all the internal nodes Xint. For vi ∈ Xint, we denote the neighborhood of vi on the MST by nbdsub(vi; MST), which is a small sub-tree. Note that the number of such sub-trees equals the number of internal nodes in the MST.
For any pair of sub-trees nbdsub(vi; MST) and nbdsub(vj; MST), there are two possible topological relationships: overlapping (i.e., the sub-trees share at least one node) and non-overlapping (i.e., the sub-trees share no nodes).
Since we define a neighborhood centered at vi as only its immediate neighbors and itself on the MST, the overlapping neighborhood pair nbdsub(vi; MST) and nbdsub(vj; MST) can have conflicting paths, namely path(vi, vj; N i) and path(vi, vj; N j), only if vi and vj are neighbors in the MST.
With this in mind, we locally estimate all the latent sub-trees, denoted N i, by applying Recursive Grouping (Choi et al., 2011) in parallel on nbdsub(vi; MST) for all vi ∈ Xint. Note that the latent nodes automatically introduced by RG(vi) have vi as their surrogate. We update the tree structure by joining each level in a bottom-up manner. The test of the relationship among nodes (Choi et al., 2011) uses the additive multivariate information distance metric (Appendix E), Φ(vi, vj; vk) = dist(vi, vk) − dist(vj, vk), to decide whether nodes vi and vj are parent-child or siblings. If they are siblings, they are joined under a hidden parent. If they are parent and child, the child node is placed at the lower level and the other node is added as its single parent, which is then joined in the next level.
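The relationship test above can be sketched as follows. This is a simplified, hypothetical rendering of the Recursive Grouping test of Choi et al. (2011) on exact distances: Φ(vi, vj; vk) is constant over the witness nodes vk exactly when vi and vj belong to the same family, and its value relative to dist(vi, vj) separates the parent-child cases from the sibling case.

```python
def classify_pair(i, j, witnesses, dist, tol=1e-9):
    """Recursive Grouping test on exact pairwise distances.
    dist[(x, y)] holds the additive information distance between x and y."""
    d = lambda x, y: dist.get((x, y), dist.get((y, x)))
    phis = [d(i, k) - d(j, k) for k in witnesses]
    if max(phis) - min(phis) > tol:
        return "not in the same family"   # Phi varies with the witness k
    phi = phis[0]
    if abs(phi - d(i, j)) < tol:
        return "j is parent of i"         # every path to i passes through j
    if abs(phi + d(i, j)) < tol:
        return "i is parent of j"
    return "siblings"                     # join i and j under a new hidden parent

# three leaves hanging off one hidden node with edge lengths 1, 2, 3
dist_star = {("a", "b"): 3.0, ("a", "c"): 4.0, ("b", "c"): 5.0}
print(classify_pair("a", "b", ["c"], dist_star))   # siblings

# a path i - j - k, so j lies between i and the witness k
dist_path = {("i", "j"): 1.0, ("i", "k"): 3.0, ("j", "k"): 2.0}
print(classify_pair("i", "j", ["k"], dist_path))   # j is parent of i
```

In the star example Φ = 4 − 5 = −1 with |Φ| < dist(a, b) = 3, so a and b are siblings; in the path example Φ = 3 − 2 = 1 = dist(i, j), so j is the parent of i.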
Finally, for each internal edge of the MST connecting two internal nodes vi and vj, we consider merging the corresponding latent sub-trees. Using the example of two locally estimated latent sub-trees in Figure 1, we illustrate the complete local merging algorithm that we propose.
G PROOF SKETCH FOR THEOREM 7.1
We argue for the correctness of the method under exact moments. The sample complexity follows from previous works. In order to clarify the proof ideas, we define the notion of a surrogate node ((Choi et al., 2011)) as follows.
Definition G.1. The surrogate node for a hidden node hi on the latent tree T = (V, E) is defined as Sg(hi; T) := arg min_{vj∈X} dist(hi, vj).
In other words, the surrogate for a hidden node is the observable node with the minimum multivariate information distance from that hidden node. In Figure 1(a), the surrogate node of h1 is Sg(h1; T) = v3, and Sg(h2; T) = Sg(h3; T) = v5. Note that the notion of the surrogate node is only required for the analysis; our algorithm does not need to know this information.
The notion of surrogacy allows us to relate the constructed MST (over observed nodes) to the underlying latent tree. It can easily be shown that contracting the hidden nodes to their surrogates on the latent tree yields the MST. The local recursive grouping procedure can be viewed as reversing these contractions, and hence we obtain consistent local sub-trees.
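The contraction step can be made concrete with a small sketch: map every endpoint of each latent-tree edge to its surrogate and drop self-loops. The toy tree, surrogate map, and function name below are illustrative, not from the paper's implementation.

```python
def contract(edges, sg):
    """Map each endpoint to its surrogate (observed nodes map to themselves)
    and drop the resulting self-loops."""
    out = set()
    for u, v in edges:
        a, b = sg.get(u, u), sg.get(v, v)
        if a != b:
            out.add(tuple(sorted((a, b))))
    return out

# Toy latent tree: hidden h1 joins observed v1, v2; hidden h2 joins v3, v4;
# h1 and h2 are adjacent. Surrogates: Sg(h1) = v1, Sg(h2) = v3.
latent_edges = [("h1", "v1"), ("h1", "v2"), ("h1", "h2"),
                ("h2", "v3"), ("h2", "v4")]
sg = {"h1": "v1", "h2": "v3"}
print(sorted(contract(latent_edges, sg)))
# [('v1', 'v2'), ('v1', 'v3'), ('v3', 'v4')] -- a tree over observed nodes
```

The contracted edge set is a spanning tree over the observed nodes, which is exactly the MST relation used in the analysis.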
We now argue the correctness of the structure union procedure, which merges the local sub-trees. In each reconstructed sub-tree N i, where vi is the group leader, the discovered hidden nodes {hi} form a surrogate relationship with vi, i.e., Sg(hi; T) = vi. Our merging approach maintains these surrogate relationships. For example, in Figure 1(d1,d2), we have the path v3 − h1 − v5 in N 3 and the path v3 − h3 − h2 − v5 in N 5. The resulting path is v3 − h1 − h3 − h2 − v5, as seen in Figure 1(e). We now argue why this is correct. As discussed before, Sg(h1; T) = v3 and Sg(h2; T) = Sg(h3; T) = v5. When we merge the two sub-trees, we preserve the paths from the group leaders to the added hidden nodes, which ensures that the surrogate relationships are preserved in the resulting merged tree. Thus, we obtain a globally consistent tree structure by merging the local structures. The correctness of parameter learning follows from the consistency of the tensor decomposition techniques and the careful alignment of hidden labels across different decompositions. Refer to Appendices H and K for proof details and the sample complexity.
H PROOF OF CORRECTNESS FOR LRG
Definition H.1. A latent tree T≥3 is defined to be a minimal (or identifiable) latent tree if each latent variable has at least 3 neighbors.
Definition H.2. The surrogate node for a hidden node hi in the latent tree T = (V, E) is defined as Sg(hi; T) := arg min_{vj∈X} dist(hi, vj).
There are some useful observations about the MST in Choi et al. ((2011)) which we recall here.
Property H.3 (MST surrogate neighborhood preservation). The surrogate nodes of any two neighboring nodes in E are also neighbors in the MST, i.e., (hi, hj) ∈ E ⇒ (Sg(hi), Sg(hj)) ∈ MST.
Property H.4 (MST surrogate consistency along paths). If vj ∈ X and vh ∈ Sg−1(vj), then every node along the path connecting vj and vh belongs to the inverse surrogate set Sg−1(vj), i.e., vi ∈ Sg−1(vj) for all vi ∈ Path(vj, vh).
These MST properties connect the MST over the observable nodes with the original latent tree T: we obtain the MST by contracting every latent node to its surrogate node.
Given that the correctness of the CLRG algorithm is proved in Choi et al. ((2011)), we prove the equivalence between CLRG and PLRG.
Lemma H.5. For any pair of sub-trees nbd[vi;MST] and nbd[vj;MST], there is at most one overlapping edge. The overlapping edge exists if and only if vi ∈ nbd(vj;MST).
This is easy to see: each neighborhood contains only its center and the center's immediate neighbors, so two neighborhoods can share an edge only when their centers are adjacent, in which case the shared edge is (vi, vj).
Lemma H.6. Denote the latent tree recovered from nbd[vi;MST] as N i, and similarly for nbd[vj;MST]. The inconsistency, if any, between N i and N j occurs in the overlapping paths path(vi, vj;N i) and path(vi, vj;N j) after the LRG implementation on each sub-tree.
We now prove the correctness of LRG. Let us denote the latent tree resulting from merging a subset of small latent trees as TLRG(S), where S is the set of centers of the sub-trees that are merged pairwise. The CLRG algorithm in Choi et al. ((2011)) implements RG in a serial manner. Let us denote the latent tree learned by CLRG at iteration i as TCLRG(S), where S is the set of internal nodes visited by CLRG up to the current iteration. We prove the correctness of LRG by induction on the iterations.
At the initial step, S = ∅: TCLRG = MST and TLRG = MST, thus TCLRG = TLRG.
Now assume that, for the same sets Sr, TCLRG = TLRG holds for r = 1, . . . , i − 1. At iteration r = i, CLRG employs RG on the immediate neighborhood of node vi on TCLRG(Si−1); let Hi be the set of hidden nodes that are immediate neighbors of vi. The CLRG algorithm thus considers all these neighbors and implements RG. We know that the surrogate nodes of every latent node in Hi belong to the previously visited nodes Si−1. According to Properties H.3 and H.4, if we contract all the hidden-node neighbors to their surrogate nodes, CLRG at this step is an RG on the neighborhood of vi on the MST.
As for our LRG algorithm at this step, TLRG(Si) is the merge of TLRG(Si−1) and N i. The latent nodes whose surrogate node is vi are introduced on the edges incident to vi. We know that N i is the RG output on the immediate neighborhood of vi on the MST. Therefore, TCLRG(Si) = TLRG(Si).
I CROSS GROUP ALIGNMENT CORRECTION
In order to achieve cross-group alignment, tensor decompositions on two cross-group triplets have to be computed. The first triplet is formed by three nodes: the reference node in group 1, x1; a non-reference node in group 1, x2; and the reference node in group 2, x3. The second triplet is likewise formed by three nodes: the reference node in group 2, x3; a non-reference node in group 2, x4; and the reference node in group 1, x1. Let h1 denote the parent node in group 1, and h2 the parent node in group 2.
From Trip(x1, x2, x3), we obtain P(h1|x1) = A, P(x2|h1) = B and P(x3|h1) = P(x3|h2)P(h2|h1) = DE. From Trip(x3, x4, x1), we obtain P(x3|h2) = DΠ, P(x4|h2) = CΠ and P(h2|x1) = P(h2|h1)P(h1|x1) = ΠEA, where Π is a permutation matrix. We compute Π as Π = (ΠEA)(A)†(DE)†(DΠ), so that D = (DΠ)Π† is aligned with group 1. Once all the parameters in group 2 are permuted by Π, the alignment is complete.
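The realignment step (DΠ)Π† can be sketched numerically. Below is a hedged numpy sketch: it recovers the column permutation by nearest-column matching, an illustrative stand-in for the pseudo-inverse formula in the text, and then unpermutes group 2's estimate. The function name and toy matrices are assumptions for illustration.

```python
import numpy as np

def align_columns(ref, perm_est):
    """Recover Pi with perm_est = ref @ Pi by nearest-column matching."""
    k = ref.shape[1]
    Pi = np.zeros((k, k))
    for j in range(k):
        # column j of perm_est should coincide with some column i of ref
        i = int(np.argmin(np.linalg.norm(ref - perm_est[:, [j]], axis=0)))
        Pi[i, j] = 1.0
    return Pi

rng = np.random.default_rng(0)
D = rng.random((6, 3))               # stand-in for group 1's estimate of P(x3|h2)
Pi_true = np.eye(3)[:, [2, 0, 1]]    # unknown hidden-label permutation
D_perm = D @ Pi_true                 # group 2's column-permuted estimate, D.Pi
Pi = align_columns(D, D_perm)
D_aligned = D_perm @ Pi.T            # (D.Pi).Pi^T, i.e. (DPi)Pi-dagger
assert np.allclose(D_aligned, D)     # group 2 is now aligned with group 1
```

For a permutation matrix, Π† = Πᵀ, so right-multiplying the permuted estimate by Πᵀ restores group 1's hidden-label ordering, which is the property the merging step relies on.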
Similarly, the alignment correction can be done by calculating the permutation matrices while merging different threads.
Overall, we merge the local structures and align the parameters from the LRG local sub-trees using Procedures 2 and 3.
J COMPUTATIONAL COMPLEXITY
We recall some notation here: d is the observable node dimension, k is the hidden node dimension (k ≪ d), N is the number of samples, p is the number of observable nodes, and z is the number of non-zero elements in each sample.
Multivariate information distance estimation involves sparse matrix multiplications to compute the pairwise second moments. Each observable node has a d × N sample matrix with z non-zeros per column. Computing the product x1 x2^T from a single sample for nodes 1 and 2 requires O(z) time, and there are N such sample-pair products, leading to O(Nz) time. There are O(p^2) node pairs and hence the degree of parallelism is O(p^2). Next, we perform the rank-k SVD of each of these matrices. Each SVD takes O(d^2 k) time using classical methods. Using randomized methods ((Gittens and Mahoney, 2013b)), this can be improved to O(d + k^3).
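The sparse second-moment accumulation can be sketched as follows; the (index, value) sample representation and function name are illustrative, not the paper's C++ implementation.

```python
import numpy as np

def second_moment(samples1, samples2, d):
    """Empirical E[x1 x2^T]; each sample is a list of (index, value) pairs."""
    M = np.zeros((d, d))
    for s1, s2 in zip(samples1, samples2):
        for i, a in s1:          # the nested loop touches only non-zero
            for j, b in s2:      # coordinates, never the full d x d grid
                M[i, j] += a * b
    return M / len(samples1)

# Two paired samples for a pair of observable nodes, z = 1 non-zero each.
x1 = [[(0, 1.0)], [(1, 2.0)]]
x2 = [[(1, 3.0)], [(1, 1.0)]]
M = second_moment(x1, x2, d=3)
print(M[0, 1], M[1, 1])          # 1.5 1.0
```

Since each pair of samples contributes only products of its non-zero entries, the work per pair is governed by the sparsity z rather than the ambient dimension d.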
Next, we construct the MST in O(log p) time per worker with p^2 workers. The structure learning can be done in O(Γ^3) time per sub-tree, and the local neighborhood of each node can be processed completely in parallel. We assume that the group sizes Γ are constant (the sizes are determined by the degree of nodes in the latent tree and the homogeneity of parameters across different edges of the tree). The parameter estimation for each triplet of nodes consists of implicit stochastic updates involving products of k × k and d × k matrices. Note that we do not need to consider all possible triplets within groups, but each node must be covered by a triplet, and hence there are O(p) triplets. This leads to O(Γk^3 + Γdk^2) time per worker with p/Γ degree of parallelism.
Finally, the merging step consists of products of k × k and d × k matrices for each edge in the latent tree, leading to O(dk^2) time per worker with p/Γ degree of parallelism.
K SAMPLE COMPLEXITY
From Theorem 11 of Choi et al. ((2011)), we recall the number of samples required for the recovery of a tree structure that is consistent with the ground truth (for a precise definition of consistency, refer to Definition 2 of Choi et al. ((2011))).
From Anandkumar et al. ((2012a)), we recall the sample complexity for the faithful recovery of parameters via tensordecomposition methods.
We define ϵP to be the error between the empirical and exact second order moments, and ϵT to be the error between the empirical and exact third order moments.
Lemma K.1. There exist positive constants C, C′, c and c′ such that the following holds. If

ϵP ≤ c λk/(λ1 k), ϵT ≤ c′ λk σk^{3/2}/k,

N ≥ C ( log(k) + log log( λ1 σk^{3/2}/ϵT + 1/ϵP ) ),

L ≥ poly(k) log(1/δ),

then with probability at least 1 − δ, the tensor decomposition returns (v̂i, λ̂i) : i ∈ [k] satisfying, after appropriate reordering,

∥v̂i − vi∥2 ≤ C′ [ (1/(λi σk^2)) ϵT + (λ1/λi)(1/√σk + 1)^2 ϵP ],

|λ̂i − λi| ≤ C′ ( ϵT/σk^{3/2} + λ1 ϵP ),

for all i ∈ [k].
We note that σ1 ≥ σ2 ≥ . . . ≥ σk > 0 are the non-zero singular values of the second order moments, λ1 ≥ λ2 ≥ . . . ≥ λk > 0 are the ground-truth eigenvalues of the third order moments, and vi are the corresponding eigenvectors for all i ∈ [k].
L EFFICIENT SVD USING SPARSITY AND DIMENSIONALITY REDUCTION
Without loss of generality, we assume that the matrix whose SVD we aim to compute has no row or column that is entirely zero; if it does, such rows and columns can be dropped.
Let A ∈ R^{d×d} be the matrix whose SVD we compute. Let Φ ∈ R^{d×k′}, where k′ = αk and α is a scalar, usually in the range [2, 3]. Each row of Φ contains exactly one non-zero entry, whose column is chosen uniformly at random from [k′]; since A has no all-zero rows or columns (see above), no row of Φ is left blank. Let D ∈ R^{d×d} be a diagonal matrix with i.i.d. Rademacher entries, i.e., each diagonal entry is 1 or −1 with probability 1/2. Our embedding matrix ((Clarkson and Woodruff, 2013)) is S = DΦ, i.e., we compute AS and then proceed with the Nyström method ((Huang et al., 2013)). Unlike the usual Nyström method ((Gittens and Mahoney, 2013a)), which uses a dense random matrix to compute the embedding, we use a sparse embedding matrix, since sparsity improves the running time and memory requirements of the algorithm.
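The embedding S = DΦ can be sketched in a few lines of numpy; α = 2, the seed, and the toy matrix are illustrative choices, and the sketch omits the handling of degenerate rows discussed above.

```python
import numpy as np

def sparse_embedding(d, k, alpha=2, seed=0):
    """Build S = D @ Phi: one +/-1 per row, in a uniformly random column."""
    rng = np.random.default_rng(seed)
    kp = alpha * k                           # embedding dimension k' = alpha*k
    Phi = np.zeros((d, kp))
    Phi[np.arange(d), rng.integers(0, kp, size=d)] = 1.0
    signs = rng.choice([-1.0, 1.0], size=d)  # diagonal of D (Rademacher)
    return signs[:, None] * Phi              # S = D @ Phi

A = np.arange(36, dtype=float).reshape(6, 6)
S = sparse_embedding(d=6, k=2)
AS = A @ S                                   # input to the Nystrom step
print(AS.shape)                              # (6, 4)
```

Because S has a single non-zero per row, forming AS touches each non-zero of A exactly once, which is the source of the running-time and memory savings over a dense random embedding.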