
Appendix: Guaranteed Scalable Learning of Latent Tree Models

A RELATED WORK

There has been widespread interest in developing distributed learning techniques, e.g. the recent works of Smola and Narayanamurthy (2010) and Wei et al. (2013). These works consider parameter estimation via likelihood-based optimizations such as Gibbs sampling, while our method involves more challenging tasks where both the structure and the parameters are estimated. Simple methods such as local neighborhood selection through ℓ1-regularization (Meinshausen and Bühlmann, 2006) or local conditional independence testing (Anandkumar et al., 2012c) can be parallelized, but these methods do not incorporate hidden variables. Finally, note that latent tree models provide a statistical description in addition to revealing the hierarchy. In contrast, hierarchical clustering techniques are not based on a statistical model (Krishnamurthy et al., 2012) and cannot provide valuable information such as the level of correlation between observed and hidden variables.

Both Chang and Hartigan (1991) and Chang (1996) provide detailed identifiability analyses of latent trees, and the former (Chang and Hartigan, 1991) introduces a maximum likelihood estimation method, which differs from our guaranteed parallel spectral method based on tensor decomposition.

Parikh et al. (2011) and Song et al. (2011, 2014) are closely related to our paper. Parikh et al. (2011) do not address structure learning, and their sample complexity is polynomial in k, whereas ours is logarithmic in k. Song et al. (2014) provide guarantees for structure learning, but not for parameter estimation. Song et al. (2011) provide no sample complexity theorem or analysis for recovering the latent tree structure or parameters with provable guarantees.

B SYNTHETIC EXPERIMENTS

B.1 Machine Setup

Experiments are conducted on a server running Red Hat Enterprise Linux 6.6 with 64 AMD Opteron processors and 265 GB of RAM. The program is written in C++, coupled with the multi-threading capabilities of the OpenMP environment (Dagum and Menon, 1998) (version 1.8.1). We use the Eigen toolkit^1, which incorporates BLAS operations. For SVDs of large matrices, we use randomized projection methods (Gittens and Mahoney, 2013b) as described in Appendix L.
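
The randomized projection idea can be summarized by a range-finder sketch in the style of Halko et al.; the following is a minimal C++/Eigen illustration, not the implementation from Appendix L. The function name, the oversampling amount of 10, and the use of Eigen's uniform Random() in place of a Gaussian test matrix are our assumptions.

```cpp
#include <Eigen/Dense>

// Approximate rank-k SVD, A ≈ U * S.asDiagonal() * V^T, via a randomized
// range finder: sketch the column space of A, then take the SVD of a
// much smaller projected matrix. Assumes k + 10 <= min(rows, cols).
void randomized_svd(const Eigen::MatrixXd& A, int k,
                    Eigen::MatrixXd& U, Eigen::VectorXd& S,
                    Eigen::MatrixXd& V) {
  const int l = k + 10;  // oversampled sketch size (assumption)
  // Uniform random test matrix; the literature typically uses Gaussian.
  Eigen::MatrixXd Omega = Eigen::MatrixXd::Random(A.cols(), l);
  Eigen::MatrixXd Y = A * Omega;  // samples the range of A
  Eigen::HouseholderQR<Eigen::MatrixXd> qr(Y);
  Eigen::MatrixXd Q =
      qr.householderQ() * Eigen::MatrixXd::Identity(A.rows(), l);
  Eigen::MatrixXd B = Q.transpose() * A;  // small l-by-n matrix
  Eigen::JacobiSVD<Eigen::MatrixXd> svd(
      B, Eigen::ComputeThinU | Eigen::ComputeThinV);
  U = (Q * svd.matrixU()).leftCols(k);  // lift back to the original space
  S = svd.singularValues().head(k);
  V = svd.matrixV().leftCols(k);
}
```

The payoff is that the expensive dense SVD is applied only to the small l-by-n matrix B rather than to A itself; the products A·Omega and Qᵀ·A are BLAS-3 operations that parallelize well under OpenMP.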

B.2 Synthetic Results

We compare our method with the implementation of Choi et al. (2011), where a serial learning procedure is carried out for binary variables (d = 2) and parameter learning is carried out via EM. We consider a latent tree model over nine observed variables and four hidden nodes. We restart EM 20 times and select the best result. The results are in Figure 3.

We measure the structure recovery error via $\frac{1}{|\widehat{G}|} \sum_{i=1}^{|\widehat{G}|} \min_j |\widehat{G}_i \setminus G_j| / |\widehat{G}_i|$, where $G$ and $\widehat{G}$ are the ground-truth and recovered categories. We measure the parameter recovery error via $E = \|\widehat{A} - A\Pi\|_F$, where $A$ is the true parameter, $\widehat{A}$ is the estimated parameter, and $\Pi$ is a permutation matrix that aligns the columns of $\widehat{A}$ with those of $A$ so that their distance is minimized; $\Pi$ is calculated greedily. As the number of samples increases, both methods recover the structure correctly, as predicted by the theory. However, EM gets stuck in local optima and fails to recover the true parameters, while the tensor decomposition recovers them correctly.
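
For concreteness, the two error measures above can be computed as in the following minimal C++ sketch (our own illustration, not the paper's code; the function names and the nearest-unused-column rule for the greedy permutation are assumptions consistent with the description above).

```cpp
#include <Eigen/Dense>
#include <algorithm>
#include <iterator>
#include <limits>
#include <set>
#include <vector>

// Structure recovery error: for each recovered group Ghat_i, take the
// smallest normalized set difference |Ghat_i \ G_j| / |Ghat_i| over the
// ground-truth groups G_j, then average. Assumes non-empty inputs.
double structure_error(const std::vector<std::set<int>>& Ghat,
                       const std::vector<std::set<int>>& G) {
  double total = 0.0;
  for (const auto& gi : Ghat) {
    double best = 1.0;
    for (const auto& gj : G) {
      std::vector<int> diff;
      std::set_difference(gi.begin(), gi.end(), gj.begin(), gj.end(),
                          std::back_inserter(diff));
      best = std::min(best, diff.size() / static_cast<double>(gi.size()));
    }
    total += best;
  }
  return total / Ghat.size();
}

// Parameter recovery error E = ||Ahat - A*Pi||_F, with Pi built greedily:
// each column of Ahat is matched to its nearest still-unused column of A.
// Assumes Ahat and A have the same shape.
double parameter_error(const Eigen::MatrixXd& Ahat, const Eigen::MatrixXd& A) {
  const int k = static_cast<int>(A.cols());
  Eigen::MatrixXd APi(A.rows(), k);
  std::vector<bool> used(k, false);
  for (int j = 0; j < k; ++j) {
    int best = -1;
    double bestDist = std::numeric_limits<double>::infinity();
    for (int c = 0; c < k; ++c) {
      if (used[c]) continue;
      const double dist = (Ahat.col(j) - A.col(c)).norm();
      if (dist < bestDist) { bestDist = dist; best = c; }
    }
    used[best] = true;
    APi.col(j) = A.col(best);
  }
  return (Ahat - APi).norm();  // .norm() on a matrix is the Frobenius norm
}
```

A greedy matching is not guaranteed to find the globally optimal permutation (the Hungarian algorithm would), but it is cheap and matches the greedy calculation described above.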

We then present our experimental results focusing on the large p and even larger d regime; see Table 2. We achieve efficient running times with good accuracy for structure and parameter estimation.

We perform experiments on a synthetic tree with observable dimension d = 1,000,000, number of observable nodes p = 3, hidden dimension k = 3, and number of samples N = 1000, which is similar to the community detection setting of Huang et al. (2013).

^1 http://eigen.tuxfamily.org/index.php?title=Main_Page


We note that the recovery of the structure and the parameters completes in 10.6 seconds.

C EXPERIMENTS ON THE NIPS AND NY TIMES DATASETS

We use the entire NIPS dataset from the UCI bag-of-words repository, which consists of 1,500 documents and 12,419 words in the vocabulary. We estimate the hierarchical structure of the whole corpus in only 15,332.4 seconds (about 4 hours). Additionally, we focus on the 5,000 most frequent keywords and illustrate the local structures in Figure 5. The running time for this subset is only 1,164.4 seconds (about 20 minutes).

We also use the NY Times dataset from the UCI bag-of-words repository, which consists of 300,000 documents and 102,660 words in the vocabulary. Our algorithm estimates the hierarchical structure of the 3,000 most frequent keywords in the NY Times dataset in only 107.8 seconds. Note that d = 2, since we consider the occurrence of a word in a document as a binary variable. A subset of the estimated keyword graph is shown in Figure 6. We note that the relationships among the words in Figure 6 match intuition: for example, govern and secur are grouped together, while movi, studio, and produc form another group. The numbers represent hidden nodes.
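
The binarization just described is a one-line preprocessing step; a minimal sketch (our illustration, with a hypothetical function name) is:

```cpp
#include <Eigen/Dense>

// Map a documents-by-words count matrix to binary occurrence variables:
// entry (i, j) becomes 1 iff word j appears in document i, so each
// observed node takes d = 2 states.
Eigen::MatrixXd binarize(const Eigen::MatrixXd& counts) {
  return (counts.array() > 0.0).cast<double>().matrix();
}
```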

[Figure 4 shows two panels: the recovered latent tree over the NIPS keywords ("Recovered Latent Tree") and the neighborhood of the word "training" marked by the blue dashed circle; the individual keyword node labels from the figure are omitted here.]

Figure 4: Estimated global hierarchical structure over the top 5000 NIPS keywords. Red nodes are the introduced latent nodes. The blue dashed circle in the global hierarchical structure, the neighborhood of the word "training", is shown zoomed in.



Figure 5: Extended neighborhoods of some of the words estimated from the NIPS dataset. We run our algorithm on the top 5000 keywords to estimate the global and local hierarchical structures.


Figure 6: A local view of the recovered NY Times keyword structure.


D EXPERIMENTS ON HEALTHCARE DATASETS

Quantitative Analysis We evaluate our resulting hierarchy against a ground-truth tree based on medical knowledge². We compare our results with a baseline, agglomerative clustering. Evaluating the performance of tree structure recovery is nontrivial, as the number, location, and depth of the hidden variables all vary: two trees may be similar yet look different simply because their hierarchical clusterings differ in how refined they are. The standard Robinson-Foulds (RF) metric ((Robinson and Foulds, 1981)) between our estimated latent tree and the ground-truth tree is computed to evaluate structure recovery in Table 3. RF is a well-defined metric for the difference between two tree structures; as with other distance metrics, a smaller RF indicates a recovered structure closer to the ground truth. The proposed method is slightly better than the baseline, and the advantage increases with more nodes. Moreover, our proposed method provides an efficient probabilistic graphical model that can support general inference, which is not feasible with the baseline.

Data     p    RF (agglo.)   RF (proposed)
MIMIC2   163  0.0061        0.0061
CMS      168  0.0060        0.0059
MIMIC2   952  0.0060        0.0011

Table 3: Robinson-Foulds (RF) metric against the "ground-truth" tree for the MIMIC2 and CMS datasets. Our proposed method improves as the number of nodes increases.
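For concreteness, the RF computation can be sketched as follows. This is our illustrative C++ (the language of our experiments), not the evaluation code used for Table 3, and it assumes the leaf-set bipartitions induced by each tree's internal edges have already been extracted.

```cpp
#include <cstddef>
#include <set>

// Sketch of the Robinson-Foulds (RF) distance ((Robinson and Foulds, 1981)).
// Each internal edge of a tree induces a bipartition of the leaf set; a
// bipartition is represented here by the leaf ids of one of its sides.
double robinsonFoulds(const std::set<std::set<int>>& splitsA,
                      const std::set<std::set<int>>& splitsB) {
  std::size_t common = 0;
  for (const auto& s : splitsA)
    if (splitsB.count(s) > 0) ++common;
  // Size of the symmetric difference, normalized by the total split count
  // (one common normalization; smaller means closer to the ground truth).
  double diff = static_cast<double>(splitsA.size() - common) +
                static_cast<double>(splitsB.size() - common);
  return diff / static_cast<double>(splitsA.size() + splitsB.size());
}
```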

Qualitative analysis Here we report the results for the 2-dimensional case (i.e., each observed variable is binary). In Figure 7 in appendix D.2, we show a portion of the learned tree using the MIMIC2 healthcare data. The yellow nodes are latent nodes from the learned subtrees, while the blue nodes represent observed nodes (diagnosis codes) in the original dataset. Diagnoses that are similar were generally grouped together. For example, many neoplastic diseases were grouped under the same latent node (node 1135). While some dissimilar diseases were grouped together, there usually exists a known or plausible association of the diseases in the clinical setting. For example, in Figure 7 in appendix D.2, clotting-related diseases and altered mental status were grouped under the same latent node as several neoplasms. This may reflect the fact that altered mental status and clotting conditions such as thrombophlebitis can occur as complications of neoplastic diseases ((Falanga et al., 2003)). The association of malignant neoplasms of the prostate and colon polyps, two common cancers in males, is captured under latent node 1136 ((Group et al., 2014)).

For both the MIMIC2 and CMS datasets, we performed a qualitative comparison of the resulting trees while varying the hidden dimension k for the algorithm. The resulting trees for different values of k did not exhibit significant differences, which indicates that our algorithm is robust to the choice of hidden dimension. Based on the results, the estimated model parameters are likewise robust across different values of k.

Scalability Our algorithm is scalable with respect to varying characteristics of the input data. First, it handles a large number of patients efficiently, as shown in Figure 9(a). It scales linearly as we vary the number of observed nodes, as shown in Figure 9(b). Furthermore, even when the number of observed variables is large, our method maintains an almost linear speed-up as we vary the available computational power, as shown in Figure 9(c). Given the respective resources, our algorithm therefore remains practical under all of these variations of the input data characteristics.

D.1 Data Description

(1) MIMIC2: The MIMIC2 dataset records the disease histories of 29,862 patients, for whom a total of 314,647 diagnostic events over time, representing 5,675 distinct diseases, are logged. We consider patients as samples and groups of diseases as variables. We analyze and compare the results by varying the group size (and therefore varying d and p).

(2) CMS: The CMS dataset includes 1.6 million patients, for whom 15.8 million medical encounter events are logged. Across all events, 11,434 distinct diseases (represented by ICD codes) are logged. We consider patients as samples and groups of diseases as variables, with the specific diseases within each group as dimensions. We analyze and compare the results by varying the group size (and therefore varying d and p). While the MIMIC2 and CMS datasets both contain logged diagnostic events, the larger volume of data in CMS provides an opportunity for testing the algorithm's scalability. We qualitatively evaluate biological implications on MIMIC2 and quantitatively evaluate algorithm performance and scalability on CMS.

² The ground-truth tree is the PheWAS hierarchy provided in the clinical study ((Denny et al., 2010)).

To learn the disease hierarchy from data, we also leverage some existing domain knowledge about diseases. In particular, we use an existing mapping between ICD codes and higher-level Phenome-wide Association Study (PheWAS) codes ((Denny et al., 2010)). We use (about 200) PheWAS codes as observed nodes, and the observed node dimension is set to be either binary (d = 2) or the maximum number of ICD codes within a PheWAS code (d = 31).

D.2 Results

Case d = 31: We learn a tree from the MIMIC2 dataset, in which we grouped diseases into 163 PheWAS codes with up to 31 dimensions per variable. Figure 8 in appendix D.2 shows a portion of the learned tree: four subtrees which all reflect similar diseases relating to trauma. A majority of the learned subtrees reflected clinically meaningful concepts, in that related and commonly co-occurring diseases tended to group together in the same subtree or in nearby subtrees.

We also learn the disease tree from the larger CMS dataset, in which we group diseases into 168 variables with up to 31 dimensions per variable. As with the MIMIC2 dataset, a majority of the learned subtrees reflected clinically meaningful concepts.

Figure 7: An example of two subtrees which represent groups of similar diseases that may commonly co-occur. Nodes colored yellow are latent nodes from the learned subtrees.

E ADDITIVITY OF THE MULTIVARIATE INFORMATION DISTANCE

Recall that the additive information distance between two categorical variables xi and xj was defined in Choi et al. ((2011)). We extend this notion of information distance to high-dimensional variables via Definition 4.1 and present the proof of its additivity (Lemma 4.2) here.


Figure 8: An example of four subtrees which represent groups of similar diseases that may commonly co-occur. Most variables in these subtrees are related to trauma.

[Figure 9 plots omitted: (a) running time (seconds) vs. number of samples; (b) running time (seconds) vs. number of observed nodes; (c) speed-up factor vs. number of available threads, with the method's speed-up shown against the ideal speed-up.]

Figure 9: (a) CMS dataset, sub-sampled with a varying number of samples. (b) MIMIC2 dataset, sub-sampled with a varying number of observed nodes; each observed node is binary (d = 2). (c) MIMIC2 dataset: scaling with varying computational power, establishing the scalability of our method even in the large-p regime; the number of observed nodes is 1083 and each is binary (p = 1083, d = 2).

Proof. Consider three nodes a, b, c such that there are edges between a and b, and between b and c. Let A and B be the conditional-mean matrices defined by \(E[x_a \mid x_b] = A x_b\) and \(E[x_c \mid x_b] = B x_b\). By the tower property,

\[ E[x_a x_c^\top] = E\big[ E[x_a x_c^\top \mid x_b] \big] = A\, E[x_b x_b^\top]\, B^\top. \]

From Definition 4.1, assuming that \(E[x_a x_a^\top]\), \(E[x_b x_b^\top]\) and \(E[x_c x_c^\top]\) are full rank,

\[ \mathrm{dist}(v_a, v_c) = -\log \frac{\prod_{i=1}^{k} \sigma_i\big(E[x_a x_c^\top]\big)}{\sqrt{\det\big(E[x_a x_a^\top]\big)\, \det\big(E[x_c x_c^\top]\big)}}, \]

so that

\[ e^{-\mathrm{dist}(v_a, v_c)} = \det\Big( E[x_a x_a^\top]^{-1/2}\, U^\top E[x_a x_c^\top]\, V\, E[x_c x_c^\top]^{-1/2} \Big), \]

where the k-SVD of \(E[x_a x_c^\top]\) is \(U \Sigma V^\top\). Similarly,

\[ e^{-\mathrm{dist}(v_a, v_b)} = \det\Big( E[x_a x_a^\top]^{-1/2}\, U^\top E[x_a x_b^\top]\, W\, E[x_b x_b^\top]^{-1/2} \Big), \]
\[ e^{-\mathrm{dist}(v_b, v_c)} = \det\Big( E[x_b x_b^\top]^{-1/2}\, W^\top E[x_b x_c^\top]\, V\, E[x_c x_c^\top]^{-1/2} \Big), \]

where k-SVD\((E[x_a x_b^\top]) = U \Sigma W^\top\) and k-SVD\((E[x_b x_c^\top]) = W \Sigma V^\top\). Therefore,

\[ e^{-(\mathrm{dist}(v_a, v_b) + \mathrm{dist}(v_b, v_c))} = \det\Big( E[x_a x_a^\top]^{-1/2}\, U^\top E[x_a x_b^\top]\, E[x_b x_b^\top]^{-1/2}\, E[x_b x_b^\top]^{-1/2}\, E[x_b x_c^\top]\, V\, E[x_c x_c^\top]^{-1/2} \Big) \]
\[ = \det\Big( E[x_a x_a^\top]^{-1/2}\, U^\top A\, E[x_b x_b^\top]\, B^\top\, V\, E[x_c x_c^\top]^{-1/2} \Big) = e^{-\mathrm{dist}(v_a, v_c)}. \]

We conclude that the multivariate information distance is additive. Note that \(E[x_a x_b^\top] = E\big[E[x_a x_b^\top \mid x_b]\big] = E[A x_b x_b^\top] = A\, E[x_b x_b^\top]\), and likewise \(E[x_b x_c^\top] = E[x_b x_b^\top]\, B^\top\).

We note that when the second moments are not full rank, the above distance can be extended as follows:

\[ \mathrm{dist}(v_a, v_c) = -\log \frac{\prod_{i=1}^{k} \sigma_i\big(E[x_a x_c^\top]\big)}{\sqrt{\prod_{i=1}^{k} \sigma_i\big(E[x_a x_a^\top]\big)\; \prod_{i=1}^{k} \sigma_i\big(E[x_c x_c^\top]\big)}}. \]
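As an illustration of Definition 4.1 in this general (possibly rank-deficient) form, the following C++/Eigen sketch computes the multivariate information distance from precomputed empirical second moments; the function name and interface are ours, not the paper's code.

```cpp
#include <Eigen/Dense>
#include <cmath>

// dist(v_a, v_c) = -[ sum_{i<=k} log sigma_i(E[x_a x_c^T])
//                     - (1/2)(sum_{i<=k} log sigma_i(E[x_a x_a^T])
//                             + sum_{i<=k} log sigma_i(E[x_c x_c^T])) ]
double infoDistance(const Eigen::MatrixXd& Eac,  // E[x_a x_c^T]
                    const Eigen::MatrixXd& Eaa,  // E[x_a x_a^T]
                    const Eigen::MatrixXd& Ecc,  // E[x_c x_c^T]
                    int k) {                     // hidden dimension
  auto logTopK = [k](const Eigen::MatrixXd& M) {
    Eigen::BDCSVD<Eigen::MatrixXd> svd(M);       // singular values only
    double s = 0.0;
    for (int i = 0; i < k; ++i) s += std::log(svd.singularValues()(i));
    return s;
  };
  return -(logTopK(Eac) - 0.5 * (logTopK(Eaa) + logTopK(Ecc)));
}
```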

F LOCAL RECURSIVE GROUPING

The Local Recursive Grouping (LRG) algorithm is a local divide-and-conquer procedure for learning the structure and parameters of the latent tree (Algorithm 1). We perform recursive grouping simultaneously on the sub-trees of the MST. Each sub-tree consists of an internal node and its neighborhood nodes. We keep track of the internal nodes of the MST and their neighbors. The resulting latent sub-trees after LRG can be merged easily to recover the final latent tree. Consider a pair of neighboring sub-trees in the MST; they have two common nodes (the internal nodes), which are neighbors on the MST. First, we identify the path from one internal node to the other in the trees to be merged, then compute the multivariate information distances between the internal nodes and the introduced hidden nodes. We recover the path between the two internal nodes in the merged tree by inserting the hidden nodes close to their surrogate nodes. Second, we merge all the leaves which are not on this path by attaching them to their parents. Hence, the recursive grouping can be done in parallel, and we recover the latent tree structure via this merging method.

Lemma F.1. If an observable node vj is the surrogate node of a hidden node hi, then the hidden node hi can be discovered using vj and the neighbors of vj in the MST.

This is due to the additive property of the multivariate information distance on the tree and the definition of a surrogate node. This observation is crucial for completely local and parallel structure and parameter estimation. It is also easy to see that all internal nodes of the MST are surrogate nodes.

After the parallel construction of the MST, we look at all the internal nodes Xint. For vi ∈ Xint, we denote the neighborhood of vi on the MST as nbdsub(vi;MST), which is a small sub-tree. Note that the number of such sub-trees is equal to the number of internal nodes in the MST.

For any pair of sub-trees, nbdsub(vi;MST) and nbdsub(vj;MST), there are two topological relationships, namely overlapping (i.e., the sub-trees share at least one node) and non-overlapping (i.e., the sub-trees do not share any nodes).

Since we define a neighborhood centered at vi as only its immediate neighbors and itself on the MST, the overlapping neighborhood pair nbdsub(vi;MST) and nbdsub(vj;MST) can only have conflicting paths, namely path(vi, vj;N i) and path(vi, vj;N j), if vi and vj are neighbors in the MST.

With this in mind, we locally estimate all the latent sub-trees, denoted N i, by applying Recursive Grouping ((Choi et al., 2011)) in parallel on nbdsub(vi;MST), ∀vi ∈ Xint. Note that the latent nodes automatically introduced by RG(vi) have vi as their surrogate. We update the tree structure by joining each level in a bottom-up manner. The relationship test among nodes ((Choi et al., 2011)) uses the additive multivariate information distance metric (Appendix E), Φ(vi, vj; vk) = dist(vi, vk) − dist(vj, vk), to decide whether nodes vi and vj are parent-child or siblings; a concrete sketch of this test is given below. If they are siblings, they should be joined by a hidden parent. If they are parent and child, the child node is placed as a lower-level node and the other node is added as its single parent, which is then joined in the next level.
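The following is a minimal sketch of this test (our illustration, with an explicit noise tolerance eps that the exact-moment argument does not need): Φ(vi, vj; vk) is constant over k exactly when vi and vj are directly related, and its value relative to ±dist(vi, vj) separates parent-child from siblings.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

// Sketch of the Recursive Grouping relationship test of Choi et al. (2011),
// using Phi(i,j;k) = dist(i,k) - dist(j,k). D is a pairwise multivariate
// information distance matrix; assumes at least three nodes. Illustrative.
enum class Relation { IParentOfJ, JParentOfI, Siblings, NotDirectlyRelated };

Relation rgTest(const std::vector<std::vector<double>>& D,
                int i, int j, double eps) {
  double lo = std::numeric_limits<double>::infinity(), hi = -lo;
  for (int k = 0; k < static_cast<int>(D.size()); ++k) {
    if (k == i || k == j) continue;
    double phi = D[i][k] - D[j][k];
    lo = std::min(lo, phi);
    hi = std::max(hi, phi);
  }
  if (hi - lo > eps) return Relation::NotDirectlyRelated;  // Phi not constant
  double phi = 0.5 * (lo + hi), dij = D[i][j];
  if (std::fabs(phi + dij) <= eps) return Relation::IParentOfJ;  // Phi = -d_ij
  if (std::fabs(phi - dij) <= eps) return Relation::JParentOfI;  // Phi = +d_ij
  return Relation::Siblings;  // constant with |Phi| < d_ij: add hidden parent
}
```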

Finally, for each internal edge of the MST connecting two internal nodes vi and vj, we consider merging the latent sub-trees. Using the example of two locally estimated latent sub-trees in Figure 1, we illustrate the complete local merging algorithm that we propose.

G PROOF SKETCH FOR THEOREM 7.1

We argue for the correctness of the method under exact moments; the sample complexity follows from previous works. To clarify the proof ideas, we define the notion of a surrogate node ((Choi et al., 2011)) as follows.

Definition G.1. The surrogate node for a hidden node hi on the latent tree T = (V, E) is defined as

\[ \mathrm{Sg}(h_i; \mathcal{T}) := \arg\min_{v_j \in \mathcal{X}} \mathrm{dist}(h_i, v_j). \]

In other words, the surrogate for a hidden node is the observable node with the minimum multivariate information distance from the hidden node. In Figure 1(a), the surrogate node of h1 is Sg(h1; T ) = v3, and Sg(h2; T ) = Sg(h3; T ) = v5. Note that the notion of surrogate node is only required for the analysis; our algorithm does not need to know this information.

The notion of surrogacy allows us to relate the constructed MST (over observed nodes) to the underlying latent tree. It can easily be shown that contracting the hidden nodes to their surrogates on the latent tree yields the MST. The local recursive grouping procedure can be viewed as reversing these contractions, and hence we obtain consistent local sub-trees.

We now argue the correctness of the structure-union procedure, which merges the local sub-trees. In each reconstructed sub-tree N i, where vi is the group leader, the discovered hidden nodes {hi} form a surrogate relationship with vi, i.e., Sg(hi; T ) = vi. Our merging approach maintains these surrogate relationships. For example, in Figure 1(d1,d2), we have the path v3 − h1 − v5 in N 3 and the path v3 − h3 − h2 − v5 in N 5. The resulting path is v3 − h1 − h3 − h2 − v5, as seen in Figure 1(e). We now argue why this is correct. As discussed before, Sg(h1; T ) = v3 and Sg(h2; T ) = Sg(h3; T ) = v5. When we merge the two subtrees, we want to preserve the paths from the group leaders to the added hidden nodes, and this ensures that the surrogate relationships are preserved in the resulting merged tree. Thus, we obtain a globally consistent tree structure by merging the local structures. The correctness of parameter learning comes from the consistency of the tensor decomposition techniques and careful alignment of the hidden labels across different decompositions. Refer to Appendices H and K for proof details and the sample complexity.

H PROOF OF CORRECTNESS FOR LRG

Definition H.1. A latent tree T≥3 is defined to be a minimal (or identifiable) latent tree if each latent variable has at least 3 neighbors.

Definition H.2. The surrogate node for a hidden node hi in the latent tree T = (V, E) is defined as

\[ \mathrm{Sg}(h_i; \mathcal{T}) := \arg\min_{v_j \in \mathcal{X}} \mathrm{dist}(h_i, v_j). \]

There are some useful observations about the MST in Choi et al. ((2011)) which we recall here.

Property H.3 (MST: surrogate neighborhood preservation). The surrogate nodes of any two neighboring nodes in E are also neighbors in the MST, i.e.,

\[ (h_i, h_j) \in E \;\Rightarrow\; (\mathrm{Sg}(h_i), \mathrm{Sg}(h_j)) \in \mathrm{MST}. \]

Property H.4 (MST: surrogate consistency along paths). If vj ∈ X and vh ∈ Sg−1(vj), then every node along the path connecting vj and vh belongs to the inverse surrogate set Sg−1(vj), i.e.,

\[ v_i \in \mathrm{Sg}^{-1}(v_j) \quad \text{for all } v_i \in \mathrm{Path}(v_j, v_h). \]


These MST properties connect the MST over the observable nodes with the original latent tree T : we obtain the MST by contracting each latent node to its surrogate node.

Given that the correctness of the CLRG algorithm is proved in Choi et al. ((2011)), we prove the equivalence between CLRG and our parallel LRG.

Lemma H.5. For any pair of sub-trees nbd[vi;MST] and nbd[vj;MST], there is at most one overlapping edge, and the overlapping edge exists if and only if vi ∈ nbd(vj;MST).

This is easy to see.

Lemma H.6. Denote the latent tree recovered from nbd[vi;MST] by N i, and similarly for nbd[vj;MST]. Any inconsistency between N i and N j occurs in the overlapping paths path(vi, vj;N i) and path(vi, vj;N j) after LRG is run on each subtree.

We now prove the correctness of LRG. Let us denote the latent tree resulting from merging a subset of small latent trees by TLRG(S), where S is the set of centers of the subtrees that are merged pairwise. The CLRG algorithm of Choi et al. ((2011)) implements RG serially; let TCLRG(S) denote the latent tree learned by CLRG, where S is the set of internal nodes visited by CLRG up to the current iteration. We prove the correctness of LRG by induction on the iterations.

At the initial step, S = ∅: TCLRG = MST and TLRG = MST, thus TCLRG = TLRG.

Now assume that, for the same set Si−1, TCLRG = TLRG holds for r = 1, . . . , i − 1. At iteration r = i, CLRG employs RG on the immediate neighborhood of node vi on TCLRG(Si−1); let Hi be the set of hidden nodes that are immediate neighbors of vi. The CLRG algorithm thus considers all the neighbors and implements RG. We know that the surrogate nodes of every latent node in Hi belong to the previously visited nodes Si−1. According to Properties H.3 and H.4, if we contract all the hidden-node neighbors to their surrogate nodes, CLRG is thus an RG on the neighborhood of vi on the MST.

As for our LRG algorithm at this step, TLRG(Si) is the merge of TLRG(Si−1) and N i. The latent nodes whose surrogate node is vi are introduced along the edge (vi−1, vi). We know that N i is the RG output on the immediate neighborhood of vi on the MST. Therefore, TCLRG(Si) = TLRG(Si).

I CROSS GROUP ALIGNMENT CORRECTION

To achieve cross-group alignment, tensor decompositions on two cross-group triplets have to be computed. The first triplet is formed by three nodes: the reference node in group 1, x1; a non-reference node in group 1, x2; and the reference node in group 2, x3. The second triplet is likewise formed by three nodes: the reference node in group 2, x3; a non-reference node in group 2, x4; and the reference node in group 1, x1. Let h1 denote the parent node in group 1 and h2 the parent node in group 2.

From Trip(x1, x2, x3), we obtain P (h1|x1) = A, P (x2|h1) = B and P (x3|h1) = P (x3|h2)P (h2|h1) = DE. From Trip(x3, x4, x1), we obtain P (x3|h2) = DΠ, P (x4|h2) = CΠ and P (h2|x1) = P (h2|h1)P (h1|x1) = ΠEA, where Π is a permutation matrix. We compute Π as

\[ \Pi = \Big( (\Pi E A)\, A^{\dagger}\, (DE)^{\dagger}\, (D\Pi) \Big)^{1/2}, \]

so that D = (DΠ)Π† is aligned with group 1. Once all the parameters of group 2 are permuted by Π, the alignment across the two groups is complete; a sketch of this computation follows.
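A direct transcription of this computation in C++/Eigen might look as follows. It is a sketch under our assumptions: the matrix square root follows the formula above (via Eigen's unsupported MatrixFunctions module), and the final rounding of the result to a permutation by row-wise argmax is a robustness step we add for noisy empirical moments, not part of the text.

```cpp
#include <Eigen/Dense>
#include <unsupported/Eigen/MatrixFunctions>  // provides MatrixXd::sqrt()

// Inputs, as recovered by the two triplet decompositions:
//   PEA = P(h2|x1) = Pi*E*A,   A   = P(h1|x1),
//   DE  = P(x3|h1) = D*E,      DPi = P(x3|h2) = D*Pi.
Eigen::MatrixXd alignPermutation(const Eigen::MatrixXd& PEA,
                                 const Eigen::MatrixXd& A,
                                 const Eigen::MatrixXd& DE,
                                 const Eigen::MatrixXd& DPi) {
  auto pinv = [](const Eigen::MatrixXd& M) {
    return M.completeOrthogonalDecomposition().pseudoInverse();
  };
  // Pi = ((Pi E A) A^+ (D E)^+ (D Pi))^{1/2}, as in the text above.
  Eigen::MatrixXd prod = PEA * pinv(A) * pinv(DE) * DPi;
  Eigen::MatrixXd M = prod.sqrt();
  // Round to the nearest permutation matrix by row-wise argmax (our addition).
  Eigen::MatrixXd Pi = Eigen::MatrixXd::Zero(M.rows(), M.cols());
  for (Eigen::Index r = 0; r < M.rows(); ++r) {
    Eigen::Index c;
    M.row(r).cwiseAbs().maxCoeff(&c);
    Pi(r, c) = 1.0;
  }
  return Pi;  // group-2 parameters are then permuted, e.g. D = (D*Pi) * Pi^T
}
```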

Similarly, the alignment correction can be done by calculating the permutation matrices while merging different threads.

Overall, we merge the local structures and align the parameters from the LRG local sub-trees using Procedures 2 and 3.

J COMPUTATIONAL COMPLEXITY

We recall some notation here: d is the observable node dimension, k is the hidden node dimension (k ≪ d), N is the number of samples, p is the number of observable nodes, and z is the number of non-zero elements in each sample.

Multivariate information distance estimation involves sparse matrix multiplications to compute the pairwise second moments. Each observable node has a d × N sample matrix with z non-zeros per column. Computing the product x1x2⊤ from a single sample for nodes 1 and 2 requires O(z) time, and there are N such sample-pair products, leading to O(Nz) time. There are O(p2) node pairs, and hence the degree of parallelism is O(p2); a sketch of this step is given below. Next, we perform the rank-k SVD of each of these matrices. Each SVD takes O(d2k) time using classical methods; using randomized methods ((Gittens and Mahoney, 2013b)), this can be improved to O(d + k3).
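As an illustration of this cost model (our sketch, not the released implementation): with per-node sparse sample matrices, the pairwise moments are independent sparse products that OpenMP can distribute over the O(p²) pairs.

```cpp
#include <Eigen/Dense>
#include <Eigen/Sparse>
#include <cstddef>
#include <vector>

// X[v] is the d x N sparse sample matrix of observable node v (about z
// non-zeros per column). Each pair costs O(Nz); pairs are independent, so
// the loop below is embarrassingly parallel.
std::vector<Eigen::MatrixXd>
pairwiseSecondMoments(const std::vector<Eigen::SparseMatrix<double>>& X) {
  const int p = static_cast<int>(X.size());
  const double N = static_cast<double>(X[0].cols());
  std::vector<Eigen::MatrixXd> M(static_cast<std::size_t>(p) * p);
  #pragma omp parallel for schedule(dynamic)
  for (int i = 0; i < p; ++i) {
    for (int j = 0; j < p; ++j) {
      Eigen::SparseMatrix<double> Xjt = X[j].transpose();
      Eigen::SparseMatrix<double> S = X[i] * Xjt;  // sum over samples of x_i x_j^T
      M[i * p + j] = Eigen::MatrixXd(S) / N;       // empirical E[x_i x_j^T]
    }
  }
  return M;
}
```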

Next, we construct the MST in O(log p) time per worker with p2 workers. The structure learning can be done in O(Γ3) time per sub-tree, and the local neighborhood of each node can be processed completely in parallel. We assume that the group sizes Γ are constant (the sizes are determined by the degree of nodes in the latent tree and the homogeneity of parameters across different edges of the tree). The parameter estimation of each triplet of nodes consists of implicit stochastic updates involving products of k × k and d × k matrices. Note that we do not need to consider all possible triplets within groups: each node must be covered by some triplet, and hence there are O(p) triplets. This leads to O(Γk3 + Γdk2) time per worker with p/Γ degree of parallelism.

Finally, the merging step consists of products of k × k and d × k matrices for each edge in the latent tree, leading to O(dk2) time per worker with p/Γ degree of parallelism.

K SAMPLE COMPLEXITY

From Theorem 11 of Choi et al. ((2011)), we recall the number of samples required for the recovery of a tree structure that is consistent with the ground truth (for a precise definition of consistency, refer to Definition 2 of Choi et al. ((2011))).

From Anandkumar et al. ((2012a)), we recall the sample complexity for the faithful recovery of parameters via tensor decomposition methods.

We define ϵP to be the error between the empirical and exact second-order moments, and ϵT to be the error between the empirical and exact third-order moments.

Lemma K.1. There exist positive constants C, C′, c and c′ such that the following holds. If

\[ \epsilon_P \le c\,\frac{\lambda_k}{\lambda_1 k}, \qquad \epsilon_T \le c'\,\frac{\lambda_k \sigma_k^{3/2}}{k}, \]

\[ N \ge C \left( \log(k) + \log\log\!\left( \frac{\lambda_1 \sigma_k^{3/2}}{\epsilon_T} + \frac{1}{\epsilon_P} \right) \right), \qquad L \ge \mathrm{poly}(k) \log(1/\delta), \]

then, with probability at least 1 − δ, tensor decomposition returns \((\hat v_i, \hat\lambda_i) : i \in [k]\) satisfying, after appropriate reordering,

\[ \| \hat v_i - v_i \|_2 \le C' \left( \frac{1}{\lambda_i \sigma_k^{2}}\,\epsilon_T + \frac{\lambda_1}{\lambda_i} \left( \frac{1}{\sqrt{\sigma_k}} + 1 \right) \epsilon_P \right), \]

\[ | \hat\lambda_i - \lambda_i | \le C' \left( \frac{1}{\sigma_k^{3/2}}\,\epsilon_T + \lambda_1\,\epsilon_P \right), \]

for all i ∈ [k].

We note that σ1 ≥ σ2 ≥ · · · ≥ σk > 0 are the non-zero singular values of the second-order moments, λ1 ≥ λ2 ≥ · · · ≥ λk > 0 are the ground-truth eigenvalues of the third-order moments, and vi are the corresponding eigenvectors, for all i ∈ [k].


L EFFICIENT SVD USING SPARSITY AND DIMENSIONALITY REDUCTION

Without loss of generality, we assume that the matrix whose SVD we aim to compute has no row or column that is entirely zero; any such rows and columns can simply be dropped.

Let A ∈ Rd×d be the matrix whose SVD we want to compute. Let Φ ∈ Rd×k̃, where k̃ = αk for a scalar α, usually in the range [2, 3]. For the i-th row of Φ: if the i-th row and the i-th column of A are both non-zero, that row of Φ has exactly one non-zero entry, whose position is chosen uniformly from [k̃]; if either is entirely zero, we leave that row of Φ blank. Let D ∈ Rd×d be a diagonal matrix with i.i.d. Rademacher entries, i.e., each diagonal entry is 1 or −1 with probability 1/2. Our embedding matrix ((Clarkson and Woodruff, 2013)) is then S = DΦ: we form AS and proceed with the Nystrom method ((Huang et al., 2013)). Unlike the usual Nystrom method ((Gittens and Mahoney, 2013a)), which uses a dense random matrix for the embedding, we use a sparse embedding matrix, since the sparsity improves both the running time and the memory requirements of the algorithm.
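A sketch of this embedding in C++/Eigen follows (our illustration under the stated construction; the trailing QR-then-small-SVD recovery is one standard randomized range-finder choice, not necessarily the paper's exact Nystrom variant).

```cpp
#include <Eigen/Dense>
#include <random>
#include <utility>

// Sparse sign embedding S = D*Phi (CountSketch-style, ((Clarkson and
// Woodruff, 2013))): one signed non-zero per row of S, so A*S touches each
// entry of A once. We then recover an approximate rank-k SVD from the sketch.
std::pair<Eigen::MatrixXd, Eigen::VectorXd>
sketchedTopK(const Eigen::MatrixXd& A, int k, double alpha = 2.5) {
  const Eigen::Index d = A.cols();
  const int kt = static_cast<int>(alpha * k);  // ktilde = alpha*k, alpha in [2,3]
  std::mt19937 gen(42);
  std::uniform_int_distribution<int> col(0, kt - 1);
  std::bernoulli_distribution sign(0.5);
  // Apply S implicitly: column h(i) of A*S accumulates s(i) * A(:, i).
  Eigen::MatrixXd AS = Eigen::MatrixXd::Zero(A.rows(), kt);
  for (Eigen::Index i = 0; i < d; ++i)
    AS.col(col(gen)) += (sign(gen) ? 1.0 : -1.0) * A.col(i);
  // Orthonormal basis of the sketch, then an SVD of the small projected matrix.
  Eigen::HouseholderQR<Eigen::MatrixXd> qr(AS);
  Eigen::MatrixXd Q = qr.householderQ() * Eigen::MatrixXd::Identity(A.rows(), kt);
  Eigen::BDCSVD<Eigen::MatrixXd> svd(Q.transpose() * A,
                                     Eigen::ComputeThinU | Eigen::ComputeThinV);
  return { Q * svd.matrixU().leftCols(k),   // approximate top-k left factors
           svd.singularValues().head(k) };  // approximate top-k spectrum
}
```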