multifaceted algorithm design richard peng m.i.t

Multifaceted Algorithm Design

Richard PengM.I.T.

LARGE SCALE PROBLEMS

Emphasis on efficient algorithms in:• Scientific computing• Graph theory• (randomized) numerical

routines

Network Analysis

Physical Simulation

Optimization

WELL STUDIED QUESTIONS

Scientific computing: fast solvers for structured linear systems

Graphs / combinatorics: network flow problems

Randomized algorithms: subsampling matrices and optimization formulations

B B’

MY REPRESENTATIVE RESULTS

Lx=b

B B’

Current fastest sequential and parallel solvers for linear systems in graph Laplacians matrices

First nearly-linear time algorithm for approximate undirected maxflow

First near-optimal routine for row sampling matrices in a 1-norm preserving manner

RECURRING IDEAS

Can solve a problem by iteratively solving several similar instances

Approximations lead to better approximations

Larger problems can be approximated by smaller ones

Approximator

Data

MY APPROACH TO ALGORITHM DESIGN

Numerical analysis /Optimization

Statistics /Randomized algorithms

Problems at their intersection

Identify problems that arise at the intersection of multiple areas and study them from multiple angles

Combinatorics / Discrete algorithms

This talk: structure-preserving sampling

SAMPLING

Classical use in statistics:• Extract info from a large data

set• Directly output result

(estimator)

Sampling from matrices, networks, and optimization problems:• Often compute on the sample• Need to preserve more structure

B B’

PRESERVING GRAPH STRUCTURESUndirected graph, n vertices, m < n2 edges

Is n2 edges (dense) sometimes necessary?

For some information, e.g. connectivity:encoded by spanning forest, < n edges

Deterministic, O(m) time algorithm

: questions

MORE INTRICATE STRUCTURES

k-connectivity: # of disjoint paths between s-t

[Benczur-Karger `96]: for ANY G, can sample to get H with O(nlogn) edges s.t. G ≈ H on all cuts

Stronger: weights of all 2n cuts in graphs

Cut: # of edges leaving a subset of vertices

s

t

Menger’s theorem / maxflow-mincut

: previous works

≈: multiplicative approximation

HOW TO SAMPLE?Widely used: uniform sampling Works well when data is

uniform e.g. complete graph

Problem: long path, removing any edge changes connectivity

(can also have both in one graph)

More systematic view of sampling?

ALGEBRAIC REPRESENTATION OF GRAPHS

n rows / columnsO(m) non-zeros

1

1

n verticesm edges

graph Laplacian Matrix L• Diagonal: degree• Off-diagonal:

-edge weights

Edge-vertex incidence matrix:Beu = -1/1 if u is

endpoint of e

0 otherwise

m rowsn columns

L is the Gram matrix of B, L = BTB

2 -1 -1 -1 1 0 -1 0 1

1 -1 0 -1 0 1

xv=0

SPECTRAL SIMILARITY

Numerical analysis:LG ≈ LH if xTLGx ≈ xTLHx for all vectors x

x = {0, 1}V:

G ≈ H on all cuts

xu=1 xz=1

(1-0)2=1

(1-1)2=0

Gram matrix: LG = BGTBG xTLGx

=║BGx║22

Beu = -1/1 if u is endpoint of e

0 otherwise

║BGx║2 ≈║BHx║2 ∀ x

║yi║22

=Σi yi2

For edge e = uv, (Be:x) 2

= (xu – xv)2

║BGx║22 = size of cut given by

x

n

n

ALGEBRAIC VIEW OF SAMPLING EDGES

B’

B

L2 Row sampling:

Given B with m>>n, sample a few rows to form B’ s.t.║Bx║2 ≈║B’x║2 ∀ x

Note: normally use A instead of B, n and d instead of m and n

m

0 -1 0 0 0 1 0 0 -5 0 0 0 5

0≈n

IMPORTANCE SAMPLING

Issue: only one non-zero row

Keep a row, bi, with probability pi, rescale if kept to maintain expectation

Uniform sampling: pi = 1/k for a factor k size reduction

norm sampling:pi =m/k║bi║2

2 / ║B║F2

Issue: column with one entry

THE `RIGHT’ PROBABILITIES

Only one non-zero row Column with one entry

00100

n/mn/mn/mn/m1

Path + clique:

1

1/n

bi: row i of B,L = BTB

τ: L2 statistical leverage scores

τi = biT(BTB)-1bi = ║bi║2

L-

1

L2 MATRIX-CHERNOFF BOUNDS

[Foster `49] Σi τi = rank ≤ n O(nlogn) rows

[Rudelson, Vershynin `07], [Tropp `12]: sampling with pi ≥ τiO( logn) gives B’ s.t. ║Bx║2 ≈║B’x║2 ∀x w.h.p.

τ: L2 statistical leverage scores

τi = biT(BTB)-1bi = ║bi║2

L-

1

Near optimal:• L2-row samples of

B• Graph sparsifiers

• In practice O(logn) 5 usually suffices

• can also improve via derandomization

MY APPROACH TO ALGORITHM DESIGN

Extend insights gained from studying problems at the intersection of multiple areas back to these areas





Algorithmic extensions of structure-preserving sampling

Maximum flow

Solving linear systems

Preserving L1-structures

SUMMARY

• Algorithm design approach: study problems at the intersection of areas, and extend insights back.• Can sparsify objects via importance

sampling.

Graph Laplacian• Diagonal: degree• Off-diagonal: -

weightCombinatorics / Discrete algorithms


Solvers for linear systems involving graph Laplacians

Lx = b

Current fastest sequential and parallel solvers for linear systems in graph Laplacians

Lx=b

Application: estimate all τi =║bi║2

L-1 by solving O(logn) linear systems

Directly related to:• Elliptic problems• SDD, M, and H-

matrices


ALGORITHMS FOR Lx = b

Given any graph Laplacian L with n vertices and m edges, any vector b, find vector x s.t. Lx = b

[Vaidya `89]: use graph theory!

2014: 1/2

loglog plot of c:

2011: 1

2010: 2

[Spielman-Teng `04]: O(mlogcn)

[P-Spielman `14]: alternate, fully parallelizable approach: my

results

2006: 32

2004: 70

2009: 15

2010: 6

: previous works

: questions

ITERATIVE METHODS

Division using multiplicationI + A + A2 + A3 + …. = (I – A)-1

= L-1

Spectral theorem: can view as scalars

Simplification: assume L = I – A,A: transition matrix of random walk

Richardson iteration: truncate to i terms,Approximate x = (I – A)-1b with x(i) = (I + A + … Ai)b

RICHARDSON ITERATION

#terms needed lower bounded by information propagation Adiameterb

Highly connected graphs: few terms ok

b Ab A2b

Need n matrix operations?

Evaluation (Horner’s rule):• (I + A + A2)b = A(Ab + b) +

b• i terms: x(0) = b, x(i + 1) = Ax(i)

+ b

i matrix-vector multiplications

Can interpret as gradient descent

(I – A)-1 = I + A + A2 + A3 + …. = (I + A) (I + A2) (I +

A4)…

DEGREE N N OPERATIONS?

Combinatorial view:• A: step of random walk• I – A2: Laplacian of the 2 step random walk

Dense matrix!

Repeated squaring: A16 = ((((A2)2)2)2, 4 operations

• O(logn) terms ok• Similar to multi-level

methods

Still a graph Laplacian!

Can sparsify!

REPEATED SPARSE SQUARING

Combining known tools: efficiently sparsify I – A2 without computing A2

(I – A)-1 = (I + A) (I + A2) (I + A4)…

[P-Spielman `14] approximate L-1 with O(logn) sparse matrices

key ideas: modify factorization to allow gradual introduction and control of error

SUMMARY

• Algorithm design approach: study problems at the intersection of areas, and extend insights back.• Can sparsify objects via importance sampling.• Solve Lx=b efficiently via sparsified

squaring.

FEW ITERATIONS OF Lx = b• [Tutte `61]: graph drawing, embeddings• [ZGL `03], [ZHS `05]: inference on graphical

models

Inverse powering: eigenvectors / heat kernel:• [AM `85] spectral clustering• [OSV `12]: balanced cuts• [SM `01][KMST `09]: image segmentation

[CFMNPW`14]: Helmholtz decomp. on 3D mesh

MANY ITERATIONS OF Lx = b[Karmarkar, Ye, Renegar, Nesterov, Nemirovski …]: convex optimization via. solving O(m1/2) linear systems

[DS `08]: optimization on graphs Laplacian systems

[KM `09][MST`14]: random spanning trees

[CKMST `11]: faster approx maximum flow

[KMP `12]: multicommodity flow

MAXFLOW




Maximum flow

First O(mpolylog(n)) time algorithm for approximate undirected maxflow

(for unweighted, undirected graphs)

MAXIMUM FLOW PROBLEM

s

t

s

t

Given s, t, find the maximum number of disjoint s-t paths

Dual: separate s and t by removing fewest edges

Applications:• Clustering• Image processing• Scheduling

WHAT MAKES MAXFLOW HARD

Highly connected: route up to n paths

Long paths: a step may involve n vertices

Goal: handle both and do better than many steps × long paths = n2

Each ‘easy’ on their own

ALGORITHMS FOR FLOWS

Current fastest maxflow algorithms:• Exact (weakly-polytime): invoke Lx=b• Approximate: modify algorithms for

Lx=b[P`14]: (1 – ε)-approx maxflow in O(mlogcnε-2) time

Ideas introduced:

1980: dynamic trees

1970s: Blocking flows

1986: dual algorithms

1989: connections to Lx = b

2013: modify Lx = b

2010: few calls to Lx = b

Algebraic formulation of min s-t cut:Minimize ║Bx║2 subject to xs = 0, xt = 1 and x integral

MAXIMUM FLOW IN ALMOST LINEAR TIME

[Madry `10]: finding O(m1+θ) sized approximator that require O(mθ) calls in O(m1+θ) time (for any θ > 0)Approximator

Maxflow [Racke-Shah-Taubig `14] O(n) sized approximator that require O(logcn) iterations via solving maxflows on graphs of total size O(mlogcn)

Maxflow Maxflow

Approximator Approximator

Chicken and egg problem

O(m1+2θε-2) timeO(mlogcnε-2) time?

Algebraic formulation of min s-t cut:Minimize ║Bx║1 subject to xs = 0, xt = 1 ║*║1 : 1-norm, sum of absolute

values

[Sherman `13] [Kelner-Lee-Orecchia-Sidford `13]:can find approximate maxflow iteratively via several calls to a structure approximator

ALGORITHMIC SOLUTION

Ultra-sparsifier (e.g. [Koutis-Miller-P `10]): for any k, can find H close to G, but equivalent to graph of size O(m/k)

` `

Maxflow

Absorb additional (small) error via more calls to approximatorRecurse on instances with smaller total size, total cost: O(mlogcn)

Key step: vertex reductions via edge reductions[P`14]: build approximator on the smaller graph

[CLMPPS`15]: extends to numerical data, has close connections to variants of Nystrom’s method

SUMMARY

• Algorithm design approach: study problems at the intersection of areas, and extend insights back.• Can sparsify objects via importance sampling.• Solve Lx=b efficiently via sparsified squaring.• Approximate maximum flow routines and

structure approximators can be constructed recursively from each other via graph sparsifiers.

RANDOMIZED NUMERICALLINEAR ALGEBRA




L1-preserving row sampling

B B’

First near-optimal routine for row sampling matrices in a 1-norm preserving manner

║y║1║y║2

GENERALIZATIONGeneralization of row sampling:given A, q, find A’ s.t.║Ax║q ≈║A’x║q ∀ x

1-norm: standard for representing cuts, used in sparse recovery / robust regression

Applications (for general A):• Feature selection• Low rank approximation / PCA

q-norm: ║y║q = (Σ|yi|q)1/q

Omitting corresponding empirical studies

ROW SAMPLING ROUTINES

#rows for q=2

#rows for q=1

Runtime

Dasgupta et al. `09 n2.5 mn5

Magdon-Ismail `10 nlog2n mn2

Sohler-Woodruff `11 n3.5 mnω-1+θ

Drineas et al. `12 nlogn mnlogn

Clarkson et al. `12 n4.5log1.5n mnlogn

Clarkson-Woodruff `12 n2logn n8 nnz

Mahoney-Meng `12 n2 n3.5 nnz+n6

Nelson-Nguyen `12 n1+θ nnz

Li et.`13, Cohen et al. 14

nlogn n3.66 nnz+nω+θ

[Naor `11][Matousek `97]: on graphs, L2 approx Lq approx ∀ 1 ≤ q ≤ 2

How special are graphs?

A’ s.t.║Ax║q ≈║A’x║q ∀ x nnz: # of non-zeros in A

How special is L2?

L1 ROW SAMPLING

L1 Lewis weights ([Lewis `78]):

w s.t. wi2 = ai

T(ATW-

1A)-1ai

Recursive definition!

[Sampling with pi ≥ wiO( logn) gives ║Ax║1 ≈ ║A’x║1

∀x

Can check: Σi wi ≤ n O(nlogn) rows

[Talagrand `90, “Embedding subspaces of L1 into LN

1”] can be analyzed as row-sampling /

sparsification

[COHEN-P `14]

Update w on LHS with w on RHS

w’i (ai

T(ATW-1A)-1ai)1/2

q Previous # of rows New # Rows Runtime

1 n2.5 nlogn nnz+nω+θ

1 < q < 2 nq/2+2 nlogn(loglogn)2 nnz+nω+θ

2 < q nq+1 np/2logn nnz+nq/2+O(1)

Converges in loglogn steps: analyze ATW-1A spectrally

Aside: similar to iterative reweighted least squares

Elementary, optimization motivated proof of w.h.p. concentration for L1

SUMMARY

• Algorithm design approach: study problems at the intersection of areas, and extend insights back.• Can sparsify objects via importance sampling.• Solve Lx=b efficiently via sparsified squaring.• Approximate maximum flow routines and cut-

approximators can be constructed recursively from each other via graph sparsifiers.• Wider ranges of structures can be

sparsified, key statistical quantities can be computed iteratively.

I’VE ALSO WORKED ON

• Dynamic graph data structures• Graph partitioning• Parallel algorithms• Image processing• Anomaly / sybil

detection in graphs

FUTURE WORK:LINEAR SYSTEM SOLVERS

• Wider classes of linear systems• Relation to optimization /

learning




Mx=bSolvers for linear systems involving graph Laplacians

FUTURE WORK:COMBINATORIAL OPTIMIZATION

Faster algorithms for more classical algorithmic graph theory problems?




Maximum flow

FUTURE WORK: RANDOMIZED NUMERICAL LINEAR ALGEBRA

• Other algorithmic applications of Lewis weights?• Low-rank approximation in L1?

• O(n)-sized L1-preserving row samples?(these exist for L2)




L1-preserving row sampling

B B’

SUMMARY


Numerical analysis / Optimization



B B’

Links to arXiv manuscripts and videos of more detailed talks are at:

math.mit.edu/~rpeng/

Mx=b

multifaceted algorithm design richard peng m.i.t

Documents

row sampling matrices

gram matrix of b

linear time algorithm

vectors x x

optimization problems

n cuts

onlogn edges

intersectionidentify