Optimal Transport
http://bicmr.pku.edu.cn/~wenzw/bigdata2019.html
Acknowledgement: these slides are based on Prof. Gabriel Peyré’s lecture notes
Applications: comparing measures
→ images, vision, graphics and machine learning, ...
• Optimal transport
[Figure: optimal transport mean compared with L2 mean]
Applications: toward high-dimensional OT
Outline
1 Problem Formulation
2 Applications
3 Entropic Regularization
4 Sinkhorn’s Algorithm
5 Sinkhorn-Newton method
6 Shielding Neighborhood Method
Kantorovitch’s Formulation
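Discrete Optimal Transport
Input: two discrete probability measures
$$\alpha = \sum_{i=1}^{m} a_i \delta_{x_i}, \qquad \beta = \sum_{j=1}^{n} b_j \delta_{y_j}. \tag{1}$$
$X = \{x_i\}_i$, $Y = \{y_j\}_j$: given point clouds; $x_i$, $y_j$ are vectors.
$a_i$, $b_j$: positive weights, $\sum_{i=1}^{m} a_i = \sum_{j=1}^{n} b_j = 1$.
$C_{ij}$: costs, $C_{ij} = c(x_i, y_j) \ge 0$.
Couplings
$$U(a, b) \overset{\text{def}}{=} \{\Pi \in \mathbb{R}_+^{m \times n} ;\ \Pi 1_n = a,\ \Pi^\top 1_m = b\} \tag{2}$$
is called the set of couplings with respect to α and β.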
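To make the coupling constraints concrete, here is a minimal sketch in plain Python that tests membership in U(a, b). The helper name `is_coupling` and the sample weights are illustrative assumptions, not part of the slides. It also checks that the product coupling with entries a_i b_j always satisfies both marginal constraints.

```python
def is_coupling(P, a, b, tol=1e-12):
    """Return True if P is entrywise nonnegative with row sums a and column sums b."""
    m, n = len(a), len(b)
    nonneg = all(P[i][j] >= 0 for i in range(m) for j in range(n))
    rows_ok = all(abs(sum(P[i]) - a[i]) <= tol for i in range(m))
    cols_ok = all(abs(sum(P[i][j] for i in range(m)) - b[j]) <= tol for j in range(n))
    return nonneg and rows_ok and cols_ok

# Hypothetical weights chosen only for illustration.
a = [0.5, 0.5]
b = [0.25, 0.25, 0.5]

# Product coupling: entry (i, j) equals a_i * b_j.
product = [[ai * bj for bj in b] for ai in a]

# A nonnegative matrix with the right row sums but the wrong column sums.
not_a_coupling = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0]]
```

In this sketch `is_coupling(product, a, b)` holds, while `not_a_coupling` fails the column-sum test.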
Kantorovitch’s Formulation
Discrete Optimal Transport
In optimal transport, we want to compute the following quantity [Kantorovich 1942]:
Optimal transport distance
$$L_C(a, b) \overset{\text{def}}{=} \min \Big\{ \sum_{i,j} C_{i,j} \Pi_{i,j} ;\ \Pi \in U(a, b) \Big\}. \tag{3}$$
Push Forward
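Radon measures (α, β) on $(\mathcal{X}, \mathcal{Y})$.
Transfer of measure by $T : \mathcal{X} \to \mathcal{Y}$: push forward.
The measure $T_\#\alpha$ on $\mathcal{Y}$ is defined by
$$T_\#\alpha(B) = \alpha(T^{-1}(B)) \quad \text{for all measurable } B \subset \mathcal{Y}. \tag{4}$$
Equivalently,
$$\int_{\mathcal{Y}} g(y)\, dT_\#\alpha(y) \overset{\text{def}}{=} \int_{\mathcal{X}} g(T(x))\, d\alpha(x). \tag{5}$$
Discrete measures: $T_\#\alpha = \sum_i a_i \delta_{T(x_i)}$.
Smooth densities: $d\alpha = \rho(x)\,dx$, $d\beta = \xi(x)\,dx$. For a smooth bijective T,
$$T_\#\alpha = \beta \iff \rho(x) = \xi(T(x))\,|\det(\partial T(x))|. \tag{6}$$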
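For discrete measures the push forward simply moves each weight to the image point and adds up weights that land on the same location. A minimal sketch, assuming points are hashable numbers; the function name `push_forward` and the sample data are hypothetical:

```python
from collections import defaultdict

def push_forward(points, weights, T):
    """Push the discrete measure sum_i w_i * delta_{x_i} forward through the map T."""
    image = defaultdict(float)
    for x, w in zip(points, weights):
        image[T(x)] += w  # masses mapped to the same point accumulate
    return dict(image)

# Illustrative measure on three points; T = abs folds -1 and 1 together.
alpha_points = [-1.0, 0.0, 1.0]
alpha_weights = [0.2, 0.3, 0.5]
pushed = push_forward(alpha_points, alpha_weights, abs)
```

Here the image measure is supported on two points, and its total mass equals that of the original measure.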
Monge problem
The Monge problem seeks a map that associates to each point $x_i$ a single point $y_j$, and which must push the mass of α toward the mass of β, namely:
$$\forall j, \quad b_j = \sum_{i : T(x_i) = y_j} a_i$$
Discrete case:
$$\min_T \sum_i c(x_i, T(x_i)), \quad \text{s.t. } T_\#\alpha = \beta$$
Arbitrary measures:
$$\min_T \int_{\mathcal{X}} c(x, T(x))\, d\alpha(x), \quad \text{s.t. } T_\#\alpha = \beta$$
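When both clouds have the same number of distinct points and uniform weights, the mass constraint forces T to be a bijection, so the discrete Monge problem becomes a search over permutations. The following brute-force sketch is only meant for tiny examples (it enumerates all n! permutations); the function name and data are assumptions made for illustration.

```python
from itertools import permutations

def monge_uniform(xs, ys, cost):
    """Brute-force Monge map between equal-size clouds with uniform weights.

    Returns (perm, value), where T(xs[i]) = ys[perm[i]] and value is the
    unweighted sum of costs, as in the discrete objective above.
    """
    n = len(xs)
    best_perm, best_value = None, float("inf")
    for perm in permutations(range(n)):
        value = sum(cost(xs[i], ys[perm[i]]) for i in range(n))
        if value < best_value:
            best_perm, best_value = perm, value
    return best_perm, best_value

xs = [0.0, 1.0, 2.0]
ys = [2.5, 0.5, 1.5]
perm, value = monge_uniform(xs, ys, lambda x, y: (x - y) ** 2)
```

For this data the search returns the monotone matching, which sends the smallest x to the smallest y and so on.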
Couplings between General Measures
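Projectors:
$$P_{\mathcal{X}} : (x, y) \in \mathcal{X} \times \mathcal{Y} \mapsto x \in \mathcal{X}, \qquad P_{\mathcal{Y}} : (x, y) \in \mathcal{X} \times \mathcal{Y} \mapsto y \in \mathcal{Y}. \tag{7}$$
Couplings between General Measures
$$U(\alpha, \beta) \overset{\text{def}}{=} \{\pi \in \mathcal{M}_+(\mathcal{X} \times \mathcal{Y}) ;\ P_{\mathcal{X}\#}\pi = \alpha,\ P_{\mathcal{Y}\#}\pi = \beta\} \tag{8}$$
is called the set of couplings with respect to α and β.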
Cases of Couplings
[Figure: couplings π between α and β in the discrete, continuous, and semi-discrete cases]
More Examples
[Figure: further examples of couplings π between measures α and β]
Kantorovitch Problem for General Measures
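Optimal transport distance between General Measures
$$L(\alpha, \beta, c) \overset{\text{def}}{=} \min_{\pi \in U(\alpha, \beta)} \int_{\mathcal{X} \times \mathcal{Y}} c(x, y)\, d\pi(x, y). \tag{9}$$
Probability interpretation:
$$\min_{(X, Y)} \{\mathbb{E}_{(X,Y)}(c(X, Y)) ;\ X \sim \alpha,\ Y \sim \beta\}. \tag{10}$$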
Wasserstein Distance
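Metric space $\mathcal{X} = \mathcal{Y}$.
Distance d(x, y) (nonnegative, symmetric, identity, triangle inequality).
Cost $c(x, y) = d(x, y)^p$, $p \ge 1$.
Wasserstein Distance
$$W_p(\alpha, \beta) \overset{\text{def}}{=} L(\alpha, \beta, d^p)^{1/p}. \tag{11}$$
Theorem
$W_p$ is a distance, and
$$W_p(\alpha_n, \alpha) \to 0 \iff \alpha_n \xrightarrow{\text{weak}} \alpha. \tag{12}$$
Example
$$W_p(\delta_x, \delta_y) = d(x, y). \tag{13}$$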
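On the real line with d(x, y) = |x − y| and p ≥ 1, an optimal coupling between two uniform empirical measures with equally many points is obtained by matching sorted samples, which gives a very short way to evaluate W_p. The sketch below relies on that special case only; the function name is hypothetical and it is not a general solver.

```python
def wasserstein_1d(xs, ys, p=2):
    """W_p between uniform empirical measures on the real line (equal sample counts)."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equally many points in both samples")
    n = len(xs)
    # Monotone (sorted) matching is optimal for the convex cost |x - y|**p in 1-D.
    total = sum(abs(x - y) ** p for x, y in zip(sorted(xs), sorted(ys)))
    return (total / n) ** (1.0 / p)

# Two Dirac masses reproduce Example (13): the distance is |x - y| for any p.
dirac_distance = wasserstein_1d([0.3], [1.8], p=3)

# Translating a sample by 2 moves it a Wasserstein distance of 2.
shifted_distance = wasserstein_1d([0.0, 1.0, 3.0], [5.0, 2.0, 3.0], p=2)
```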
Dual form
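Dual problem (discrete case)
$$\max_{f \in \mathbb{R}^m,\, g \in \mathbb{R}^n} f^\top a + g^\top b, \quad \text{subject to } f_i + g_j \le C_{ij},\ \forall (i, j) \tag{14}$$
Relation between any optimal primal solution P and any optimal dual solution (f, g) (complementary slackness):
$$P_{ij} > 0 \Rightarrow f_i + g_j = C_{ij}.$$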
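A small numerical check can make the primal-dual relation tangible. The sketch below uses a hand-picked 2×2 instance (all numbers are illustrative assumptions) and verifies dual feasibility, complementary slackness, and equality of the primal and dual objective values; the helper names are hypothetical.

```python
def transport_cost(P, C):
    """Primal objective sum_ij C_ij * P_ij."""
    return sum(C[i][j] * P[i][j] for i in range(len(C)) for j in range(len(C[0])))

def dual_value(f, g, a, b):
    """Dual objective f^T a + g^T b."""
    return sum(fi * ai for fi, ai in zip(f, a)) + sum(gj * bj for gj, bj in zip(g, b))

def dual_feasible(f, g, C, tol=1e-12):
    """Check f_i + g_j <= C_ij for all (i, j)."""
    return all(f[i] + g[j] <= C[i][j] + tol for i in range(len(f)) for j in range(len(g)))

def slackness_holds(P, f, g, C, tol=1e-12):
    """Check f_i + g_j = C_ij wherever P_ij > 0."""
    return all(abs(f[i] + g[j] - C[i][j]) <= tol
               for i in range(len(f)) for j in range(len(g)) if P[i][j] > 0)

a, b = [0.5, 0.5], [0.5, 0.5]
C = [[1.0, 3.0], [4.0, 2.0]]
P = [[0.5, 0.0], [0.0, 0.5]]      # candidate primal solution (diagonal coupling)
f, g = [0.0, 0.0], [1.0, 2.0]     # candidate dual solution
```

Because the candidate pair is feasible on both sides and the two objective values coincide, weak duality certifies that both are optimal for this instance.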
Wasserstein barycenter
Define $C \overset{\text{def}}{=} M_{XY}$, where $(M_{XY})_{ij} = d(x_i, y_j)^p$. The Wasserstein distance is then written as
$$L(a, b, C) \overset{\text{def}}{=} \min \Big\{ \sum_{i,j} C_{i,j} \Pi_{i,j} ;\ \Pi \in U(a, b) \Big\}. \tag{15}$$
Given a set of point clouds and their corresponding probability vectors $\{(Y^{(k)}, b^{(k)})\}$, $k = 1, \dots, N$, find a support $X = \{x_i\}$ with a probability vector a such that (X, a) is an optimal solution of the following problem:
$$\min_{X, a} \frac{1}{N} \sum_{k=1}^{N} L(a, b^{(k)}, M_{XY^{(k)}})$$
Applications: Grayscale image equalization
[Figure: interpolation frames at t = 0, 0.25, 0.5, 0.75, 1]
Applications: image color adaptation
Example: https://github.com/rflamary/POT/blob/master/notebooks/plot_otda_color_images.ipynb
Given color images stored in the RGB format: I1, I2
import matplotlib.pylab as pl
import ot
# im2mat, idx1 and idx2 are defined in the linked notebook
# Convert each image to a matrix (one pixel per row)
X1 = im2mat(I1)
X2 = im2mat(I2)
# Take samples
Xs = X1[idx1, :]
Xt = X2[idx2, :]
# Scatter plot of colors
pl.scatter(Xs[:, 0], Xs[:, 2], c=Xs)
# Sinkhorn transport
ot_sinkhorn = ot.da.SinkhornTransport(reg_e=1e-1)
ot_sinkhorn.fit(Xs=Xs, Xt=Xt)
# Prediction between images
transp_Xs_sinkhorn = ot_sinkhorn.transform(Xs=X1)
transp_Xt_sinkhorn = ot_sinkhorn.inverse_transform(Xt=X2)
Applications: image color palette equalization
[Figure: color palette equalization via optimal transport]
Applications: shape interpolation
Applications: MRI Data Processing
[Figure: L2 barycenter compared with $W_2^2$ barycenter]
Ground cost $c = d_M$: geodesic distance on the cortical surface M.
Applications
[Figure: input and output shapes µ and ν at iterations 0 and 250, with the kernel $k_\sigma$ and the smoothed difference $(\mu - \nu) \star k_\sigma$]
Shape registration (loss + regularity):
$$\min_{\varphi \text{ diffeo}} D(\varphi(\mu), \nu) + R(\varphi)$$
Hilbertian loss (MMD/RKHS):
$$D(\mu, \nu) = \|k_\sigma \star (\mu - \nu)\|_{L^2}^2$$
Sinkhorn divergence:
$$D(\mu, \nu) = W_\varepsilon(\mu, \nu)$$
→ Sinkhorn’s iterates “propagate” a small bandwidth kernel.
→ Automatic differentiation: game changer for advanced losses and models.
→ Do not use OT for registration ... but as a loss.
Joint work with J. Feydy, B. Charlier, F-X. Vialard.
Applications: word mover’s distance
Normalized bag-of-words (nBOW), word travel cost (word2vec distance), document distance $\sum_{i,j} T_{ij}\, c(i, j)$, transportation problem.
[Figure: documents D1 and D2 represented as measures µ and ν, with dist(D1, D2) = $W_2(\mu, \nu)$]
Applications: word mover’s distance
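$$\min_{\Pi \ge 0} \sum_{ij} \Pi_{ij} c_{ij} \quad \text{s.t. } \sum_{j=1}^{n} \Pi_{ij} = d_i, \quad \sum_{i=1}^{n} \Pi_{ij} = d'_j$$
$x_i$: word2vec embedding, $c_{ij} = \|x_i - x_j\|_2$.
If word i appears $w_i$ times in the document, we denote $d_i = w_i / \sum_j w_j$.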
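The normalized bag-of-words weights are just relative word frequencies. A minimal sketch, assuming the document is already tokenized into a list of strings (the tokenization and the function name `nbow` are illustrative assumptions):

```python
from collections import Counter

def nbow(tokens):
    """Map each word i to d_i = w_i / sum_j w_j, where w_i is its count in the document."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Toy document used only for illustration.
d = nbow(["obama", "speaks", "obama", "press"])
```

The resulting weights sum to one, so they can serve as the marginal vector d in the transportation problem above.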
Applications: topic models
[Figure: four topics.] Top-left topic: competitions. Top-right: time. Bottom-left: soccer actions. Bottom-right: drugs.
Applications
[Figure: use T to define registration between color distributions and between shapes, from a source to several targets. Pipeline: shapes $(X_s)_s$ → geodesic distances → GW distances → MDS visualization, shown in 2-D and in 3-D.]
Discrete OT Review
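Given an integer n > 1, we write $\Sigma_n$ for the discrete probability simplex
$$\Sigma_n \overset{\text{def}}{=} \Big\{ a \in \mathbb{R}_+^n ;\ \sum_{i=1}^{n} a_i = 1 \Big\}. \tag{16}$$
Given $a \in \Sigma_m$, $b \in \Sigma_n$, the optimal transport problem is to compute
$$L(a, b, C) \overset{\text{def}}{=} \min \Big\{ \sum_{i,j} C_{i,j} P_{i,j} ;\ \text{s.t. } P \in U(a, b) \Big\}, \tag{17}$$
where U(a, b) is the set of couplings between a and b.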
Entropy
The discrete entropy of a positive matrix P (with $\sum_{ij} P_{ij} = 1$) is defined as
$$H(P) \overset{\text{def}}{=} -\sum_{i,j} P_{i,j} (\log(P_{i,j}) - 1). \tag{18}$$
For a positive vector $u \in \Sigma_n$, the entropy is defined analogously:
$$H(u) \overset{\text{def}}{=} -\sum_i u_i (\log(u_i) - 1). \tag{19}$$
For two positive vectors $u, v \in \Sigma_n$, the Kullback-Leibler divergence (or KL divergence) is defined to be
$$\mathrm{KL}(u \| v) = -\sum_{i=1}^{n} u_i \log\Big(\frac{v_i}{u_i}\Big). \tag{20}$$
The KL divergence is always non-negative: $\mathrm{KL}(u \| v) \ge 0$ (by Jensen’s inequality, $\mathbb{E}[f(g(X))] \ge f(\mathbb{E}[g(X)])$ for convex f).
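The two definitions translate directly into code. The sketch below follows the slide's convention, in which the entropy carries an extra −1 inside the sum, so a probability vector has H(u) equal to the usual Shannon entropy plus one. Function names and sample vectors are illustrative, and entries are assumed strictly positive.

```python
import math

def entropy(p):
    """H(p) = -sum_i p_i * (log(p_i) - 1), for strictly positive entries."""
    return -sum(x * (math.log(x) - 1.0) for x in p)

def kl(u, v):
    """KL(u || v) = -sum_i u_i * log(v_i / u_i), for strictly positive u and v."""
    return -sum(ui * math.log(vi / ui) for ui, vi in zip(u, v))

uniform = [0.25] * 4
h_uniform = entropy(uniform)           # equals log(4) + 1 under this convention
kl_same = kl(uniform, uniform)         # a vector has zero divergence from itself
kl_diff = kl([0.5, 0.5], [0.9, 0.1])   # strictly positive for distinct vectors
```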
Entropic regularization
Given $a \in \Sigma_m$, $b \in \Sigma_n$ and a cost matrix $C \in \mathbb{R}_+^{m \times n}$, the entropic regularization of the transportation problem reads
$$L_\varepsilon(a, b, C) = \min_{P \in U(a, b)} \langle P, C \rangle - \varepsilon H(P). \tag{21}$$
The case ε = 0 corresponds to the classic (linear) optimal transport problem.
For ε > 0, problem (21) has an ε-strongly convex objective and therefore admits a unique optimal solution $P^\star_\varepsilon$.
This is not (necessarily) true for ε = 0. But we have the following proposition.
Entropic regularization
Proposition
When ε → 0, the unique solution $P_\varepsilon$ of (21) converges to the optimal solution with maximal entropy within the set of all optimal solutions of the unregularized transportation problem, namely,
$$P_\varepsilon \xrightarrow{\varepsilon \to 0} \operatorname*{argmax}_P \{H(P) ;\ P \in U(a, b),\ \langle P, C \rangle = L_0(a, b, C)\}. \tag{22}$$
The above proposition motivates us to solve the problems in (21) sequentially and then take ε → 0.
Entropic regularization
Proof
We consider a sequence $(\varepsilon_\ell)_\ell$ such that $\varepsilon_\ell \to 0$ and $\varepsilon_\ell > 0$. We denote $P_\ell = P^\star_{\varepsilon_\ell}$. Since U(a, b) is bounded, we can extract a subsequence (that we do not relabel for the sake of simplicity) such that $P_\ell \to P^\star$. Since U(a, b) is closed, $P^\star \in U(a, b)$. We consider any P such that $\langle C, P \rangle = L_0(a, b, C)$. By optimality of P and $P_\ell$ for their respective optimization problems (for ε = 0 and ε = $\varepsilon_\ell$), one has
$$0 \le \langle C, P_\ell \rangle - \langle C, P \rangle \le \varepsilon_\ell (H(P_\ell) - H(P)). \tag{23}$$
Since H is continuous, taking the limit ℓ → +∞ in this expression shows that $\langle C, P^\star \rangle = \langle C, P \rangle$. Furthermore, dividing by $\varepsilon_\ell$ and taking the limit shows that $H(P) \le H(P^\star)$. The result now follows from the strict convexity of −H.
Entropic regularization
By the concavity of entropy, for α > 0, we introduce the convex set
$$U_\alpha(a, b) \overset{\text{def}}{=} \{P \in U(a, b) \mid \mathrm{KL}(P \| ab^\top) \le \alpha\} = \{P \in U(a, b) \mid H(P) \ge H(a) + H(b) - 1 - \alpha\}. \tag{24}$$
Definition: Sinkhorn Distance
$$d_{C,\alpha}(a, b) \overset{\text{def}}{=} \min_{P \in U_\alpha(a, b)} \langle C, P \rangle. \tag{25}$$
Proposition
For α ≥ 0, $d_{C,\alpha}(a, b)$ is symmetric and satisfies all triangle inequalities. Moreover, $1_{a \ne b}\, d_{C,\alpha}(a, b)$ satisfies all three distance axioms.
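The equality of the two descriptions in (24) follows from the entropy convention of (18) and (19). A short derivation, using only that any P in U(a, b) has marginals a and b and total mass 1, so that the sum of P_ij log P_ij equals 1 − H(P):

```latex
\begin{aligned}
\mathrm{KL}(P \,\|\, ab^\top)
  &= \sum_{i,j} P_{ij} \log \frac{P_{ij}}{a_i b_j}
   = \sum_{i,j} P_{ij} \log P_{ij} - \sum_i a_i \log a_i - \sum_j b_j \log b_j \\
  &= \bigl(1 - H(P)\bigr) - \bigl(1 - H(a)\bigr) - \bigl(1 - H(b)\bigr)
   = H(a) + H(b) - H(P) - 1 .
\end{aligned}
```

Hence the constraint KL(P‖ab^T) ≤ α is the same as H(P) ≥ H(a) + H(b) − 1 − α.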
Entropic regularization
Proposition
For α large enough, the Sinkhorn distance $d_{C,\alpha}$ is the transport distance $d_C$.
Proof.
Note that for any $P \in U(a, b)$, we have
$$H(P) \ge \frac{1}{2}(H(a) + H(b)), \tag{26}$$
so for $\alpha \ge \frac{1}{2}(H(a) + H(b)) - 1$, we have
$$U_\alpha(a, b) = U(a, b).$$
Sinkhorn’s algorithm
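For solving (21), consider its Lagrangian
$$\mathcal{L}^\varepsilon_C(P, \alpha, \beta) = \langle C, P \rangle - \varepsilon H(P) + \alpha^\top (P 1_n - a) + \beta^\top (P^\top 1_m - b). \tag{27}$$
Now let $\partial \mathcal{L}^\varepsilon_C / \partial p_{ij} = 0$, i.e.,
$$p_{ij} = e^{-\frac{c_{ij} + \alpha_i + \beta_j}{\varepsilon}}, \tag{28}$$
so we can write
$$P_\varepsilon = \mathrm{diag}(e^{-\alpha/\varepsilon})\, e^{-C/\varepsilon}\, \mathrm{diag}(e^{-\beta/\varepsilon}). \tag{29}$$
Since $P_\varepsilon$ must also satisfy
$$P_\varepsilon 1_n = a, \qquad P_\varepsilon^\top 1_m = b, \tag{30}$$
we can use Sinkhorn’s algorithm to find $P_\varepsilon$.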
Sinkhorn’s algorithm
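Let $u = e^{-\alpha/\varepsilon}$, $v = e^{-\beta/\varepsilon}$ and $K = e^{-C/\varepsilon}$. We again state the KKT system of (21):
$$P_\varepsilon = \mathrm{diag}(u) K \mathrm{diag}(v), \qquad a = \mathrm{diag}(u) K v, \qquad b = \mathrm{diag}(v) K^\top u. \tag{31}$$
Then Sinkhorn’s algorithm amounts to alternating updates of the form
$$u^{(k+1)} = \mathrm{diag}(K v^{(k)})^{-1} a, \qquad v^{(k+1)} = \mathrm{diag}(K^\top u^{(k+1)})^{-1} b. \tag{32}$$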
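The alternating updates can be sketched in a few lines of dependency-free Python. This is a minimal illustration under stated assumptions: a fixed iteration count stands in for a proper stopping criterion, the scalings start at all ones, and no log-domain stabilization is used, so very small ε would underflow. The function name and test data are hypothetical.

```python
import math

def sinkhorn(a, b, C, eps, n_iter=500):
    """Alternate the u- and v-updates and return P = diag(u) K diag(v)."""
    m, n = len(a), len(b)
    K = [[math.exp(-C[i][j] / eps) for j in range(n)] for i in range(m)]
    u = [1.0] * m
    v = [1.0] * n
    for _ in range(n_iter):
        # u <- a ./ (K v)
        Kv = [sum(K[i][j] * v[j] for j in range(n)) for i in range(m)]
        u = [a[i] / Kv[i] for i in range(m)]
        # v <- b ./ (K^T u), using the freshly updated u
        Ktu = [sum(K[i][j] * u[i] for i in range(m)) for j in range(n)]
        v = [b[j] / Ktu[j] for j in range(n)]
    return [[u[i] * K[i][j] * v[j] for j in range(n)] for i in range(m)]

a = [0.5, 0.5]
b = [0.25, 0.75]
C = [[0.0, 1.0], [1.0, 0.0]]
P = sinkhorn(a, b, C, eps=0.5)

row_sums = [sum(row) for row in P]
col_sums = [sum(P[i][j] for i in range(len(a))) for j in range(len(b))]
```

After the final v-update the column sums match b up to rounding, while the row sums match a only in the limit; on this small example both are accurate to many digits.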
Sinkhorn’s algorithm
Sinkhorn’s algorithm
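1. Compute $K = e^{-C/\varepsilon}$ (entrywise exponential).
2. Compute $\tilde{K} = \mathrm{diag}(a^{-1}) K$.
3. Initialize the scale factor $u \in \mathbb{R}^m$.
4. Iteratively update u:
$$u = 1./(\tilde{K}(b./(K^\top u))),$$
where ./ denotes entrywise division, until a stopping criterion is reached.
5. Compute
$$v = b./(K^\top u),$$
and finally $P_\varepsilon = \mathrm{diag}(u) K \mathrm{diag}(v)$.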
Sinkhorn-Newton method
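The dual problem of (21) is
$$\min_{\alpha, \beta}\ \langle a, \alpha \rangle + \langle b, \beta \rangle + \varepsilon \langle e^{-\alpha/\varepsilon}, K e^{-\beta/\varepsilon} \rangle,$$
with α, β being the dual variables. Its first-order optimality conditions are
$$\mathrm{diag}(e^{-\alpha/\varepsilon}) K e^{-\beta/\varepsilon} = a, \qquad \mathrm{diag}(e^{-\beta/\varepsilon}) K^\top e^{-\alpha/\varepsilon} = b. \tag{33}$$
[1] proposes using Newton’s method to solve this system.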
Sinkhorn-Newton method
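Let
$$F(\alpha, \beta) = \begin{pmatrix} a - \mathrm{diag}(e^{-\alpha/\varepsilon}) K e^{-\beta/\varepsilon} \\ b - \mathrm{diag}(e^{-\beta/\varepsilon}) K^\top e^{-\alpha/\varepsilon} \end{pmatrix}, \tag{34}$$
where the sign is chosen so that the Jacobian (37) below is positive semidefinite. We want to find α, β such that F(α, β) = 0 so that
$$P_\varepsilon = \mathrm{diag}(e^{-\alpha/\varepsilon})\, e^{-C/\varepsilon}\, \mathrm{diag}(e^{-\beta/\varepsilon}). \tag{35}$$
The Newton iteration is given by
$$\begin{pmatrix} \alpha^{(k+1)} \\ \beta^{(k+1)} \end{pmatrix} = \begin{pmatrix} \alpha^{(k)} \\ \beta^{(k)} \end{pmatrix} - J_F^{-1}(\alpha^{(k)}, \beta^{(k)})\, F(\alpha^{(k)}, \beta^{(k)}), \tag{36}$$
where, with $P = \mathrm{diag}(e^{-\alpha/\varepsilon}) K \mathrm{diag}(e^{-\beta/\varepsilon})$,
$$J_F = \frac{1}{\varepsilon} \begin{pmatrix} \mathrm{diag}(P 1_n) & P \\ P^\top & \mathrm{diag}(P^\top 1_m) \end{pmatrix}. \tag{37}$$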
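A quick way to sanity-check the block Jacobian is to compare it with central finite differences. The sketch below assumes the residual is taken as F = (a − P 1_n, b − Pᵀ 1_m), the sign convention under which the Jacobian equals +(1/ε) times the block matrix; with the opposite sign for F the Jacobian flips sign as well. All names and numerical values are illustrative assumptions.

```python
import math

def plan(alpha, beta, C, eps):
    """P_ij = exp(-(C_ij + alpha_i + beta_j) / eps)."""
    return [[math.exp(-(C[i][j] + alpha[i] + beta[j]) / eps)
             for j in range(len(beta))] for i in range(len(alpha))]

def residual(alpha, beta, a, b, C, eps):
    """F = (a - P 1_n, b - P^T 1_m), stacked into one list."""
    P = plan(alpha, beta, C, eps)
    m, n = len(a), len(b)
    rows = [a[i] - sum(P[i]) for i in range(m)]
    cols = [b[j] - sum(P[i][j] for i in range(m)) for j in range(n)]
    return rows + cols

def jacobian(alpha, beta, C, eps):
    """(1/eps) * [[diag(P 1_n), P], [P^T, diag(P^T 1_m)]]."""
    P = plan(alpha, beta, C, eps)
    m, n = len(alpha), len(beta)
    J = [[0.0] * (m + n) for _ in range(m + n)]
    for i in range(m):
        J[i][i] = sum(P[i]) / eps
        for j in range(n):
            J[i][m + j] = P[i][j] / eps
            J[m + j][i] = P[i][j] / eps
    for j in range(n):
        J[m + j][m + j] = sum(P[i][j] for i in range(m)) / eps
    return J

def max_fd_error(alpha, beta, a, b, C, eps, h=1e-6):
    """Largest entrywise gap between the formula and central differences."""
    m = len(alpha)
    x = list(alpha) + list(beta)
    J = jacobian(alpha, beta, C, eps)
    worst = 0.0
    for k in range(len(x)):
        xp, xm = x[:], x[:]
        xp[k] += h
        xm[k] -= h
        Fp = residual(xp[:m], xp[m:], a, b, C, eps)
        Fm = residual(xm[:m], xm[m:], a, b, C, eps)
        for r in range(len(x)):
            worst = max(worst, abs((Fp[r] - Fm[r]) / (2 * h) - J[r][k]))
    return worst

a, b = [0.5, 0.5], [0.25, 0.75]
C = [[0.0, 1.0], [1.0, 0.0]]
alpha, beta, eps = [0.1, -0.2], [0.05, 0.0], 0.5
err = max_fd_error(alpha, beta, a, b, C, eps)
J = jacobian(alpha, beta, C, eps)
# Applying J to (1_m, -1_n) should give (numerically) zero.
kernel_image = [sum(J[r][c] * s for c, s in enumerate([1, 1, -1, -1])) for r in range(4)]
```

The last line illustrates why the Newton system is singular: shifting α by a constant and β by the opposite constant leaves P unchanged.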
Sinkhorn-Newton method: Convergence
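Proposition
For $\alpha \in \mathbb{R}^m$ and $\beta \in \mathbb{R}^n$, the Jacobian matrix $J_F(\alpha, \beta)$ is symmetric positive semidefinite, and its kernel is given by
$$\ker(J_F(\alpha, \beta)) = \mathrm{span}\Big\{ \begin{pmatrix} 1_m \\ -1_n \end{pmatrix} \Big\}. \tag{38}$$
Proof
$J_F$ is clearly symmetric. For arbitrary $\gamma \in \mathbb{R}^m$ and $\phi \in \mathbb{R}^n$, one has
$$\begin{pmatrix} \gamma^\top & \phi^\top \end{pmatrix} J_F \begin{pmatrix} \gamma \\ \phi \end{pmatrix} = \frac{1}{\varepsilon} \sum_{ij} P_{ij} (\gamma_i + \phi_j)^2 \ge 0,$$
which holds with equality if and only if $\gamma_i + \phi_j = 0$ for all i, j, leading us to (38).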
Sinkhorn-Newton method: Convergence
Lemma
Let $F : D \to \mathbb{R}^n$ be a continuously differentiable mapping with $D \subset \mathbb{R}^n$ open and convex. Suppose that $F'(x)$ is invertible for each $x \in D$. Assume that the following affine covariant Lipschitz condition holds
$$\|F'(x)^{-1}(F'(y) - F'(x))(y - x)\| \le \omega \|y - x\|^2 \tag{39}$$
for $x, y \in D$. Let F(x) = 0 have a solution $x^*$. For the initial guess $x^{(0)}$ assume that $B(x^*, \|x^{(0)} - x^*\|) \subset D$ and that
$$\omega \|x^{(0)} - x^*\| < 2.$$
Then the ordinary Newton iterates remain in the open ball $B(x^*, \|x^{(0)} - x^*\|)$ and converge to $x^*$ at an estimated quadratic rate
$$\|x^{(k+1)} - x^*\| \le \frac{\omega}{2} \|x^{(k)} - x^*\|^2. \tag{40}$$
Moreover, the solution $x^*$ is unique in the open ball $B(x^*, 2/\omega)$.
Sinkhorn-Newton method: Convergence
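Proof
Denote $e^{(k)} = x^{(k)} - x^*$. Let us prove the lemma by induction, using $F(x^*) = 0$:
$$\begin{aligned}
\|e^{(k+1)}\| &= \|x^{(k)} - F'(x^{(k)})^{-1}(F(x^{(k)}) - F(x^*)) - x^*\| \\
&= \|e^{(k)} - F'(x^{(k)})^{-1}(F(x^{(k)}) - F(x^*))\| \\
&= \|F'(x^{(k)})^{-1}\big((F(x^*) - F(x^{(k)})) + F'(x^{(k)}) e^{(k)}\big)\| \\
&= \Big\|F'(x^{(k)})^{-1} \int_{s=0}^{-1} \big(F'(x^{(k)} + s e^{(k)}) - F'(x^{(k)})\big) e^{(k)}\, ds\Big\| \\
&\le \omega \Big|\int_{s=0}^{-1} s\, ds\Big|\, \|e^{(k)}\|^2 = \frac{\omega}{2} \|e^{(k)}\|^2 < \|e^{(k)}\|.
\end{aligned} \tag{41}$$
Also
$$\omega \|e^{(k+1)}\| \le \omega \|e^{(k)}\| < 2. \tag{42}$$
For the uniqueness part, let $x^{**} \ne x^*$ be a different solution in $B(x^*, 2/\omega)$ and take $x^{(0)} = x^{**}$. Then $x^{(1)} = x^{**}$, and (40) with k = 0 gives $\|x^{**} - x^*\| \le \frac{\omega}{2}\|x^{**} - x^*\|^2 < \|x^{**} - x^*\|$, a contradiction.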
Sinkhorn-Newton method: Convergence
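Proposition
For any $k \in \mathbb{N}$ with $P^{(k)}_{\varepsilon,ij} > 0$, the affine covariant Lipschitz condition holds in the $\ell_\infty$-norm for
$$\omega \le (e^{1/\varepsilon} - 1) \Big( 1 + 2 e^{1/\varepsilon}\, \frac{\max\{\|P^{(k)}_\varepsilon 1_n\|_\infty, \|(P^{(k)}_\varepsilon)^\top 1_m\|_\infty\}}{\min_{ij} P^{(k)}_{\varepsilon,ij}} \Big) \tag{43}$$
when $\|y - x\|_\infty \le 1$.
The proof of this proposition is tedious, so we refer interested readers to the paper [1].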
Relationship with Sinkhorn’s algorithm
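Let $u = e^{-\alpha/\varepsilon}$, $v = e^{-\beta/\varepsilon}$ and $K = e^{-C/\varepsilon}$. We again state the KKT system of (21):
$$P_\varepsilon = \mathrm{diag}(u) K \mathrm{diag}(v), \qquad a = \mathrm{diag}(u) K v, \qquad b = \mathrm{diag}(v) K^\top u. \tag{44}$$
Then Sinkhorn’s algorithm amounts to alternating updates of the form
$$u^{(k+1)} = \mathrm{diag}(K v^{(k)})^{-1} a, \qquad v^{(k+1)} = \mathrm{diag}(K^\top u^{(k+1)})^{-1} b. \tag{45}$$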
Relationship with Sinkhorn’s algorithm
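Define
$$G(u, v) = \begin{pmatrix} \mathrm{diag}(u) K v - a \\ \mathrm{diag}(v) K^\top u - b \end{pmatrix}. \tag{46}$$
Proceeding analogously to the Sinkhorn-Newton method we just discussed, note that
$$J_G(u, v) = \begin{pmatrix} \mathrm{diag}(K v) & \mathrm{diag}(u) K \\ \mathrm{diag}(v) K^\top & \mathrm{diag}(K^\top u) \end{pmatrix}. \tag{47}$$
If we neglect the off-diagonal blocks above, i.e., use the approximation
$$\hat{J}_G(u, v) = \begin{pmatrix} \mathrm{diag}(K v) & 0 \\ 0 & \mathrm{diag}(K^\top u) \end{pmatrix}, \tag{48}$$
and perform the Newton iteration
$$\begin{pmatrix} u^{(k+1)} \\ v^{(k+1)} \end{pmatrix} = \begin{pmatrix} u^{(k)} \\ v^{(k)} \end{pmatrix} - \hat{J}_G^{-1}(u^{(k)}, v^{(k)})\, G(u^{(k)}, v^{(k)}), \tag{49}$$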
Relationship with Sinkhorn’s algorithm
We get
$$u^{(k+1)} = \mathrm{diag}(K v^{(k)})^{-1} a, \qquad v^{(k+1)} = \mathrm{diag}(K^\top u^{(k)})^{-1} b. \tag{50}$$
So Sinkhorn’s algorithm simply approximates one Newton step by neglecting the off-diagonal blocks and replacing $u^{(k)}$ by $u^{(k+1)}$ in the v-update.
Shielding Neighborhood Method
Main features:
• Proposed by Bernhard Schmitzer [2] in 2016
• Exploits a hierarchical multiscale scheme
• Solves a sequence of sparse subproblems instead of one large dense problem
• Any existing discrete solver can be used as an internal black box
• Significant improvements in both runtime and memory requirements have been observed
Shielding Neighborhood Method
Definition
Any subset $\mathcal{N} \subset X \times Y$ is called a neighborhood.
In this method we repeatedly consider the problem restricted to $\mathcal{N}$. It is important that $\mathcal{N}$ is small (sparse) compared to $X \times Y$.
Definition (Short-Cut)
For a neighborhood $\mathcal{N} \subset X \times Y$ and a coupling π with $\mathrm{spt}\, \pi \subset \mathcal{N}$, let $((x_2, y_2), \dots, (x_{n-1}, y_{n-1}))$ be an ordered tuple of pairs in spt π. We say $((x_2, y_2), \dots, (x_{n-1}, y_{n-1}))$ is a short-cut for $(x_1, y_n) \in X \times Y$ if $(x_i, y_{i+1}) \in \mathcal{N}$ for $i = 1, \dots, n - 1$ and
$$c(x_1, y_n) \ge c(x_1, y_2) + \sum_{i=2}^{n-1} \big(c(x_i, y_{i+1}) - c(x_i, y_i)\big). \tag{51}$$
Shielding Neighborhood Method
Find a clever way to choose a small neighborhood N such that thereis a short-cut for every (x, y) 6∈ N .
Definition (Shielding Condition). Let $x \in X$, $y \in Y$ and $(x_s, y_s) \in X \times Y$. We say $(x_s, y_s)$ shields $x$ from $y$ when

$$c(x, y) + c(x_s, y_s) > c(x, y_s) + c(x_s, y).$$

Definition (Shielding Neighborhood). For a given coupling $\pi$, we say that a neighborhood $N \subset X \times Y$ with $N \supset \operatorname{spt} \pi$ is a shielding neighborhood if for every pair $(x, y) \notin N$ there exists some $(x_s, y_s) \in \operatorname{spt} \pi$ with $(x_s, y_s) \in N$ such that $(x_s, y_s)$ shields $x$ from $y$.
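Both definitions translate directly into predicates. The sketch below follows the shielding condition exactly as stated above and assumes a squared-distance cost on integer points for the example; the function names are hypothetical. It scans all of $X \times Y$, so it is a checker for small examples rather than something to run at scale.

```python
def shields(xs, ys, x, y, cost):
    """True when the pair (xs, ys) shields x from y."""
    return cost(x, y) + cost(xs, ys) > cost(x, ys) + cost(xs, y)


def is_shielding_neighborhood(X, Y, support, neighborhood, cost):
    """Check the shielding-neighborhood definition by brute force."""
    support, neighborhood = set(support), set(neighborhood)
    if not support <= neighborhood:  # N must contain spt(pi)
        return False
    # Every excluded pair needs at least one shielding pair in the support.
    return all(
        any(shields(xs, ys, x, y, cost) for xs, ys in support)
        for x in X
        for y in Y
        if (x, y) not in neighborhood
    )


def sq_cost(x, y):
    """Squared-distance cost on the real line (an assumption for this example)."""
    return (x - y) ** 2


points = [0, 1, 2, 3]
support = {(p, p) for p in points}  # support of the identity coupling
band = {(x, y) for x in points for y in points if abs(x - y) <= 1}

print(is_shielding_neighborhood(points, points, support, band, sq_cost))     # True
print(is_shielding_neighborhood(points, points, support, support, sq_cost))  # False
```

For the identity coupling, the band of pairs with distance at most 1 is shielding because any farther pair has a support point strictly between its endpoints. The support alone is not, since for example no support pair shields 0 from 1.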
Multiscale Scheme
(Hierarchical Partition and Multiscale Measure Approximation) For a discrete set $X$, a hierarchical partition is an ordered tuple $(X_0, \dots, X_K)$ of partitions of $X$, where $X_0 = \{\{x\} : x \in X\}$ is the trivial partition of $X$ into singletons and each subsequent level is generated by merging cells from the previous level.
For simplicity, we assume that the coarsest level is the trivial partition into one set: $X_K = \{X\}$. We call $K > 0$ the depth of the hierarchical partition.
This implies a directed tree graph with vertex set $\bigcup_{k'=0}^{K} X_{k'}$. For $k \in \{1, \dots, K\}$, we say $x' \in X_j$, $j < k$, is a descendant of $x \in X_k$ when $x' \subset x$. We call $x'$ a child of $x$ if $j = k - 1$, and a leaf if $j = 0$.
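As an illustration of the definition, the sketch below builds one possible hierarchical partition of distinct points on the real line by repeatedly merging neighboring cells in pairs. The merging rule and the names `build_hierarchy` and `children` are assumptions made for this example; the slides do not prescribe how cells are merged.

```python
def build_hierarchy(points):
    """Return [X_0, ..., X_K]; each level is a list of frozenset cells.

    Assumes the points are distinct and sortable. Cells are merged two at a
    time, so the depth K grows logarithmically with the number of points.
    """
    level = [frozenset([p]) for p in sorted(points)]  # X_0: singletons
    levels = [level]
    while len(level) > 1:
        # Merge cell i with its right neighbor (if any) to form the next level.
        level = [
            level[i].union(*level[i + 1 : i + 2]) for i in range(0, len(level), 2)
        ]
        levels.append(level)
    return levels


def children(cell, levels, k):
    """Cells of level k-1 that are contained in `cell`, a cell of level k."""
    return [c for c in levels[k - 1] if c <= cell]


levels = build_hierarchy([0, 1, 2, 3, 4])
K = len(levels) - 1
print(K)                                # 3
print(levels[K] == [frozenset(range(5))])  # True: coarsest level is {X}
print(children(levels[3][0], levels, 3) == [frozenset({0, 1, 2, 3}), frozenset({4})])  # True
```

Note that an unpaired cell such as {4} is carried to the next level unchanged, so a cell can be its own child in this construction.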
Main Algorithm
Assume that we already have two subalgorithms:
• solveLocal: $2^{X \times Y} \to \Pi(\mu, \nu)$ such that for $N \in 2^{X \times Y}$ the coupling solveLocal($N$) is locally primal optimal w.r.t. $N$. When $N$ is sparse, any DOT solver can quickly provide an answer.
• shield: $\Pi(\mu, \nu) \to 2^{X \times Y}$ such that for $\pi \in \Pi(\mu, \nu)$ the neighborhood shield($\pi$) is shielding for $\pi$.
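To make the role of shield concrete, here is a deliberately naive sketch: it starts from the support of the coupling and adds every pair that no support pair shields, using the shielding condition as stated in these slides. This is not the efficient construction from [2]. It visits all of $X \times Y$, which is exactly what the method tries to avoid, and the name `naive_shield` is hypothetical.

```python
def naive_shield(X, Y, support, cost):
    """Return a neighborhood that is shielding for a coupling with this support.

    Brute force over X x Y: a pair is added whenever no support pair shields it.
    """
    support = set(support)
    neighborhood = set(support)  # N must contain spt(pi)
    for x in X:
        for y in Y:
            if (x, y) in neighborhood:
                continue
            shielded = any(
                cost(x, y) + cost(xs, ys) > cost(x, ys) + cost(xs, y)
                for xs, ys in support
            )
            if not shielded:
                neighborhood.add((x, y))
    return neighborhood


def sq_cost(x, y):
    """Squared-distance cost on the real line (an assumption for this example)."""
    return (x - y) ** 2


points = [0, 1, 2, 3]
identity_support = {(p, p) for p in points}
N = naive_shield(points, points, identity_support, sq_cost)
# For the identity coupling only the pairs at distance <= 1 remain.
print(sorted(N) == sorted((x, y) for x in points for y in points if abs(x - y) <= 1))  # True
```

Here 10 of the 16 pairs survive. On larger grids the band stays narrow while $X \times Y$ grows quadratically, which is the sparsity the method relies on.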
Algorithm 1: Main algorithm
    k ← K
    π ← solveDense(k)    // this can be done immediately
    while k > 0 do
        k ← k − 1
        N_1 ← {}
        for (x, y) ∈ spt π do
            N_1 ← N_1 ∪ (children(x) × children(y))
        π ← solveSparse(k, N_1)    // solveSparse is presented below
    return π
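The refinement step inside the loop, which builds $N_1$ from the support of the coarser coupling, can be sketched as follows. The dictionaries `children_x` and `children_y` (mapping each coarse cell to its children) and the function name are assumptions made for this illustration.

```python
from itertools import product


def refine_neighborhood(coarse_support, children_x, children_y):
    """Build N_1 as the union of children(x) x children(y) over the coarse support."""
    n1 = set()
    for x, y in coarse_support:
        n1.update(product(children_x[x], children_y[y]))
    return n1


# Toy hierarchy: coarse cells "A", "B" on the X side and "C", "D" on the Y side.
children_x = {"A": ["a1", "a2"], "B": ["b1"]}
children_y = {"C": ["c1"], "D": ["d1", "d2"]}
coarse_support = [("A", "C"), ("B", "D")]

n1 = refine_neighborhood(coarse_support, children_x, children_y)
print(sorted(n1))
# [('a1', 'c1'), ('a2', 'c1'), ('b1', 'd1'), ('b1', 'd2')]
```

Only pairs whose parents exchanged mass at the coarser level enter $N_1$, so its size is governed by the coarse support rather than by the full product of the finer levels.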
solveSparse
solveSparse takes the current scale level $k$ and a feasible neighborhood $N_1$, initialized from the coarser level, and computes the global optimizer $\pi$ and a corresponding neighborhood $N$ at the current hierarchical level.
Algorithm 2: solveSparse
    Input: current scale level k, initial feasible neighborhood N_1
    i ← 1
    // No coupling π_1 exists, so the cost comparison only becomes meaningful
    // after two solves; by convention treat C(π_1) as +∞.
    while i = 1 or C(π_i) ≠ C(π_{i−1}) do
        π_{i+1} ← solveLocal(N_i)       // use any DOT solver
        N_{i+1} ← shield(π_{i+1}, k)    // shield is discussed later
        i ← i + 1
    return π_i    // returning N_i is not necessary
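The loop can be exercised end to end on a toy problem. The sketch below makes several assumptions that are not in the slides: uniform marginals on equally many integer points (so an optimal coupling restricted to $N$ can be found among permutations), a squared-distance cost, a brute-force `solve_local` over permutations, and a brute-force `shield` based on the shielding condition as stated earlier. The scale level $k$ is omitted because the example uses a single level. Both helpers are stand-ins for a real DOT solver and for the efficient shielding construction in [2].

```python
from itertools import permutations


def cost(x, y):
    return (x - y) ** 2  # integer costs, so exact equality tests are safe


def solve_local(xs, ys, neighborhood):
    """Cheapest permutation coupling supported in N (exponential time, toy only)."""
    best, best_cost = None, float("inf")
    for perm in permutations(range(len(ys))):
        pairs = [(xs[i], ys[j]) for i, j in enumerate(perm)]
        if all(p in neighborhood for p in pairs):
            total = sum(cost(x, y) for x, y in pairs)
            if total < best_cost:
                best, best_cost = pairs, total
    if best is None:
        raise ValueError("neighborhood admits no feasible coupling")
    return best, best_cost


def shield(xs, ys, support):
    """Brute-force shielding neighborhood: keep every pair no support pair shields."""
    neighborhood = set(support)
    for x in xs:
        for y in ys:
            if (x, y) not in neighborhood and not any(
                cost(x, y) + cost(a, b) > cost(x, b) + cost(a, y) for a, b in support
            ):
                neighborhood.add((x, y))
    return neighborhood


def solve_sparse(xs, ys, n1):
    """Alternate solve_local and shield until the cost stops changing."""
    neighborhood, prev_cost = n1, None
    while True:
        support, total = solve_local(xs, ys, neighborhood)
        if prev_cost is not None and total == prev_cost:
            return support, total
        neighborhood, prev_cost = shield(xs, ys, support), total


points = [0, 1, 2, 3]
# Deliberately bad start: only the order-reversing coupling is feasible (cost 20).
n1 = {(p, 3 - p) for p in points}
support, total = solve_sparse(points, points, n1)
print(support, total)  # [(0, 0), (1, 1), (2, 2), (3, 3)] 0
```

On this input the cost sequence is 20, 2, 0, 0: each shielding neighborhood exposes a cheaper coupling until the identity coupling is reached, and the repeated cost triggers termination.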
References
G. Peyré and M. Cuturi, Computational Optimal Transport, arXiv:1803.00567, 2018. https://github.com/rflamary/POT
C. Brauer, C. Clason, D. Lorenz, and B. Wirth, A Sinkhorn-Newton method for entropic optimal transport, arXiv preprint arXiv:1710.06635, 2017.
B. Schmitzer, A sparse multiscale algorithm for dense optimal transport, Journal of Mathematical Imaging and Vision, 56 (2016), pp. 238–259.