
RIEMANNIAN OPTIMIZATION FOR HIGH-DIMENSIONAL TENSOR COMPLETION

MICHAEL STEINLECHNER∗

Abstract. Tensor completion aims to reconstruct a high-dimensional data set where the vast majority of entries is missing. The assumption of low-rank structure in the underlying original data allows us to cast the completion problem into an optimization problem restricted to the manifold of fixed-rank tensors. Elements of this smooth embedded submanifold can be efficiently represented in the tensor train (TT) or matrix product states (MPS) format with storage complexity scaling linearly with the number of dimensions. We present a nonlinear conjugate gradient scheme within the framework of Riemannian optimization which exploits this favorable scaling. Numerical experiments and comparison to existing methods show the effectiveness of our approach for the approximation of multivariate functions. Finally, we show that our algorithm can obtain competitive reconstructions from uniform random sampling of few entries compared to adaptive sampling techniques such as cross-approximation.

Key words. Completion, data recovery, high-dimensional problems, low-rank approximation, tensor train format, matrix product states, Riemannian optimization, curse of dimensionality

AMS subject classifications. 15A69, 65K05, 65F99, 58C05

1. Introduction. In this paper, we consider the completion of a d-dimensional tensor A ∈ R^{n_1×n_2×···×n_d} in which the vast majority of entries are unknown. Depending on the source of the data set represented by A, assuming an underlying low-rank structure is a sensible choice. In particular, this is the case for discretizations of functions that are well-approximated by a short sum of separable functions. Low-rank models have been shown to be effective for various applications, such as electronic structure calculations and approximate solutions of stochastic and parametric PDEs, see [19] for an overview. With this assumption, the tensor completion problem can be cast into the optimization problem

    min_X  f(X) := (1/2) ‖P_Ω X − P_Ω A‖²                                                  (1.1)
    subject to  X ∈ M_r := { X ∈ R^{n_1×n_2×···×n_d} | rank_TT(X) = r },

where rank_TT(X) denotes the tensor train rank of the tensor X, a certain generalization of the matrix rank to the tensor case, see Section 3 for details. We denote by P_Ω the projection onto the sampling set Ω ⊂ {1, 2, . . . , n_1} × · · · × {1, 2, . . . , n_d} corresponding to the indices of the known entries of A,

    (P_Ω X)(i_1, i_2, . . . , i_d) :=  X(i_1, i_2, . . . , i_d)   if (i_1, i_2, . . . , i_d) ∈ Ω,
                                       0                          otherwise.                  (1.2)

In the two-dimensional case d = 2, tensor completion (1.1) reduces to the extensively studied matrix completion problem, for which convergence theory and efficient algorithms have been derived in recent years, see [31] for an overview. The number of entries in the

∗MATHICSE-ANCHP, Section de Mathématiques, École Polytechnique Fédérale de Lausanne, 1015 Lausanne, Switzerland. E-mail: michael.steinlechner@epfl.ch. This work has been supported by the SNSF research module Riemannian optimization for solving high-dimensional problems with low-rank tensor techniques within the SNSF ProDoc Efficient Numerical Methods for Partial Differential Equations. Parts of this work have been conducted during a research stay at PACM, Princeton University, with a mobility grant from the SNSF.



optimization variable X ∈ R^{n_1×n_2×···×n_d} scales exponentially with d, the so-called curse of dimensionality. Hence, the naive approach of reshaping the given data A into a large matrix and applying existing algorithms for matrix completion is computationally infeasible for all but very small problem sizes.

Assuming low-rank structure in the different tensor formats along with their notion of tensor rank allows us to reduce the degrees of freedom significantly. For an overview of the most common choices of tensor formats, we refer to the literature survey by Grasedyck et al. [19]. The most popular choice, the Tucker tensor format, reduces the storage complexity from O(n^d) to O(dnr + r^d) and thus yields good results for not too high-dimensional problems, say d ≤ 5. Algorithms developed for the Tucker format range from alternating minimization [27, 28] and convex optimization using various generalizations of the nuclear norm [27, 28, 15, 41, 42, 43], to Riemannian optimization on the embedded submanifold of tensors with fixed Tucker rank [26].

If higher-dimensional problems are considered, the Tucker format can be replaced by the hierarchical Tucker (HT) format or the tensor train / matrix product states (TT/MPS) format. In the TT format, for example, the number of unknowns in a tensor with fixed rank r scales like O(dnr²), which potentially allows for the efficient treatment of problems with a much higher number of dimensions d than the Tucker format.

This paper generalizes the Riemannian approach presented in [48, 26] to a higher-dimensional setting using the TT format. We obtain an algorithm called RTTC that scales linearly in the number of dimensions d, in the tensor size n and in the size of the sampling set |Ω|, and polynomially in the tensor train rank r. This scaling allows us to apply the algorithm to very high-dimensional problems.

For the HT format, Da Silva and Herrmann [10, 11] have derived HTOpt, a closely related Riemannian approach. By exploiting the simpler structure of the TT format, the algorithm presented in this paper exhibits a better asymptotic complexity of O(d|Ω|r²) instead of O(d|Ω|r³) for HTOpt. Furthermore, our implementation is specifically tuned for very small sampling set sizes. In the TT format, Grasedyck et al. [17] presented an alternating direction method with similar computational complexity of O(d|Ω|r²). We will compare these two methods to our algorithm and to a simple alternating least squares approach.

Note that for tensor completion, the index set Ω of given entries is fixed. If we relax this constraint and instead try to reconstruct the original tensor A from as few entries as possible, black-box approximation techniques such as cross-approximation are viable alternatives. In the matrix case, cross approximation was first developed for the approximation of integral operators [6, 7, 8] and as the so-called skeleton decomposition [16, 45]. Extensions to high-dimensional tensors were developed by Oseledets et al. [36, 39] for the TT format and by Ballani et al. [5, 4] for the HT format. In these techniques, the sampling points are chosen adaptively during the runtime of the algorithm. In this paper, we compare them to our proposed algorithm and show in numerical experiments that choosing randomly distributed sampling points beforehand is competitive in certain applications.

We first state in Section 2 the final Riemannian optimization algorithm to give an overview of the necessary ingredients and quantities. Then we briefly recapitulate the TT format and its geometric properties in Section 3, and explain each individual part of the optimization algorithm in detail in Section 4. We focus on the efficient computation of each part and give a detailed analysis of the involved computational complexity to successfully approach the tensor completion problem (1.1). Finally, we apply our algorithm to several test cases and compare its performance to other existing methods in Section 5.

2. A Riemannian nonlinear conjugate gradient scheme. To solve the tensor completion problem (1.1), we use a nonlinear conjugate gradient scheme restricted to M_r, the set of TT tensors with fixed rank. As we will discuss in Section 3, this set forms a smooth embedded submanifold of R^{n_1×n_2×···×n_d} and we can apply a Riemannian version [1, Sec. 8.3] of the usual nonlinear CG algorithm. The main difference is that we need to make sure that every iterate X_k of the optimization scheme stays on the manifold M_r and that we use the Riemannian gradient ξ_k, which points in the direction of greatest increase within the tangent space T_{X_k} M_r.

In the first step, we move from the starting point X_0 ∈ M_r in the direction of the negative Riemannian gradient η_0 = −ξ_0. This yields a new iterate which is, in general, not an element of M_r anymore. Hence, we need the concept of a retraction R which maps the new iterate X_0 + α_0 η_0 back to the manifold. Starting from the second step, we begin with the conjugate gradient scheme. For this, we need to combine the current gradient ξ_k ∈ T_{X_k} M_r with the previous search direction η_{k−1} ∈ T_{X_{k−1}} M_r, and hence need to bring η_{k−1} into the current tangent space using the vector transport T_{X_{k−1}→X_k}.

Algorithm 1 RTTC: Riemannian CG for Tensor Train Completion, see also [26]
Input: Initial guess X_0 ∈ M_r.
  ξ_0 ← grad f(X_0)                       % compute Riemannian gradient (Sec. 4.2)
  η_0 ← −ξ_0                              % first step is steepest descent
  α_0 ← argmin_α f(X_0 + α η_0)           % step size by linearized line search (Sec. 4.5)
  X_1 ← R(X_0, α_0 η_0)                   % obtain next iterate by retraction (Sec. 4.3)
  for k = 1, 2, . . . do
    ξ_k ← grad f(X_k)                     % compute Riemannian gradient
    η_k ← −ξ_k + β_k T_{X_{k−1}→X_k} η_{k−1}   % conjugate direction by updating rule (Sec. 4.4)
    α_k ← argmin_α f(X_k + α η_k)         % step size by linearized line search
    X_{k+1} ← R(X_k, α_k η_k)             % obtain next iterate by retraction
  end for

The resulting optimization scheme, which we call RTTC (Riemannian Tensor Train Completion), is given by Algorithm 1. Note that the basic structure does not depend on the representation of X as a TT tensor and is exactly the same as in the Tucker case [26]. We will derive and explain each part of the algorithm in the following Section 4: the computation of the Riemannian gradient in Sec. 4.2, the line search procedure in Sec. 4.5, the retraction in Sec. 4.3, the direction updating rule and vector transport in Sec. 4.4, and the stopping criteria in Sec. 4.6.

3. The geometry of tensors in the TT format. In this section, we give a brief overview of the TT representation and fix the notation for the rest of this paper. The notation follows the presentation in [25]. For a more detailed description of the TT format, we refer to the original paper by I. Oseledets [34]. We will review the manifold structure of TT tensors with fixed rank and present different representations of elements in the tangent space. This will then allow us to derive our optimization scheme in Section 4.


3.1. The TT representation of tensors. A d-dimensional tensor X ∈ R^{n_1×n_2×···×n_d} can be reshaped into a matrix

    X^{<µ>} ∈ R^{(n_1 n_2 ··· n_µ) × (n_{µ+1} ··· n_d)},

the so-called µth unfolding of X, see [34] for a formal definition of the ordering of the indices. This allows us to define the tensor train rank rank_TT as the following (d+1)-tuple of ranks,

    rank_TT(X) = r = (r_0, r_1, . . . , r_d) := (1, rank(X^{<1>}), . . . , rank(X^{<d−1>}), 1),

where we define r_0 = r_d = 1. Given these ranks (r_0, . . . , r_d), it is possible to write every entry of the d-dimensional tensor X ∈ R^{n_1×n_2×···×n_d} as a product of d matrices,

    X(i_1, i_2, . . . , i_d) = U_1(i_1) U_2(i_2) · · · U_d(i_d),                               (3.1)

where each U_µ(i_µ) is a matrix of size r_{µ−1} × r_µ for i_µ = 1, 2, . . . , n_µ. The slices U_µ(i_µ), i_µ = 1, 2, . . . , n_µ, can be combined into third-order tensors U_µ of size r_{µ−1} × n_µ × r_µ, the cores of the TT format. In terms of these core tensors, equation (3.1) can also be written as

    X(i_1, . . . , i_d) = ∑_{k_1=1}^{r_1} · · · ∑_{k_{d−1}=1}^{r_{d−1}} U_1(1, i_1, k_1) U_2(k_1, i_2, k_2) · · · U_d(k_{d−1}, i_d, 1).
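As a concrete illustration of the product form (3.1), the following Matlab sketch builds random TT cores and evaluates a single entry by multiplying the corresponding slices. It is an illustration only: the variable names (U, r, n, idx) are chosen here for convenience and are not taken from the TTeMPS implementation used later in the paper.

    % Minimal sketch (illustrative, not from TTeMPS): build random TT cores
    % U{mu} of size r(mu) x n(mu) x r(mu+1), with r(1) = r(d+1) = 1, and
    % evaluate one entry of X via the matrix product (3.1).
    d = 4; n = [10 11 12 13]; r = [1 3 3 3 1];
    U = cell(1, d);
    for mu = 1:d
        U{mu} = randn(r(mu), n(mu), r(mu+1));
    end
    idx = [2 5 7 1];                         % the entry (i1,...,id) to evaluate
    v = 1;                                   % running 1 x r(mu) row vector
    for mu = 1:d
        v = v * reshape(U{mu}(:, idx(mu), :), r(mu), r(mu+1));   % times U_mu(i_mu)
    end
    entry = v;                               % scalar value X(i1,...,id)

Note that one such evaluation costs d−1 small matrix-vector products, which is the basis of the sampling complexity discussed in Section 4.1.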

To be able to exploit the product structure of (3.1), we now focus on ways to manipulate individual cores. We define the left and the right unfoldings,

    U_µ^L ∈ R^{r_{µ−1} n_µ × r_µ}   and   U_µ^R ∈ R^{r_{µ−1} × n_µ r_µ},

by reshaping U_µ ∈ R^{r_{µ−1}×n_µ×r_µ} into matrices of size r_{µ−1} n_µ × r_µ and r_{µ−1} × n_µ r_µ, respectively. Additionally, we can split the tensor X into left and right parts, the interface matrices

    X_{≤µ}(i_1, . . . , i_µ) = U_1(i_1) U_2(i_2) · · · U_µ(i_µ),                X_{≤µ} ∈ R^{n_1 n_2 ··· n_µ × r_µ},
    X_{≥µ}(i_µ, . . . , i_d) = [U_µ(i_µ) U_{µ+1}(i_{µ+1}) · · · U_d(i_d)]^T,    X_{≥µ} ∈ R^{n_µ n_{µ+1} ··· n_d × r_{µ−1}}.

Figure 3.1 depicts a TT tensor of order 6. The interface matrices can be constructed recursively:

    X_{≤µ} = (I_{n_µ} ⊗ X_{≤µ−1}) U_µ^L   and   X_{≥µ} = (X_{≥µ+1} ⊗ I_{n_µ}) (U_µ^R)^T,     (3.2)

where I_m denotes the m×m identity matrix and ⊗ the Kronecker product. The following formula connects the µth unfolding of a tensor with its interface matrices:

    X^{<µ>} = X_{≤µ} X_{≥µ+1}^T.

We call X µ-orthogonal if

    (U_ν^L)^T U_ν^L = I_{r_ν},      and hence  X_{≤ν}^T X_{≤ν} = I_{r_ν},       for all ν = 1, . . . , µ−1,
    U_ν^R (U_ν^R)^T = I_{r_{ν−1}},  and hence  X_{≥ν} X_{≥ν}^T = I_{r_{ν−1}},   for all ν = µ+1, . . . , d.


[Figure 3.1: graphical representation of the cores U_1, . . . , U_6 with mode sizes n_1, . . . , n_6 and ranks r_1, . . . , r_5; the groups forming X_{≤3} and X_{≥5} are indicated.]

Fig. 3.1: Representation of a TT tensor of order d = 6. As r_0 = r_d = 1, they are usually omitted. The two interface matrices X_{≤3} and X_{≥5} are shown. See also [25].

The tensor X is called left-orthogonal if µ = d and right-orthogonal if µ = 1. Orthogonalizing a given TT tensor can be done efficiently using recursive QR decompositions, as we have for the left-orthogonal part:

    X_{≤µ} = Q_{≤µ} R_µ,   with   Q_{≤µ} = (I_{n_µ} ⊗ Q_{≤µ−1}) Q_µ^L   and   Q_{≤0} = 1.

Thus, starting from the left, we compute a QR decomposition of the first core, U_1^L = Q_1^L R_1. The orthogonal factor is kept as the new, left-orthogonal first core and the non-orthogonal part is shifted to the second core. Then, a QR decomposition of this updated second core is performed, and so on. A similar relation holds for the right-orthogonal part, see [34] for more details. Moving from a µ-orthogonal tensor to a (µ+1)-orthogonal one only requires a QR decomposition of U_µ^L. An important consequence that we will use in Section 3.3 is that changing the orthogonalization of a tensor only affects its internal representation and does not change the tensor itself.
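The left-to-right orthogonalization sweep described above can be sketched in a few lines of Matlab. This is an illustration under assumed variable names, not the TTeMPS routine; the cores are random placeholders.

    % Minimal sketch: left-orthogonalize the cores U{1},...,U{d-1} of a TT
    % tensor by a left-to-right sweep of thin QR factorizations, shifting
    % the triangular factor into the next core.
    d = 4; n = [10 11 12 13]; r = [1 3 3 3 1];
    U = cell(1, d);
    for mu = 1:d, U{mu} = randn(r(mu), n(mu), r(mu+1)); end
    for mu = 1:d-1
        UL = reshape(U{mu}, r(mu)*n(mu), r(mu+1));           % left unfolding
        [Q, R] = qr(UL, 0);                                  % thin QR
        rnew = size(Q, 2);
        U{mu} = reshape(Q, r(mu), n(mu), rnew);              % new left-orthogonal core
        UR = reshape(U{mu+1}, r(mu+1), n(mu+1)*r(mu+2));     % right unfolding of next core
        U{mu+1} = reshape(R * UR, rnew, n(mu+1), r(mu+2));   % absorb the non-orthogonal part
        r(mu+1) = rnew;
    end

After the sweep, the tensor is d-orthogonal; only the storage of the cores changed, not the tensor itself.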

As an inner product of two TT tensors we take the standard Euclidean inner product of the vectorizations vec(X), vec(Y) ∈ R^{n_1 n_2 ··· n_d}:

    ⟨X, Y⟩ = ⟨vec(X), vec(Y)⟩ = trace( (X^{<µ>})^T Y^{<µ>} )   for any µ ∈ {1, . . . , d}.     (3.3)

The norm of a TT tensor is induced by the inner product. Note that if X is µ-orthogonal, then we obtain

    ‖X‖ = sqrt(⟨X, X⟩) = ‖U_µ‖ = ‖U_µ^L‖_F = ‖vec(U_µ)‖,

so we only have to compute the norm of the µth core.

The recursive relation (3.2) allows us to extend the singular value decomposition to TT tensors, the TT-SVD [34] procedure. Let X be d-orthogonal. By taking the SVD of U_d^R = Q S V^T and splitting the product between the adjacent cores,

    U_d^R ← V^T,      U_{d−1}^L ← U_{d−1}^L Q S,

the tensor is now (d−1)-orthogonalized. By keeping only s singular values, the rank r_{d−1} between U_{d−1} and U_d is reduced to s. Repeating this procedure down to core U_1 allows us to truncate the rank r of X to any prescribed value r̃ with r̃_µ ≤ r_µ for all µ = 1, . . . , d. Similar to the best-approximation property of the matrix SVD, the truncated tensor X̃ fulfills the following inequality:

    ‖X − X̃‖ ≤ sqrt(d−1) ‖X − P_{M_{r̃}}(X)‖,                                                 (3.4)

with P_{M_{r̃}}(X) being the best approximation to X within the set of rank-r̃ TT tensors, see [34, Cor. 2.4] for a proof.
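Continuing from the left-orthogonalized cores of the previous sketch, one right-to-left truncation sweep of the TT-SVD can be written as follows. The maximal rank rmax is an assumed parameter; the quasi-optimality (3.4) relies on the cores having been orthogonalized beforehand.

    % Minimal sketch of one TT-SVD truncation sweep (assumes the cores U,
    % ranks r and sizes n from the previous sketch, already left-orthogonal).
    rmax = 2;
    for mu = d:-1:2
        UR = reshape(U{mu}, r(mu), n(mu)*r(mu+1));      % right unfolding
        [Q, S, V] = svd(UR, 'econ');                    % U_mu^R = Q*S*V'
        s = min(rmax, size(S, 1));                      % number of singular values kept
        U{mu} = reshape(V(:, 1:s)', s, n(mu), r(mu+1)); % truncated, right-orthogonal core
        UL = reshape(U{mu-1}, r(mu-1)*n(mu-1), r(mu));  % left unfolding of previous core
        U{mu-1} = reshape(UL * Q(:, 1:s) * S(1:s, 1:s), r(mu-1), n(mu-1), s);
        r(mu) = s;                                      % updated TT rank
    end

Each step moves the non-orthogonal part one core to the left, so the sweep ends with a 1-orthogonal (right-orthogonal) tensor of the reduced rank.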


3.2. Manifold of TT tensors of fixed rank. It was shown by Holtz et al. [22] that the set

    M_r = { X ∈ R^{n_1×···×n_d} | rank_TT(X) = r }

forms a smooth embedded submanifold of R^{n_1×···×n_d} of dimension

    dim M_r = ∑_{µ=1}^{d} r_{µ−1} n_µ r_µ − ∑_{µ=1}^{d−1} r_µ².                              (3.5)

The inner product (3.3) induces a Euclidean metric on the embedding space R^{n_1×···×n_d}. When equipped with this metric, M_r becomes a Riemannian manifold. An analogous result is also available for the HT format [46].
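For concreteness, the following Matlab lines (an illustrative sketch, not part of TTeMPS) evaluate the dimension formula (3.5) for an assumed example with d = 10, n_µ = 100 and r_µ = 10.

    % Minimal sketch: evaluate dim M_r from (3.5) for given mode sizes n(1..d)
    % and TT ranks r(1..d+1) with r(1) = r(d+1) = 1.
    d = 10; n = 100 * ones(1, d); r = [1, 10*ones(1, d-1), 1];
    dimM = sum(r(1:d) .* n .* r(2:d+1)) - sum(r(2:d).^2);
    % For this example dimM = 81100, far fewer than the 100^10 entries of
    % the full tensor.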

3.3. The tangent space T_X M_r. Assume X ∈ R^{n_1×···×n_d} is represented in the TT format as

    X(i_1, . . . , i_d) = U_1(i_1) U_2(i_2) · · · U_d(i_d)

with left-orthogonal cores U_µ, (U_µ^L)^T U_µ^L = I_{r_µ} for all µ = 1, . . . , d−1. By the product rule, any element of the tangent space T_X M_r can be uniquely represented in the following way:

    Y(i_1, . . . , i_d) = δU_1(i_1) U_2(i_2) · · · U_d(i_d)
                        + U_1(i_1) δU_2(i_2) · · · U_d(i_d)                                   (3.6)
                        + · · · + U_1(i_1) U_2(i_2) · · · δU_d(i_d),

with the gauge condition (U_µ^L)^T δU_µ^L = 0 for all µ = 1, . . . , d−1. There is no gauge condition for the last core. This representation has been introduced by Holtz et al. [22].

For the orthogonal projection P_{T_X M_r} : R^{n_1×···×n_d} → T_X M_r of an arbitrary tensor Z ∈ R^{n_1×···×n_d}, Lubich et al. [30] derived the following explicit expression for the first-order variations δU_µ:

    δU_µ^L = (I_{n_µ r_{µ−1}} − P_µ) (I_{n_µ} ⊗ X_{≤µ−1})^T Z^{<µ>} X_{≥µ+1} (X_{≥µ+1}^T X_{≥µ+1})^{−1},     (3.7)

where P_µ denotes the orthogonal projection onto the range of U_µ^L. (Note that the expression is stated for the left unfolding δU_µ^L; from this, δU_µ can be directly obtained by reshaping. This notation will be used several times throughout this manuscript.) The expression is relatively straightforward to implement using tensor contractions along the modes 1, . . . , µ−1, µ+1, . . . , d. Note that the inverse (X_{≥µ+1}^T X_{≥µ+1})^{−1} potentially leads to numerical instabilities or is not even well-defined: for example, X_{≥µ+1}^T X_{≥µ+1} is singular in case the right unfoldings U_{µ+1}^R, . . . , U_d^R do not have full rank.

To circumvent these numerical instabilities, we will work with a different parametrization of the tangent space, first proposed in [24]: observe that in the representation (3.6), each individual summand is a TT tensor which can be orthogonalized individually without changing the overall tangent tensor. Hence, to obtain the µth summand, we can µ-orthogonalize X,

    X(i_1, . . . , i_d) = U_1(i_1) · · · U_{µ−1}(i_{µ−1}) U_µ(i_µ) V_{µ+1}(i_{µ+1}) · · · V_d(i_d),


where the cores U_1, . . . , U_{µ−1} are left-orthogonal, the cores V_{µ+1}, . . . , V_d are right-orthogonal, and the µth core U_µ is neither left- nor right-orthogonal. Then, the resulting first-order variation is given by

    U_1(i_1) · · · U_{µ−1}(i_{µ−1}) δU_µ(i_µ) V_{µ+1}(i_{µ+1}) · · · V_d(i_d).                (3.8)

The first-order variations δU_µ still fulfill the gauge condition (U_µ^L)^T δU_µ^L = 0, µ = 1, . . . , d−1. Because of the right-orthogonalization of X_{≥µ+1}, the projection (3.7) simplifies to

    δU_µ^L = (I_{n_µ r_{µ−1}} − P_µ) (I_{n_µ} ⊗ X_{≤µ−1})^T Z^{<µ>} X_{≥µ+1},                (3.9)

because now (X_{≥µ+1}^T X_{≥µ+1})^{−1} is the identity matrix. Hence, we are able to circumvent the explicit inverse. Due to the individual orthogonalizations of the summands, a tangent tensor can then be written in the form

    Y(i_1, . . . , i_d) = δU_1(i_1) V_2(i_2) · · · V_d(i_d)
                        + U_1(i_1) δU_2(i_2) · · · V_d(i_d)                                   (3.10)
                        + · · · + U_1(i_1) U_2(i_2) · · · δU_d(i_d).

Making use of the structure of the sum (3.10), we observe that a tangent tensor in T_X M_r can also be represented as a TT tensor:

Remark 1. Let Y ∈ T_X M_r be given in the representation (3.10). Then it can be identified with a TT tensor of TT-rank at most (1, 2r_1, 2r_2, . . . , 2r_{d−1}, 1):

    Y(i_1, . . . , i_d) = W_1(i_1) W_2(i_2) · · · W_d(i_d),                                   (3.11)

with cores given for µ = 2, . . . , d−1 by

    W_1(i_1) = [ δU_1(i_1)   U_1(i_1) ],    W_µ(i_µ) = [ V_µ(i_µ)   0 ;  δU_µ(i_µ)   U_µ(i_µ) ],    W_d(i_d) = [ V_d(i_d) ;  δU_d(i_d) ].

This identity can be easily verified by multiplying out the matrix product. The resulting cores W_µ are of size 2r_{µ−1} × n_µ × 2r_µ for µ = 2, . . . , d−1.

This result allows us to efficiently handle elements of the tangent space in the same way as elements of the manifold, directly reusing all implemented operations.
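A possible way to assemble the cores W_µ of Remark 1 is sketched below in Matlab. The cores Uc, Vc and the variations dU are random placeholders here; in RTTC they would come from the µ-orthogonalizations of X and from the projection (3.9). This is an illustration only, not the TTeMPS implementation.

    % Minimal sketch of Remark 1: assemble the TT cores W{mu} of a tangent
    % tensor from left-orthogonal cores Uc, right-orthogonal cores Vc and
    % first-order variations dU (all placeholders of size r(mu) x n(mu) x r(mu+1)).
    d = 4; n = [6 7 8 9]; r = [1 3 3 3 1];
    Uc = cell(1,d); Vc = cell(1,d); dU = cell(1,d);
    for mu = 1:d
        Uc{mu} = randn(r(mu), n(mu), r(mu+1));
        Vc{mu} = randn(r(mu), n(mu), r(mu+1));
        dU{mu} = randn(r(mu), n(mu), r(mu+1));
    end
    W = cell(1, d);
    W{1} = cat(3, dU{1}, Uc{1});                   % W_1(i_1) = [dU_1(i_1)  U_1(i_1)]
    for mu = 2:d-1
        W{mu} = zeros(2*r(mu), n(mu), 2*r(mu+1));
        for i = 1:n(mu)
            V_ = reshape(Vc{mu}(:,i,:), r(mu), r(mu+1));
            U_ = reshape(Uc{mu}(:,i,:), r(mu), r(mu+1));
            D_ = reshape(dU{mu}(:,i,:), r(mu), r(mu+1));
            blk = [V_, zeros(r(mu), r(mu+1)); D_, U_];     % 2x2 block slice W_mu(i_mu)
            W{mu}(:, i, :) = reshape(blk, 2*r(mu), 1, 2*r(mu+1));
        end
    end
    W{d} = cat(1, Vc{d}, dU{d});                   % W_d(i_d) = [V_d(i_d); dU_d(i_d)]

With the cores W{mu} in hand, a tangent tensor can be sampled, added or truncated with exactly the same routines as a regular TT tensor of rank 2r.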

4. A Riemannian nonlinear conjugate gradient scheme. With the basic properties of elements in M_r and T_X M_r discussed, we can now describe all necessary steps of RTTC in detail. For each part, we will also mention its computational complexity. To simplify the presentation, we state the operation count in terms of r := max_µ r_µ and n := max_µ n_µ.

4.1. Evaluation of the cost function and sampling of TT tensors. In the tensor completion problem (1.1), the cost function is given by f(X) = ‖P_Ω X − P_Ω A‖²/2. As the data samples P_Ω A are supplied as input, evaluating the cost function reduces to evaluating the entries of the sparse tensor P_Ω X.


Sampling of TT tensors. We can compute the application of the sampling operator P_Ω, see (1.2) for its definition, to a TT tensor X by direct evaluation of the matrix product for each sampling point:

    X(i_1, i_2, . . . , i_d) = U_1(i_1) U_2(i_2) · · · U_d(i_d).

Thus, for each sampling point, we need to compute d−1 matrix-vector multiplications. Hence, evaluating the tensor at |Ω| sampling points involves O(|Ω|(d−1)r²) operations.

Sampling of TT tangent tensors. Writing the tangent tensor in the form (3.11), we can reuse the sampling routine for TT tensors with an acceptable amount of extra cost: the rank is doubled, hence the number of operations is quadrupled.
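A direct, unoptimized Matlab version of this sampling step is sketched below. It reuses the cores U, ranks r and mode sizes n from the sketch after (3.1) and generates a hypothetical index set Omega; the paper's actual implementation performs this loop in C/Mex, see Section 5.

    % Minimal sketch of P_Omega applied to a TT tensor: evaluate (3.1) for
    % every sampling point in a |Omega| x d array of multi-indices.
    nsamp = 500;
    Omega = zeros(nsamp, d);
    for mu = 1:d, Omega(:, mu) = randi(n(mu), nsamp, 1); end
    vals = zeros(nsamp, 1);
    for s = 1:nsamp
        v = 1;
        for mu = 1:d
            v = v * reshape(U{mu}(:, Omega(s, mu), :), r(mu), r(mu+1));
        end
        vals(s) = v;                 % entry X(i1,...,id) of the s-th sample
    end

The cost is |Ω|(d−1) small matrix-vector products, in line with the operation count stated above.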

4.2. Riemannian gradient. To obtain the descent direction in our optimization scheme, we need the Riemannian gradient of the objective function f(X) = (1/2)‖P_Ω X − P_Ω A‖².

Proposition 4.1 ([1, Chap. 3.6]). Let f : R^{n_1×···×n_d} → R be a cost function with Euclidean gradient ∇f_X at the point X ∈ M_r. Then the Riemannian gradient of f : M_r → R is given by grad f(X) = P_{T_X M_r}(∇f_X).

Hence, the Riemannian gradient is obtained by projecting the Euclidean gradient ∇f = P_Ω X − P_Ω A onto the tangent space:

    grad f(X) = P_{T_X M_r}(P_Ω X − P_Ω A).                                                  (4.1)

As the Euclidean gradient Z = P_Ω X − P_Ω A ∈ R^{n_1×···×n_d} is a very large but sparse tensor, the projection onto the tangent space has to be handled with care. When successively calculating the projection (3.9), the intermediate results are, in general, dense already after the first step. For example, the intermediate result Z^{<µ>} X_{≥µ+1} is a huge dense matrix of size n_1 n_2 ··· n_µ × r_µ, whereas the resulting δU_µ^L is only of size r_{µ−1} n_µ × r_µ. Let us restate the equations of the first-order variations δU_µ:

    δU_µ^L = (I_{r_{µ−1} n_µ} − P_µ)(I_{n_µ} ⊗ X_{≤µ−1})^T Z^{<µ>} X_{≥µ+1},   for µ = 1, . . . , d−1,
    δU_d^L = (I_{n_d} ⊗ X_{≤d−1})^T Z^{<d>}.

The projection (I_{r_{µ−1} n_µ} − P_µ) is a straightforward matrix multiplication using P_µ = Q_µ Q_µ^T, where Q_µ is an orthogonal basis of the range of U_µ^L obtained from a QR decomposition, and it can be applied separately at the end. Instead, we focus on

    C_µ^L := (I_{n_µ} ⊗ X_{≤µ−1})^T Z^{<µ>} X_{≥µ+1}.

Let Θ_{i_µ} ⊂ Ω be the set of sampling points (j_1, . . . , j_d) whose µth index j_µ coincides with i_µ:

    Θ_{i_µ} = { (j_1, . . . , j_d) ∈ Ω | j_µ = i_µ } ⊂ Ω.                                    (4.2)

The i_µth slice of C_µ is given by a sum over Θ_{i_µ}, as the other entries do not contribute:

    C_µ(i_µ) = ∑_{(j_1,...,j_d) ∈ Θ_{i_µ}} X_{≤µ−1}(j_1, . . . , j_{µ−1})^T Z(j_1, . . . , j_d) X_{≥µ+1}(j_{µ+1}, . . . , j_d),

which can be reformulated into

    C_µ(i_µ) = ∑_{(j_1,...,j_d) ∈ Θ_{i_µ}} Z(j_1, . . . , j_d) [ U_1(j_1) · · · U_{µ−1}(j_{µ−1}) ]^T [ V_{µ+1}(j_{µ+1}) · · · V_d(j_d) ]^T.


The algorithmic implementation of all first-order variations δU_µ for µ = 1, . . . , d is shown in Algorithm 2, with reuse of intermediate matrix products. For each entry, we have to perform d−1 small matrix-vector products, for a total cost of O(d|Ω|r²). The projection onto the orthogonal complement at the end adds another O(dnr³) operations.

Algorithm 2 Computing the Riemannian gradient
Input: Left-orthogonal cores U_µ and right-orthogonal cores V_µ of X ∈ M_r, Euclidean gradient Z
Output: First-order variations δU_µ for µ = 1, . . . , d.
  % Initialize cores
  for µ = 1, . . . , d do
    C_µ ← zeros(r_{µ−1}, n_µ, r_µ)
  end for
  for (j_1, j_2, . . . , j_d) ∈ Ω do
    % Precompute left matrix products
    UL_1 ← U_1(j_1)
    for µ = 2, . . . , d−1 do
      UL_µ ← UL_{µ−1} U_µ(j_µ)
    end for
    VR ← V_d(j_d)
    % Calculate the cores beginning from the right
    C_d(j_d) ← C_d(j_d) + Z(j_1, j_2, . . . , j_d) · UL_{d−1}^T
    for µ = d−1, . . . , 2 do
      C_µ(j_µ) ← C_µ(j_µ) + Z(j_1, j_2, . . . , j_d) · UL_{µ−1}^T VR^T
      VR ← V_µ(j_µ) VR
    end for
    C_1(j_1) ← C_1(j_1) + Z(j_1, j_2, . . . , j_d) · VR^T
  end for
  δU_d ← C_d
  % Project cores 1, . . . , d−1 onto the orthogonal complement of the range of U_µ^L
  for µ = 1, . . . , d−1 do
    [Q_µ, R_µ] = qr(U_µ^L)
    δU_µ^L ← C_µ^L − Q_µ Q_µ^T C_µ^L
  end for

4.3. Retraction. Taking one step from the current iterate X_k ∈ M_r in the search direction ξ ∈ T_{X_k} M_r yields a new point in the tangent space which is, in general, not an element of M_r anymore. In Riemannian geometry, the well-known exponential map provides a local diffeomorphism between a neighborhood of X_k and a neighborhood around the origin of T_{X_k} M_r. Assuming not-too-large step sizes, the exponential map thus allows us to map the new point X_k + ξ back onto the manifold. While this map is given in a simple form for some manifolds, such as the sphere, it is computationally infeasible for our manifold M_r. However, as proposed in [1], a first-order approximation of this map, a so-called retraction, is sufficient to ensure convergence of gradient-based optimization schemes.

Definition 4.2 (Retraction, [2, Def. 1]). Let M be a smooth submanifold of R^{n_1×···×n_d}. Let 0_x denote the zero element of T_x M. A mapping R from the tangent bundle TM into M is said to be a retraction on M around x ∈ M if there exists a neighborhood U of (x, 0_x) in TM such that the following properties hold:
(a) We have U ⊆ dom(R) and the restriction R : U → M is smooth.
(b) R(y, 0_y) = y for all (y, 0_y) ∈ U.
(c) With the canonical identification T_{0_x} T_x M ≃ T_x M, R satisfies the local rigidity condition

    DR(x, ·)(0_x) = id_{T_x M}   for all (x, 0_x) ∈ U,

where id_{T_x M} denotes the identity mapping on T_x M.

For matrices, multiple variants of retractions onto the low-rank manifold have been discussed in a recent survey paper by Absil and Oseledets [3]. For M_r, a retraction maps the new point to a rank-r tensor. A computationally efficient method is given by the TT-SVD procedure, which satisfies the quasi-best approximation property (3.4). This is very similar to the Tucker case discussed in [26], where it was shown that the HOSVD (which possesses an analogous quasi-best approximation property) indeed fulfills all necessary properties of a retraction map.

Proposition 4.3. Let P_r^{TT} : R^{n_1×···×n_d} → M_r denote the projection operator performing rank truncation using the TT-SVD. Then,

    R : T M_r → M_r,   (X, ξ) ↦ P_r^{TT}(X + ξ)                                              (4.3)

defines a retraction on M_r around X according to Def. 4.2.

Proof. (a) There exists a small neighborhood U ⊂ R^{n_1×···×n_d} of X ∈ M_r such that the TT-SVD operator P_r^{TT} : U → M_r is smooth, as it basically only consists of unfoldings and sequential applications of locally smooth SVDs, moving from one core to the next. Thus, the retraction (4.3) is locally smooth around (X, 0_X), as it can be written as R(X, ξ) = P_r^{TT} ∘ p(X, ξ) with the smooth function p : T M_r → R^{n_1×···×n_d}, p(X, ξ) = X + ξ.
(b) We have the obvious identity P_r^{TT}(X) = X for all X ∈ M_r.
(c) We use the quasi-optimality (3.4). We have that ‖(X + tξ) − P_{M_r}(X + tξ)‖ = O(t²) for t → 0, as T_X M_r is a first-order approximation of M_r around X. Hence,

    ‖(X + tξ) − R(X, tξ)‖ ≤ sqrt(d−1) ‖(X + tξ) − P_{M_r}(X + tξ)‖ = O(t²).

Therefore, R(X, tξ) = (X + tξ) + O(t²) and (d/dt) R(X, tξ)|_{t=0} = ξ, which is equivalent to DR(X, ·)(0_X) = id_{T_X M_r}.

To efficiently apply the TT-SVD procedure, we exploit the representation (3.10) and notice that the new point, obtained by a step Y ∈ T_X M_r of length α from the point X ∈ M_r, can be written as

    X(i_1, . . . , i_d) + αY(i_1, . . . , i_d) = [ αδU_1(i_1)   U_1(i_1) ] [ V_2(i_2)   0 ;  αδU_2(i_2)   U_2(i_2) ] · · ·
        · · · [ V_{d−1}(i_{d−1})   0 ;  αδU_{d−1}(i_{d−1})   U_{d−1}(i_{d−1}) ] [ V_d(i_d) ;  U_d(i_d) + αδU_d(i_d) ].

Hence, retracting back to the rank-r manifold M_r is equivalent to truncating a rank-2r TT tensor to rank r, for a total cost of approximately O(dnr³). We refer to [34, Section 3] for a detailed description of the TT-SVD procedure and its computational complexity. Furthermore, the orthogonality relations between U_µ and δU_µ can be used to save some work when reorthogonalizing.


4.4. Search direction update and vector transport. In the nonlinear conjugate gradient scheme, the new search direction η_k is computed as a linear combination of the current gradient ξ_k and the previous direction η_{k−1} scaled by a factor β_k,

    η_k = −ξ_k + β_k T_{X_{k−1}→X_k} η_{k−1},

where T_{X_{k−1}→X_k} is the vector transport which maps elements from T_{X_{k−1}} M_r to T_{X_k} M_r, see [1] for a formal definition. The orthogonal projection onto the tangent space, P_{T_{X_k} M_r}, yields a vector transport, as M_r is an embedded submanifold of R^{n_1×···×n_d}, see [1, Sec. 8.1.3].

To compute the projection of a certain tangent tensor Y onto T_X M_r, we make use of the identity (3.11), reducing it to the projection of a TT tensor of rank 2r onto T_X M_r. Let Y thus be represented as

    Y(i_1, . . . , i_d) = W_1(i_1) W_2(i_2) · · · W_d(i_d).

Then, we have to compute the first-order variations

    δU_µ^L = (I_{n_µ r_{µ−1}} − P_µ) (I_{n_µ} ⊗ X_{≤µ−1})^T Y^{<µ>} X_{≥µ+1}
           = (I_{n_µ r_{µ−1}} − P_µ) (I_{n_µ} ⊗ X_{≤µ−1})^T Y_{≤µ} Y_{≥µ+1}^T X_{≥µ+1},

which, for each µ, can be efficiently evaluated as a tensor contraction along all dimensions 1, . . . , d except mode µ. This can be checked using the recursive relations (3.2),

    Y_{≥µ+1}^T X_{≥µ+1} = W_{µ+1}^R (Y_{≥µ+2}^T ⊗ I_{n_{µ+1}}) (X_{≥µ+2} ⊗ I_{n_{µ+1}}) (V_{µ+1}^R)^T
                        = W_{µ+1}^R (Y_{≥µ+2}^T X_{≥µ+2} ⊗ I_{n_{µ+1}}) (V_{µ+1}^R)^T

and Y_{≥d}^T X_{≥d} = W_d^R (V_d^R)^T. An analogous relation holds for X_{≤µ−1}^T Y_{≤µ−1}.
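The recursion above translates directly into a short sweep over the cores. The following Matlab sketch is an illustration under assumed variable names (AY, AX for the cores of Y and X, rY, rX for their ranks); it is not taken from the TTeMPS implementation.

    % Minimal sketch: recursively compute G{mu} = Y_{>=mu}^T * X_{>=mu} for
    % two TT tensors with cores AY, AX of size r(mu) x n(mu) x r(mu+1).
    d = 5; n = 8 * ones(1, d);
    rX = [1 3 3 3 3 1]; rY = [1 6 6 6 6 1];          % Y has (at most) doubled ranks
    AX = cell(1, d); AY = cell(1, d);
    for mu = 1:d
        AX{mu} = randn(rX(mu), n(mu), rX(mu+1));
        AY{mu} = randn(rY(mu), n(mu), rY(mu+1));
    end
    G = cell(1, d + 1);
    G{d+1} = 1;                                      % empty product
    for mu = d:-1:2
        Gmu = zeros(rY(mu), rX(mu));
        for i = 1:n(mu)
            Ay = reshape(AY{mu}(:, i, :), rY(mu), rY(mu+1));   % slice of Y-core
            Ax = reshape(AX{mu}(:, i, :), rX(mu), rX(mu+1));   % slice of X-core
            Gmu = Gmu + Ay * G{mu+1} * Ax';
        end
        G{mu} = Gmu;      % equals Y_{>=mu}^T * X_{>=mu}
    end

An analogous left-to-right sweep produces the matrices X_{≤µ}^T Y_{≤µ}; both are reused for every δU_µ, as described next.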

As these quantities are shared between all δU_µ, we first precompute Y_{≥µ}^T X_{≥µ} for all µ = 2, . . . , d and X_{≤µ}^T Y_{≤µ} for µ = 1, . . . , d−1, using O(dr³n) operations. Combining this with the orthogonal projection (I_{n_µ r_{µ−1}} − P_µ) yields a total amount of O(dr³n) flops.

Many options exist for the choice of β_k within the framework of nonlinear conjugate gradients. Here, we choose the Fletcher-Reeves update,

    β_k = ‖ξ_k‖² / ‖ξ_{k−1}‖²,

as the computation of the norm of a tangent tensor ξ ∈ T_X M_r is particularly simple when using the representation (3.10). Due to the µ-orthogonality of each term in the sum, we have that ‖ξ‖² = ∑_{µ=1}^{d} ‖δU_µ‖². Hence, we only need about 2dnr² operations to compute β_k. In our experiments, we observed that choosing a different scheme, such as Polak-Ribière+, has almost no influence on the convergence of RTTC.

4.5. Line search procedure. Determining a good step size α in RTTC is crucial to ensure fast convergence of the method. The optimal choice would be the minimizer of the objective function along the geodesic determined by the current search direction η_k: α_k = argmin_α f(R(X_k, αη_k)). As described in [48, 26], this nonlinear optimization problem can be linearized and then solved explicitly by dropping the retraction R and thereby minimizing only within the tangent space,

    α_k ≈ argmin_α f(X_k + αη_k) = ⟨P_Ω η_k, P_Ω(A − X_k)⟩ / ⟨P_Ω η_k, P_Ω η_k⟩.


This estimate can be calculated in O(|Ω|(d−1)r²) operations. The accuracy of this approximation depends on the curvature of the manifold at the current iterate, but in all our experiments, we have never observed a situation where this estimate was not good enough. To make sure that the iterates fulfill the Wolfe conditions, a simple Armijo backtracking scheme can be added, see e.g. [33].

4.6. Stopping criteria. During the course of the optimization procedure, we monitor the current relative error on the sampling set,

    ε_Ω(X_k) = ‖P_Ω A − P_Ω X_k‖ / ‖P_Ω A‖.

This requires almost no extra computations, as the Euclidean gradient ∇f = P_Ω X_k − P_Ω A is already calculated in each step.

When the sampling rate is very low or the estimated rank r is chosen larger than the true rank of the underlying data, one can often observe overfitting. In these cases, the residual error does converge to zero as expected, but the obtained reconstruction is a bad approximation of the original data. This is to be expected, as the degrees of freedom in the model exceed the information available. While this intrinsic problem cannot be completely avoided, it is helpful to be able to detect these cases of overfitting. Then, we can act accordingly and reduce the degrees of freedom of the model (by choosing a lower estimated rank r) or increase the number of sampling points, if possible. Hence, we also measure the relative error on a control set Ω_C with Ω_C ∩ Ω = ∅:

    ε_{Ω_C}(X_k) = ‖P_{Ω_C} A − P_{Ω_C} X_k‖ / ‖P_{Ω_C} A‖.

This control set is either given or can be obtained by random subsampling of the given data. Numerical experiments show that it usually suffices to choose only a few control samples, say |Ω_C| = 100. RTTC is run until either the prescribed tolerance is attained for ε_Ω(X_k) and ε_{Ω_C}(X_k) or the maximum number of iterations is reached. Furthermore, to detect stagnation in the optimization procedure, we check whether the relative change in ε_Ω and ε_{Ω_C} between two iterates is below a certain threshold δ:

    |ε_Ω(X_k) − ε_Ω(X_{k+1})| / |ε_Ω(X_k)| < δ,      |ε_{Ω_C}(X_k) − ε_{Ω_C}(X_{k+1})| / |ε_{Ω_C}(X_k)| < δ.      (4.4)

In practice, we choose δ ≈ 10^{−4}, but adjustments can be necessary depending on the underlying data.

4.7. Scaling of the algorithm: computational complexity. If we sum up the computational complexity needed for one iteration of RTTC, we arrive at a total cost of

    O(dnr³ + d|Ω|r²)                                                                         (4.5)

operations. Note that for reconstruction, the number of samples |Ω| should be O(dnr²), as it needs to be at least as large as the dimension of the manifold (3.5). Therefore, in realistic cases, the algorithm has a complexity of O(d²nr⁴).


4.8. Convergence of the algorithm. The convergence analysis of RTTC follows the same steps as the analysis employed in the matrix case [48] and in the Tucker tensor case [26].

First, to fulfill the assumptions of standard convergence results for Riemannian line search methods, we have to safeguard the heuristic linearized line search with an Armijo-type backtracking scheme. Then, we can apply [1, Theorem 4.3.1] to obtain

Proposition 4.4. Let (X_k)_{k∈N} be an infinite sequence of iterates generated by RTTC with backtracking line search. Then, every accumulation point X∗ of (X_k) is a critical point of the cost function f and hence satisfies grad f(X∗) = 0, that is, P_{T_{X∗} M_r}(P_Ω X∗) = P_{T_{X∗} M_r}(P_Ω A).

This result shows that, in the tangent space, reconstruction is achieved for all limit points of RTTC. Unfortunately, as M_r is not closed, X∗ is not necessarily an element of M_r anymore, and hence not a valid solution of problem (1.1). To enforce this, we regularize the cost function in such a way that the iterates X_k stay inside a compact subset of M_r. Analogous to [26], we define the modified cost function g as

    g : M_r → R,    X ↦ f(X) + λ² ∑_{µ=1}^{d} ( ‖X^{<µ>}‖_F² + ‖(X^{<µ>})†‖_F² ),    λ > 0,        (4.6)

where (X^{<µ>})† denotes the pseudo-inverse of X^{<µ>} and λ is the regularization parameter. With this modification, regularizing the singular values of the matricizations of X, we obtain the following result:

Proposition 4.5 (c.f. [26, Prop. 3.2]). Let (X_k)_{k∈N} be an infinite sequence of iterates generated by Algorithm 1 but with the modified cost function g defined in (4.6). Then lim_{k→∞} ‖grad g(X_k)‖ = 0.

Proof. The cost function is guaranteed to decrease due to the line search, g(X_k) ≤ g(X_0) =: C_0², which yields λ² ∑_{µ=1}^{d} ‖X_k^{<µ>}‖_F² ≤ C_0² and λ² ∑_{µ=1}^{d} ‖(X_k^{<µ>})†‖_F² ≤ C_0². From this, we can deduce upper and lower bounds for the largest and smallest singular values of the matricizations,

    σ_max(X_k^{<µ>}) ≤ ‖X_k^{<µ>}‖ ≤ C_0 λ^{−1},      σ_min^{−1}(X_k^{<µ>}) ≤ ‖(X_k^{<µ>})†‖ ≤ C_0 λ^{−1}.

Therefore, all iterates are guaranteed to stay inside the compact set

    B := { X ∈ M_r | σ_max(X^{<µ>}) ≤ C_0 λ^{−1},  σ_min(X^{<µ>}) ≥ C_0^{−1} λ,  µ = 1, . . . , d }.

Now, as B is compact, the sequence (X_k) has an accumulation point inside B and we can use Proposition 4.4 to obtain lim_{k→∞} ‖grad g(X_k)‖ = 0.

Note that the regularization parameter λ can be chosen arbitrarily small. If the core unfoldings are always of full rank during the optimization procedure (even when λ → 0), then the accumulation points X∗ are guaranteed to stay inside M_r and grad f(X∗) = P_{T_{X∗} M_r}(P_Ω X∗ − P_Ω A) → 0 as λ → 0. In this case, optimizing the modified cost function (4.6) is equivalent to optimizing the original cost function.

Asymptotic convergence rates are, unfortunately, not available for Riemannian conjugate gradient schemes. For steepest descent, linear convergence can be shown, with a constant depending on the eigenvalues of the Riemannian Hessian at the critical point, see [1, Theorem 4.5.6].

A very different approach which avoids the need for regularization has been discussed in a recent work by Schneider and Uschmajew [40], extending Riemannian optimization techniques to the algebraic variety M_{≤r} := { X ∈ R^{n×n} | rank(X) ≤ r }, the closure of M_r. Convergence results for the gradient scheme proposed therein follow from the Łojasiewicz inequality. Extension of this work to the tensor train format remains an open question.

4.9. Adaptive rank adjustment. There is an obvious disadvantage in the way the tensor completion problem (1.1) is defined: the rank of the underlying problem has to be known a priori. For high-dimensional TT tensors, this problem is much more severe than in the matrix case, as we not only have one unknown rank, but a whole rank tuple r = (r_0, r_1, . . . , r_d). For real-world data, this information is usually not available and the ranks can be very different from each other, see e.g. [34, Table 3.2] for a common distribution of the ranks.

Instead, we can require that the solution should have rank smaller than a certain prescribed maximal rank. This maximal rank r_max is determined by the computational resources available and not necessarily by the underlying data. As the rank has a strong influence on the complexity of the algorithm, see Section 4.7, this motivates the following rank-adaptive procedure. First, we run RTTC with r^{(1)} = (1, 1, . . . , 1) to obtain a result X^{(1)}. Then, we add a rank-1 correction term R and run RTTC again with starting guess X^{(1)} + R and r^{(2)} = rank(X^{(1)} + R) to obtain X^{(2)}. An optimal rank correction would be given by the minimizer of

    min { f(X^{(1)} + R) :  R ∈ R^{n_1×···×n_d},  rank_TT(R) = (1, . . . , 1) },

which is not computationally feasible to compute exactly. Instead, we can choose R to be a rank-1 approximation of the negative Euclidean gradient −∇f(X^{(1)}), so that we perform one step of steepest descent to obtain X^{(2)}, see [47] for a discussion of such greedy rank-1 updates in the context of matrix completion. In our case, the high dimensionality makes a global update very costly. Instead, we can perform a local update that increases only one rank index at a time, cycling through each r_µ, µ = 1, . . . , d−1. To modify the rank r_µ, we set

    U_µ^L ← [ U_µ^L   R_µ ],      U_{µ+1}^R ← [ U_{µ+1}^R ; R_{µ+1} ].                       (4.7)

When we choose R_µ ∈ R^{r_{µ−1} n_µ × 1} and R_{µ+1} ∈ R^{1 × n_{µ+1} r_{µ+1}} from a local projection of the approximated gradient, we obtain an AMEn-like procedure [13]. While very successful in the case of linear systems, we have found that choosing R_µ and R_{µ+1} as random vectors with small magnitude, say 10^{−8}, performs just as well. The resulting rank-adaptive algorithm is shown in Algorithm 3. This procedure is repeated until either the prescribed residual tolerance is fulfilled or r = r_max is reached.
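The padding step (4.7) can be sketched in a few lines of Matlab; the cores and the choice µ = 3 below are an assumed example, and the magnitude 10^{−8} follows the discussion above.

    % Minimal sketch of the local rank increase (4.7) for a TT tensor with
    % cores U{mu} of size r(mu) x n(mu) x r(mu+1).
    d = 5; n = 10*ones(1,d); r = [1 2 2 2 2 1];
    U = cell(1,d); for mu = 1:d, U{mu} = randn(r(mu), n(mu), r(mu+1)); end
    mu = 3;                                                   % rank index to increase
    UL = reshape(U{mu},   r(mu)*n(mu), r(mu+1));              % left unfolding
    UR = reshape(U{mu+1}, r(mu+1),     n(mu+1)*r(mu+2));      % right unfolding
    UL = [UL, 1e-8 * randn(r(mu)*n(mu), 1)];                  % append column R_mu
    UR = [UR; 1e-8 * randn(1, n(mu+1)*r(mu+2))];              % append row R_{mu+1}
    r(mu+1) = r(mu+1) + 1;
    U{mu}   = reshape(UL, r(mu),   n(mu),   r(mu+1));
    U{mu+1} = reshape(UR, r(mu+1), n(mu+1), r(mu+2));

Because the appended entries are tiny, the padded tensor stays close to the previous iterate while giving the optimizer one additional rank direction to explore.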

As the main goal of the first steps is to steer the iterates into the right direction, only a few steps of RTTC are required at each level. We make use of the stagnation criterion (4.4) with a crude tolerance of, say, δ = 0.01. With this choice, RTTC usually needs fewer than 10-15 iteration steps at each level. At the final rank, a finer tolerance is used to allow for more iterations.

In the worst-case scenario, this leads to (d−1) r_max executions of RTTC. To limit the occurrences of unnecessary rank increases, it can be helpful to revert the rank in the current mode if the increase did not improve ε_{Ω_C}. To detect such cases, we measure the relative change from ε_{Ω_C}(X) to ε_{Ω_C}(X_new). If it did improve, we always accept the step; additionally, the parameter ρ ≥ 0 allows for slight increases of the error in one rank-increasing step, with the hope that it will decrease in later steps. In the numerical experiments, we chose ρ = 1.


Algorithm 3 Adaptive rank adjustment
Input: Sampled data P_Ω A, maximal rank r_max, acceptance parameter ρ
Output: Completed tensor X of TT-rank r, r_µ ≤ r_max,µ.
  X ← random tensor with r := rank_TT(X) = (1, 1, . . . , 1)
  X ← Result of RTTC using X as starting guess.
  for k = 2, . . . , r_max do
    for µ = 1, . . . , d−1 do
      if r_µ ≥ r_max,µ then
        µth rank is maximal. Skip.
      else
        X_new ← Increase µth rank of X to (1, r_1, . . . , r_µ + 1, . . . , r_{d−1}, 1) using (4.7)
        X_new ← Result of Algorithm 1 using X_new as starting guess.
        if (ε_{Ω_C}(X_new) − ε_{Ω_C}(X)) > ρ ε_{Ω_C}(X) then
          Revert step.
        else
          X ← X_new   % accept step.
        end if
      end if
    end for
  end for

As a side remark, we mention that a very different approach to introduce rank-adaptivity could try to replace the retraction step by a rounding procedure which does not truncate to a fixed rank, but adjusts the rank according to some prescribed accuracy. As the step X_k + α η_k has rank up to 2r, see Remark 1, this would allow the rank to be doubled in each step, at least in principle. We have performed a few tests in this direction, but could not achieve good performance.

5. Numerical experiments. In the following section we investigate the performance of the proposed tensor completion algorithm and compare it to existing methods, namely HTOpt [10, 11], ADF [17] and a simple alternating optimization scheme, see Section 5.2.

To quantify the reconstruction quality of the algorithms, we measure the relative error on a test set Γ different from Ω and Ω_C. All sampling sets are created by randomly sampling each index i_µ from a uniform distribution on {1, . . . , n_µ}. We sample more entries than necessary, such that after removing duplicates we end up with a sampling set of the desired size.

As mentioned in the introduction, cross-approximation techniques can be used if the sampling set Ω is not prescribed, but can be chosen freely. This is often the case when dealing with the approximation of high-dimensional functions and solutions of parameter-dependent PDEs. To show the effectiveness of tensor completion in these applications, we compare RTTC to state-of-the-art cross-approximation algorithms based on the TT [14] and HT format [4, 20].

We have implemented RTTC in Matlab based on the TTeMPS toolbox, see http://anchp.epfl.ch/TTeMPS. As shown in Section 4, the calculation of the cost function, the gradient and the linearized line search involve operations on sparse tensors. To obtain an efficient algorithm we have implemented these crucial steps in C using the Mex-function capabilities of Matlab with direct calls to BLAS routines wherever


possible. For example, when evaluating the cost function, see Section 4.1, permuting the storage of the cores of size r_µ × n_µ × r_{µ+1} to arrays of size r_µ × r_{µ+1} × n_µ before starting the sampling procedure allows us to continuously align the matrix blocks in memory for fast calls to the BLAS routine DGEMV.

While the computational complexity, Section 4.7, suggests that most parts of the algorithm are equally expensive, the majority of the wall time is spent on the calculation of the gradient, see Section 4.2, as it involves |Ω| operations on single indices. Inevitably, this leads to unaligned memory accesses, which are a bottleneck on today's computers. This can be partly avoided by an embarrassingly parallel distribution of the loop over all indices in the sampling set Ω, but this is beyond the scope of this article.

All timings have been conducted on an Intel Xeon E31225, 3.10GHz quad-core processor with 8 GB of memory, running Matlab version 2012b.

5.1. Scaling of the algorithm. As a first experiment, we check the computational complexity of RTTC derived in Section 4.7. In Figure 5.1, we show three plots corresponding to the scaling with respect to (left) the tensor size n, (middle) the number of dimensions d and (right) the tensor rank r. In all cases, the original data is a TT tensor of known rank r whose cores consist of entries uniformly, randomly chosen from [0, 1]. In the first case (left), we have chosen d = 10, r = (1, 10, . . . , 10, 1), and mode size n ∈ {20, 40, . . . , 200}. In (middle), we have set the mode size to n = 100, r = (1, 10, . . . , 10, 1), and dimensions ranging from d = 3 to d = 20. Finally, for (right) we set d = 10, n = 100 and r = (1, r, . . . , r, 1) with r ∈ {2, . . . , 25}. For each configuration, we run 10 steps of RTTC with 5 repetitions. The sampling set size |Ω| is chosen such that it scales like O(nr²). The lines are linear fits for n and d and a quartic fit for the tensor rank r. We observe that RTTC indeed scales linearly with n and d and to fourth order in r (due to the scaling of Ω), in accordance with (4.5). Note that if we scale Ω like O(dnr²), then the timing would depend quadratically on d and not linearly.

[Figure 5.1: three plots of time [s] versus mode size n, number of dimensions d, and tensor rank r.]

Fig. 5.1: Scaling of RTTC with respect to tensor size n, number of dimensions d and tensor rank r. In the rightmost figure, in log-log scale, the line corresponds to a fourth-order dependence on the tensor rank. See Section 5.1 for details.

5.2. Comparison to an alternating linear scheme. To evaluate the performance of our proposed algorithm, we have implemented a simple alternating linear scheme (ALS) algorithm to approach the tensor completion problem (1.1). This approach was suggested by Grasedyck, Kluge and Krämer [18].


[Figure 5.2: relative error on Γ versus iteration count (left) and versus time [s] (right), for ALS and RTTC with r = 4, 8, 12.]

Fig. 5.2: Comparison between the convergence of RTTC and ALS. The underlying tensor is a random TT tensor of size n = 50, d = 10 and known rank r ∈ {4, 8, 12}. Left: relative error on Γ with respect to the number of iterations; one iteration of ALS corresponds to one half-sweep. Right: relative error on Γ with respect to time.

In the TT format, algorithms based on ALS-type optimization are a simple but surprisingly effective way to solve such optimization problems. In each step, all cores but one are kept fixed and the optimization problem is reduced to a small optimization problem for a single core. The core is replaced by its locally optimal solution and fixed, while the optimization scheme moves to the neighboring core. This approach has been successfully used to solve linear systems and eigenvalue problems in the TT format, see e.g. [12, 21, 38].

For the tensor completion problem (1.1), one single core optimization step of the ALS scheme is given by

    U_µ = argmin_V (1/2) ∑_{i ∈ Ω} [ A(i_1, . . . , i_d) − U_1(i_1) · · · U_{µ−1}(i_{µ−1}) V(i_µ) U_{µ+1}(i_{µ+1}) · · · U_d(i_d) ]²
        = argmin_V (1/2) ∑_{i ∈ Ω} [ A(i_1, . . . , i_d) − X_{≤µ−1}(i_1, . . . , i_{µ−1}) V(i_µ) X_{≥µ+1}(i_{µ+1}, . . . , i_d)^T ]².

Cycling through all cores U_µ for µ = 1, . . . , d multiple times until convergence yields the full ALS completion scheme. To solve one core optimization step efficiently, we look at the i_µth slice of the new core U_µ, making use of the notation (4.2):

    U_µ(i_µ) = argmin_V (1/2) ∑_{j ∈ Θ_{i_µ}} [ A(j_1, . . . , j_d) − X_{≤µ−1}(j_1, . . . , j_{µ−1}) V(j_µ) X_{≥µ+1}(j_{µ+1}, . . . , j_d)^T ]²
             = argmin_V (1/2) ∑_{j ∈ Θ_{i_µ}} [ A(j_1, . . . , j_d) − ( X_{≥µ+1}(j_{µ+1}, . . . , j_d) ⊗ X_{≤µ−1}(j_1, . . . , j_{µ−1}) ) vec(V(j_µ)) ]²,

which is a linear least squares problem for vec(V(i_µ)) with a system matrix of size |Θ_{i_µ}| × r_{µ−1} r_µ. As ∑_{i_µ=1}^{n_µ} |Θ_{i_µ}| = |Ω|, solving the least squares problems to obtain the new core U_µ can be performed in O(|Ω|r⁴) operations. Updating each core µ = 1, . . . , d once (one so-called half-sweep of the ALS procedure) then results in a total cost of O(d|Ω|r⁴) operations. Compared to the complexity of RTTC, this ALS procedure involves the fourth power of the rank instead of the square. In Figure 5.2, we compare the convergence of the two approaches as a function of iterations and time when reconstructing a random TT tensor of size n = 50, d = 10 and known rank r chosen from {4, 8, 12}. While the iteration count is similar, the ALS approach takes around an order of magnitude more time to converge, slowing down considerably as the rank of the underlying data increases. To make sure that the implementation is similarly optimized as RTTC, the setup of the least squares system matrices is performed in optimized Mex functions, written in C.
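The following Matlab sketch illustrates the least squares problem for one slice V(i_µ). The interface rows and data values are random placeholders, and the variable names are not taken from any of the compared codes; in the actual ALS sweep they are assembled from X and from P_Ω A restricted to Θ_{i_µ}.

    % Minimal sketch of one slice update in the ALS scheme.
    rL = 4; rR = 4;                         % r_{mu-1} and r_mu
    m  = 50;                                % |Theta_{i_mu}|, samples hitting this slice
    Xle = randn(m, rL);                     % rows X_{<=mu-1}(j_1,...,j_{mu-1})
    Xge = randn(m, rR);                     % rows X_{>=mu+1}(j_{mu+1},...,j_d)
    a   = randn(m, 1);                      % data values A(j_1,...,j_d)
    M = zeros(m, rL*rR);
    for s = 1:m
        M(s, :) = kron(Xge(s, :), Xle(s, :));   % row of the system matrix
    end
    V = reshape(M \ a, rL, rR);             % least squares solution, slice V(i_mu)

Building and solving these small systems for all slices and all cores gives the O(d|Ω|r⁴) half-sweep cost quoted above.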

[Figure 5.3: four phase plots of the size of the sampling set Ω versus the tensor size n.]

(a) Phase plot for d = 5. Left: RTTC. Right: ALS.
(b) Phase plot for d = 10. Left: RTTC. Right: ALS.

Fig. 5.3: Phase plots for RTTC and ALS for (a) d = 5 and (b) d = 10, showing the ability to recover the original data as a function of the mode size n and the size of the sampling set Ω. The underlying tensor is a random TT tensor of known rank r = (1, 3, . . . , 3, 1). White means convergent in all 5 runs, black means convergent in no runs. The white dashed curve corresponds to 20 n log(n) for d = 5 and 80 n log(n) for d = 10.

5.3. Reconstruction of random tensors. In Figure 5.3, phase plots are depicted, showing the ability of RTTC and the ALS procedure to recover a tensor as a function of the number of known values of the original tensor. The results are shown for two different numbers of dimensions, d = 5 and d = 10. We run both algorithms 5


times for each combination of tensor size n and sampling set size |Ω|. The brightnessrepresents how many times the iteration converged, where white means 5 out of 5converged and black corresponds to no convergence. We call the iteration convergentif the relative error on the test set Γ is smaller than 10−4 after at most 250 iterations.Note that in almost all convergent cases, the algorithms have reached this prescribedaccuracy within less than 100 steps.

The original data A is constructed as a TT tensor of rank r = (1, 3, . . . , 3, 1) with the entries of each core drawn uniformly at random from [0, 1]. The question of how many samples are required to recover the underlying data has been a very active topic in recent years. For the matrix case (d = 2) it has been shown that around O(nr log(n)) samples are necessary [9, 23], where n is the matrix size and r the rank of the underlying data. To date, these results have not been extended to the tensor case, see e.g. [37] for a discussion. Some approaches, such as [32], were able to prove that O(r^⌊d/2⌋ n^⌈d/2⌉) samples suffice, but numerical experiments, such as those in [25] for d = 3, clearly suggest a much better scaling behaviour, similar to the matrix case. In Figure 5.3 we have included a white dashed curve corresponding to 20n log(n) for d = 5 and 80n log(n) for d = 10.
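These dashed reference curves are plain C · n log(n) heuristics and can be reproduced in a few lines; the constants 20 and 80 are the ones stated above, and the use of the natural logarithm is an assumption of this sketch.

    import numpy as np

    n = np.arange(100, 401)              # mode sizes covered by the phase plots
    omega_d5 = 20 * n * np.log(n)        # heuristic sample-size curve for d = 5
    omega_d10 = 80 * n * np.log(n)       # heuristic sample-size curve for d = 10
    print(int(omega_d5[-1]), int(omega_d10[-1]))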

5.4. Interpolation of high-dimensional functions. We now investigate the application of tensor completion to discretizations of high-dimensional functions. The random tensors considered in the previous Section 5.3 were constructed to be of a certain well-defined rank. For discretizations of function data, we can, in general, only hope for a sufficiently fast decay of the singular values in each matricization, resulting in good approximability by a low-rank tensor. As the rank is not known a priori, we employ the rank-increasing strategy presented in Section 4.9, Algorithm 3, for all following experiments, both for RTTC and ALS.

5.4.1. Comparison to ADF. We compare the accuracy and timings of RTTC and the ALS procedure to the alternating directions fitting (ADF) scheme by Grasedyck et al. [17] for different values of the maximal rank. We reconstruct the setup of [17, Section 4.2] for RTTC and the ALS scheme to compare against the results stated therein. The original data tensor of dimension d = 8 and mode size n = 20 is constructed by

A(i1, i2, . . . , id) = 1 / √(i1^2 + i2^2 + · · · + id^2).

Both the sampling set Ω and the test set Γ are chosen to be of size |Ω| = |Γ| = 10 d n r^2.
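As a concrete illustration of this setup, the sampled entries of A can be generated directly from the formula above. The following sketch is not the test script used for Table 5.1; the random index generation (with possible duplicate indices) and the fixed rank r = 4 are assumptions made for brevity.

    import numpy as np

    d, n, r = 8, 20, 4                      # dimensions, mode size, maximal rank
    num_samples = 10 * d * n * r**2         # |Omega| = |Gamma| = 10*d*n*r^2

    rng = np.random.default_rng(0)
    # multi-indices (i_1, ..., i_d), 1-based as in the formula above
    Omega = rng.integers(1, n + 1, size=(num_samples, d))

    # A(i_1, ..., i_d) = 1 / sqrt(i_1^2 + ... + i_d^2), evaluated only on Omega
    vals = 1.0 / np.sqrt(np.sum(Omega.astype(float) ** 2, axis=1))
    print(Omega.shape, vals[:3])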

The results are shown in Table 5.1 for different values of the maximal rank r := maxµ rµ. We can see that the reconstruction quality is similar for all algorithms, but the ADF scheme, also implemented in Matlab, suffers from exceedingly long computation times. A large part of these timing differences may be attributed to the fact that it does not use optimized Mex routines for the sampling. Therefore, we focus mostly on the achieved reconstruction error and only include the timings as a reference.

For the ALS procedure, the quartic instead of quadratic scaling of the complexity with respect to the rank r becomes noticeable.

5.4.2. Comparison to HTOpt. We now compare the reconstruction performance to a Riemannian optimization scheme on the manifold of tensors in the hierarchical Tucker format, HTOpt [10, 11].


            ADF                        RTTC                       ALS
 r          εΓ(Xend)       time        εΓ(Xend)       time        εΓ(Xend)       time
 2          1.94 · 10^−2   14 s        6.42 · 10^−3   11.1 s      7.55 · 10^−3   5.90 s
 3          2.05 · 10^−3   1.3 min     5.60 · 10^−3   25.4 s      5.63 · 10^−3   19.1 s
 4          1.72 · 10^−3   24 min      1.73 · 10^−3   45.7 s      1.50 · 10^−3   46.5 s
 5          1.49 · 10^−3   39 min      1.53 · 10^−3   1.3 min     9.49 · 10^−4   1.7 min
 6          2.25 · 10^−3   1.7 h       7.95 · 10^−4   2.1 min     1.09 · 10^−3   3.3 min
 7          8.53 · 10^−4   2.6 h       3.49 · 10^−4   3.3 min     3.92 · 10^−4   6.4 min

Table 5.1: Comparison of the reconstruction accuracy of RTTC to an ADF completion algorithm [17] for different values of the maximal rank. The underlying data is a discretized function resulting in a tensor of dimension d = 8 and mode size n = 20, see Section 5.4.1. The results and timings for ADF are taken from [17].

             HTOpt                      ADF                        RTTC                       ALS
 |Ω|/n^d     εΓ(Xend)       time        εΓ(Xend)       time        εΓ(Xend)       time        εΓ(Xend)       time
 0.001       1.15 · 10^0    28.4 s      6.74 · 10^−2   53.0 s      9.17 · 10^−2   4.99 s      —              —
 0.005       1.67 · 10^0    28.8 s      6.07 · 10^−3   3.2 min     6.79 · 10^−3   5.11 s      2.19 · 10^−2   3.87 s
 0.01        1.99 · 10^0    29.7 s      3.08 · 10^−2   6.4 min     1.19 · 10^−3   5.72 s      1.59 · 10^−2   4.69 s
 0.05        3.68 · 10^−3   29.6 s      1.95 · 10^−5   18 min      2.13 · 10^−4   6.42 s      5.64 · 10^−4   6.94 s
 0.1         1.99 · 10^−3   29.3 s      6.50 · 10^−5   27 min      2.29 · 10^−5   7.58 s      2.09 · 10^−5   9.95 s

Table 5.2: Comparison of the reconstruction accuracy of RTTC to ALS, ADF and a Riemannian optimization scheme using the hierarchical Tucker format (HTOpt) for different sizes of the sampling set. The underlying data is a discretized function resulting in a tensor of dimension d = 4 and mode size n = 20, see Section 5.4.2.

This algorithm is conceptually very similar to RTTC and to the Riemannian tensor completion algorithm in the Tucker format, GeomCG [26], but uses a different tensor format and a Gauss-Newton instead of a conjugate gradient scheme. We use the most recent HTOpt version 1.1, which is available from the authors. This implementation is optimized for very high sampling rates, where up to 50% of the original data is given. Motivated by this setting, the involved matrices and tensors cannot be considered sparse anymore and are treated with dense linear algebra. On the other hand, this approach limits the applicability of the algorithm to relatively low-dimensional tensors, say d = 4, 5. To make a fair comparison, we have adapted the example synthetic.m in HTOpt, which also aims to reconstruct a tensor originating from the discretization of a high-dimensional function, with the parameters left at their default values.

As this code works with a different tensor format, prescribed tensor ranks are in general not directly comparable. To circumvent this, we have chosen d = 4, as in this case the TT-rank r = (1, r1, r2, r3, 1) directly corresponds to an HT dimension tree with internal rank r2 and leaf ranks r1 and r3. We discretize the function

f : [0, 1]^4 → R,   f(x) = exp(−‖x‖)

using n = 20 equally spaced discretization points on [0, 1] in each mode. The maximum rank is set to r = (1, 5, 5, 5, 1).

In Table 5.2 we present the results for different sizes of the sampling set |Ω| and the corresponding relative error on the test set Γ, with |Γ| = 100. Sampling and test sets are chosen identically for all algorithms. Since this example is not covered in [17], we now


use the reference implementation of ADF provided by the authors. We observe that the proposed RTTC is the fastest and yields better reconstructions than HTOpt. For sampling ratios 0.001, 0.005, 0.01, HTOpt fails to recover the original data within the specified maximum number of iterations, 250. For the smallest sampling ratio 0.001, the ALS procedure does not have enough samples available to solve the least squares problems. Altogether, RTTC, ALS and ADF perform very similarly in terms of reconstruction quality. We attribute the worse performance of HTOpt to its missing rank adaptivity. For high sampling ratios, the use of a rank adaptation scheme such as the one described in Section 4.9 is less important, as overfitting is very unlikely to occur.

5.4.3. Parameter-dependent PDE: cookie problem. As an example of a parameter-dependent PDE, we investigate the cookie problem [44, 4]: Consider the heat equation on the unit square D = [0, 1]^2 with d = m^2 disks Ds,t of radius ρ = 1/(4m + 2) and midpoints (ρ(4s − 1), ρ(4t − 1)) for s, t = 1, . . . , m. A graphical depiction of this setup is shown in Figure 5.4 for the cases m = 3 and m = 4. The heat conductivities of the different disks Ds,t are described by the coefficient vector p = (p1, . . . , pd), yielding the piecewise constant diffusion coefficient

a(x, p) := pµ if x ∈ Ds,t with µ = m(t − 1) + s,   and   a(x, p) := 1 otherwise,

with each pµ ∈ [1/2, 2]. The temperature u(x, p) at position x ∈ D for a certain choice of parameters p ∈ [1/2, 2]^d is then described by the diffusion equation

(a) 9 cookies (m = 3)   (b) 16 cookies (m = 4)

Fig. 5.4: Graphical depiction of the cookie problem for values m = 3 and m = 4, resulting in 9 and 16 cookies, respectively.

−div(a(x, p) ∇u(x, p)) = 1,   x ∈ D,
              u(x, p) = 0,   x ∈ ∂D.        (5.1)

Assume now that we are interested in the average temperature u(p) : [1/2, 2]^d → R over the domain D,

u(p) := ∫_{[0,1]^2} u(x, p) dx.
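To make the geometry of the coefficient concrete, the following sketch evaluates a(x, p) for a given point and parameter vector. It is purely illustrative: the actual experiments solve (5.1) with a finite element discretization, as described below, and the function name and example values are made up.

    import numpy as np

    def cookie_coefficient(x, p, m):
        """Piecewise constant diffusion coefficient a(x, p) of the cookie problem."""
        rho = 1.0 / (4 * m + 2)                       # disk radius
        for t in range(1, m + 1):
            for s in range(1, m + 1):
                center = np.array([rho * (4 * s - 1), rho * (4 * t - 1)])
                if np.linalg.norm(np.asarray(x, dtype=float) - center) <= rho:
                    mu = m * (t - 1) + s              # disk index, 1-based
                    return p[mu - 1]
        return 1.0                                    # background conductivity

    # example: the center of disk D_{1,2} for m = 3 lies inside the fourth cookie
    p = np.full(9, 1.5)
    print(cookie_coefficient([3 / 14, 7 / 14], p, m=3))   # prints 1.5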

Following the setup of [4], we discretize the parameter space [1/2, 2]^d by spectral collocation using n = 10 Chebyshev points for each mode µ = 1, . . . , d. Hence, calculating the average of the solution of (5.1) for each collocation point in the parameter space would result in a solution tensor A of size 10^d, that is, 10^9 entries for m = 3 and 10^16 entries for m = 4.


(a) 9 cookies (m = 3)

             TT-Cross                   HT-Cross (from [4])        RTTC
 Tol.        εΓ(X)          Eval.       εΓ(X)          Eval.       εΓ(X)          |Ω|
 10^−3       8.35 · 10^−4    11681      3.27 · 10^−4    1548       9.93 · 10^−5    1548
 10^−4       2.21 · 10^−5    14631      1.03 · 10^−4    2784       8.30 · 10^−6    2784
 10^−5       1.05 · 10^−5    36291      1.48 · 10^−5    3224       6.26 · 10^−6    3224
 10^−6       1.00 · 10^−6    42561      2.74 · 10^−6    5338       6.50 · 10^−7    5338
 10^−7       1.31 · 10^−7    77731      1.88 · 10^−7    9475       1.64 · 10^−7    9475

(b) 16 cookies (m = 4)

             TT-Cross                   HT-Cross (from [4])        RTTC
 Tol.        εΓ(X)          Eval.       εΓ(X)          Eval.       εΓ(X)          |Ω|
 10^−3       8.17 · 10^−4    22951      3.98 · 10^−4    2959       2.84 · 10^−4    2959
 10^−4       3.93 · 10^−5    68121      2.81 · 10^−4    5261       2.10 · 10^−5    5261
 10^−5       9.97 · 10^−6    79961      1.27 · 10^−5    8320       1.07 · 10^−5    8320
 10^−6       1.89 · 10^−6   216391      3.75 · 10^−6   12736       1.89 · 10^−6   12736
 10^−7       —              —           3.12 · 10^−7   26010       7.12 · 10^−7   26010

Table 5.3: Reconstruction results for the cookie problem using cross-approximation and tensor completion. The last entry in Table (b) for TT-Cross has been omitted due to the exceedingly high number of function evaluations. The results for HT-Cross have been taken from the corresponding paper [4].

As every entry requires a PDE solve, this represents an infeasible amount of work, and we instead try to approximate the solution tensor by a low-rank approximation X.

Cross-approximation techniques are a common tool to tackle such problems. In Table 5.3 we compare the results obtained by a recent algorithm in the hierarchical Tucker format [4], denoted by HT-Cross, to the most recent cross-approximation technique in the TT format [39, 14], denoted by TT-Cross, namely the routine amen_cross from the TT-Toolbox [35]. The results for HT-Cross are taken from the corresponding paper [4]. The specified tolerance is shown along with the relative reconstruction error on a test set Γ of size 100 and the number of sampling points. Both algorithms are able to construct a low-rank approximation X up to the prescribed accuracy, while TT-Cross needs about 10 times more sampling points. This could be due to an additional search along the modes of the solution tensor A, as we have a mode size of only n = 10 in our example. At each sampling point, the PDE (5.1) is solved for the corresponding parameter set p = (p1, . . . , pd) using the FEniCS [29] finite element package, version 1.4.0.

To compare the performance of these two cross-approximation algorithms to our tensor completion approach, we now try to complete A using the same number of sampling points as the HT-Cross algorithm, but uniformly distributed. The test set Γ is the same as for TT-Cross. RTTC is able to recover A with similar accuracy.

We observe that for this problem, an adaptive choice of sampling points does not seem to perform better than a random sampling approach using tensor completion. The tensor completion approach has the added advantage that all sampling points are determined a priori.


Thus, the expensive evaluation of the sampling points can easily be distributed to multiple computers in an embarrassingly parallel way before starting the completion procedure. In the adaptive cross-approximation, the necessary sampling points are only determined at run time, preventing an effective parallelization.
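A minimal sketch of this a priori, embarrassingly parallel evaluation is given below; the solver routine is only a placeholder for the actual PDE solve, and the worker count and parameter dimension are arbitrary choices for the example.

    from concurrent.futures import ProcessPoolExecutor
    import numpy as np

    def solve_pde_average(p):
        """Placeholder for one PDE solve returning the average temperature for parameters p."""
        return float(np.mean(p))   # dummy value so the sketch runs without a FEM package

    def evaluate_samples(parameter_list, max_workers=4):
        """Evaluate all sampling points independently; results keep the input order."""
        with ProcessPoolExecutor(max_workers=max_workers) as pool:
            return list(pool.map(solve_pde_average, parameter_list))

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        samples = [rng.uniform(0.5, 2.0, size=9) for _ in range(8)]   # 9 parameters per point (m = 3)
        print(evaluate_samples(samples))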

6. Conclusion. In this paper, we have derived a tensor completion algorithm called RTTC which performs Riemannian optimization on the manifold of TT tensors of fixed rank. We have shown that our algorithm is very competitive and significantly faster when compared to other existing tensor completion algorithms. For function-related tensors, a rank-increasing scheme was shown to be essential for recovering the original data from very few samples. In our test case, tensor completion with randomly chosen samples yielded reconstruction errors similar to an adaptive cross-approximation approach with the same number of samples.

Theoretical lower bounds for the recoverability of low-rank tensors from few samples remain a challenging open question.

Acknowledgments. The author would like to thank Jonas Ballani, Daniel Kressner and Bart Vandereycken for helpful discussions on this paper, and Melanie Kluge and Sebastian Kramer for providing an implementation of the ADF algorithm.

REFERENCES

[1] P.-A. Absil, R. Mahony, and R. Sepulchre, Optimization algorithms on matrix manifolds, Princeton University Press, Princeton, NJ, 2008.
[2] P.-A. Absil and J. Malick, Projection-like retractions on matrix manifolds, SIAM J. Control Optim., 22 (2012), pp. 135–158.
[3] P.-A. Absil and I. Oseledets, Low-rank retractions: a survey and new results, Computational Optimization and Applications, (2014).
[4] J. Ballani and L. Grasedyck, Hierarchical tensor approximation of output quantities of parameter-dependent PDEs, tech. report, ANCHP, EPF Lausanne, Switzerland, March 2014.
[5] J. Ballani, L. Grasedyck, and M. Kluge, Black box approximation of tensors in hierarchical Tucker format, Linear Algebra Appl., 438 (2013), pp. 639–657.
[6] M. Bebendorf, Approximation of boundary element matrices, Numer. Math., 86 (2000), pp. 565–589.
[7] M. Bebendorf and S. Rjasanow, Adaptive low-rank approximation of collocation matrices, Computing, 70 (2003), pp. 1–24.
[8] S. Borm and L. Grasedyck, Hybrid cross approximation of integral operators, Numer. Math., 101 (2005), pp. 221–249.
[9] E. J. Candes and T. Tao, The power of convex relaxation: Near-optimal matrix completion, IEEE Trans. Inform. Theory, 56 (2009), pp. 2053–2080.
[10] C. Da Silva and F. J. Herrmann, Hierarchical Tucker tensor optimization – applications to tensor completion, SAMPTA, 2013.
[11] C. Da Silva and F. J. Herrmann, Hierarchical Tucker tensor optimization – applications to tensor completion, Linear Algebra Appl., 481 (2015), pp. 131–173.
[12] S. V. Dolgov and I. V. Oseledets, Solution of linear systems and matrix inversion in the TT-format, SIAM J. Sci. Comput., 34 (2012), pp. A2718–A2739.
[13] S. V. Dolgov and D. V. Savostyanov, Alternating minimal energy methods for linear systems in higher dimensions, SIAM J. Sci. Comput., 36 (2014), pp. A2248–A2271.
[14] S. V. Dolgov and D. V. Savostyanov, Alternating minimal energy methods for linear systems in higher dimensions, SIAM J. Sci. Comput., 36 (2014), pp. A2248–A2271.
[15] S. Gandy, B. Recht, and I. Yamada, Tensor completion and low-n-rank tensor recovery via convex optimization, Inverse Problems, 27 (2011), p. 025010.
[16] S. A. Goreinov, E. E. Tyrtyshnikov, and N. L. Zamarashkin, A theory of pseudoskeleton approximations, Linear Algebra Appl., 261 (1997), pp. 1–21.
[17] L. Grasedyck, M. Kluge, and S. Kramer, Alternating directions fitting (ADF) of hierarchical low rank tensors, DFG SPP 1324 Preprint 149, 2013.
[18] L. Grasedyck, M. Kluge, and S. Kramer, Personal communication, 2015.
[19] L. Grasedyck, D. Kressner, and C. Tobler, A literature survey of low-rank tensor approximation techniques, GAMM-Mitt., 36 (2013), pp. 53–78.
[20] L. Grasedyck, R. Kriemann, C. Lobbert, A. Nagel, G. Wittum, and K. Xylouris, Parallel tensor sampling in the hierarchical Tucker format, Comput. Vis. Sci., 17 (2015), pp. 67–78.
[21] S. Holtz, T. Rohwedder, and R. Schneider, The alternating linear scheme for tensor optimization in the tensor train format, SIAM J. Sci. Comput., 34 (2012), pp. A683–A713.
[22] S. Holtz, T. Rohwedder, and R. Schneider, On manifolds of tensors of fixed TT-rank, Numer. Math., 120 (2012), pp. 701–731.
[23] R. H. Keshavan, A. Montanari, and S. Oh, Matrix completion from noisy entries, JMLR, 11 (2010), pp. 2057–2078.
[24] B. N. Khoromskij, I. V. Oseledets, and R. Schneider, Efficient time-stepping scheme for dynamics on TT-manifolds, Tech. Report 24, MPI MIS Leipzig, 2012.
[25] D. Kressner, M. Steinlechner, and A. Uschmajew, Low-rank tensor methods with subspace correction for symmetric eigenvalue problems, SIAM J. Sci. Comput., 36 (2014), pp. A2346–A2368.
[26] D. Kressner, M. Steinlechner, and B. Vandereycken, Low-rank tensor completion by Riemannian optimization, BIT, 54 (2014), pp. 447–468.
[27] J. Liu, P. Musialski, P. Wonka, and J. Ye, Tensor completion for estimating missing values in visual data, in Computer Vision, 2009 IEEE 12th International Conference on, 2009, pp. 2114–2121.
[28] Y. Liu and F. Shang, An efficient matrix factorization method for tensor completion, IEEE Signal Processing Letters, 20 (2013), pp. 307–310.
[29] A. Logg, K.-A. Mardal, and G. N. Wells, Automated Solution of Differential Equations by the Finite Element Method, vol. 84 of Lecture Notes in Computational Science and Engineering, Springer-Verlag, Berlin, 2012.
[30] C. Lubich, I. Oseledets, and B. Vandereycken, Time integration of tensor trains, arXiv preprint 1407.2042, 2014.
[31] Y. Ma, J. Wright, A. Ganesh, Z. Zhou, K. Min, S. Rao, Z. Lin, Y. Peng, M. Chen, L. Wu, E. Candes, and X. Li, Low-rank matrix recovery and completion via convex optimization. Survey website, http://perception.csl.illinois.edu/matrix-rank/. Accessed: 22 April 2013.
[32] C. Mu, B. Huang, J. Wright, and D. Goldfarb, Square deal: Lower bounds and improved relaxations for tensor recovery, in Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014, pp. 73–81.
[33] J. Nocedal and S. J. Wright, Numerical Optimization, Springer Series in Operations Research, Springer, 2nd ed., 2006.
[34] I. V. Oseledets, Tensor-train decomposition, SIAM J. Sci. Comput., 33 (2011), pp. 2295–2317.
[35] I. V. Oseledets, S. Dolgov, D. Savostyanov, and V. Kazeev, TT-Toolbox Version 2.3, 2014. Available at https://github.com/oseledets/TT-Toolbox.
[36] I. V. Oseledets and E. E. Tyrtyshnikov, TT-cross approximation for multidimensional arrays, Linear Algebra Appl., 432 (2010), pp. 70–88.
[37] H. Rauhut, R. Schneider, and Z. Stojanac, Tensor completion in hierarchical tensor representations, arXiv preprint 1404.3905, 2014.
[38] T. Rohwedder and A. Uschmajew, On local convergence of alternating schemes for optimization of convex problems in the tensor train format, SIAM J. Numer. Anal., 51 (2013), pp. 1134–1162.
[39] D. V. Savostyanov and I. V. Oseledets, Fast adaptive interpolation of multi-dimensional arrays in tensor train format, in Proceedings of 7th International Workshop on Multidimensional Systems (nDS), IEEE, 2011.
[40] R. Schneider and A. Uschmajew, Convergence results for projected line-search methods on varieties of low-rank matrices via Lojasiewicz inequality, SIAM J. Optim., 25 (2015), pp. 622–646.
[41] M. Signoretto, L. De Lathauwer, and J. A. K. Suykens, Nuclear norms for tensors and their use for convex multilinear estimation, Tech. Report 10-186, K. U. Leuven, 2010.
[42] M. Signoretto, Q. Tran Dinh, L. De Lathauwer, and J. A. K. Suykens, Learning with tensors: a framework based on convex optimization and spectral regularization, Machine Learning, 94 (2014), pp. 303–351.
[43] M. Signoretto, R. Van de Plas, B. De Moor, and J. A. K. Suykens, Tensor versus matrix completion: A comparison with application to spectral data, IEEE Signal Processing Letters, 18 (2011), pp. 403–406.
[44] C. Tobler, Low-rank Tensor Methods for Linear Systems and Eigenvalue Problems, PhD thesis, ETH Zurich, Switzerland, 2012.
[45] E. E. Tyrtyshnikov, Incomplete cross approximation in the mosaic-skeleton method, Computing, 64 (2000), pp. 367–380. International GAMM-Workshop on Multigrid Methods (Bonn, 1998).
[46] A. Uschmajew and B. Vandereycken, The geometry of algorithms using hierarchical tensors, Linear Algebra Appl., 439 (2013), pp. 133–166.
[47] A. Uschmajew and B. Vandereycken, Greedy rank updates combined with Riemannian descent methods for low-rank optimization, in Sampling Theory and Applications (SampTA), 2015 International Conference on, IEEE, 2015, pp. 420–424.
[48] B. Vandereycken, Low-rank matrix completion by Riemannian optimization, SIAM J. Optim., 23 (2013), pp. 1214–1236.