Tensor Decomposition and Its Applications

This is an introductory slide deck on tensor decompositions and their applications in data mining.


Applications of tensor (multiway array) factorizations and decompositions in data mining

Machine Learning Group reading seminar, 11/10/25



Mørup, M. (2011), Applications of tensor (multiway array) factorizations and decompositions in data mining. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1: 24–40. doi: 10.1002/widm.1


Table of Contents

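1. Overall introduction
2. Second-order tensors and matrix decomposition
3. SVD: Singular Value Decomposition
4. The paper's introduction and notation
5. Introduction to the Tucker and CP models
6. Applications and summary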


• (According to Wikipedia) a tensor is a generalization of linear quantities and geometric concepts → roughly speaking, it corresponds to a multidimensional array.




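Second-Order Tensors (1)

• Definition: a function T that maps any two vectors to a real number and is bilinear (linear in each argument, for arbitrary vectors and any scalar k) is called a second-order tensor.

• The inner product of two vectors is such a function T. In three-dimensional space with an orthonormal basis, the tensor is described by its values on pairs of basis vectors.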

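Second-Order Tensors (2)

• For arbitrary vectors u, v, the value of T is built from the components corresponding to each combination {i, j}.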




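Second-Order Tensors (3)

• The components of the vector obtained by applying a linear transformation T to an arbitrary vector can be expressed with the matrix representation of T (an ordinary matrix product).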
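The three slides above can be illustrated with a minimal sketch, assuming the tensor is stored as its 3×3 component matrix, so that T(u, v) = Σ_ij M[i][j]·u_i·v_j. The names `bilinear`, `apply_linear`, and `M` are illustrative and do not come from the slides.

```python
# Minimal sketch: a second-order tensor stored as its component matrix M,
# where M[i][j] is assumed to be the value of T on basis vectors e_i, e_j.

def bilinear(M, u, v):
    """Return T(u, v) = sum over i, j of M[i][j] * u[i] * v[j]."""
    return sum(M[i][j] * u[i] * v[j]
               for i in range(len(u)) for j in range(len(v)))

def apply_linear(M, u):
    """Return the vector with components sum_j M[i][j] * u[j]."""
    return [sum(row[j] * u[j] for j in range(len(u))) for row in M]

# For the inner product in an orthonormal basis, M is the identity matrix.
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
u, v, w, k = [1, 2, 3], [4, 5, 6], [7, 8, 9], 2

inner = bilinear(identity, u, v)                      # ordinary dot product
u_plus_w = [a + b for a, b in zip(u, w)]
lhs = bilinear(identity, u_plus_w, v)                 # T(u + w, v)
rhs = bilinear(identity, u, v) + bilinear(identity, w, v)
scaled = bilinear(identity, [k * a for a in u], v)    # T(k u, v)
```

Here `lhs == rhs` and `scaled == k * inner` are the two halves of linearity in the first argument; the second argument behaves the same way.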




Decomposition of Second-Order Tensors (1)

The world of data    The world of linear algebra




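Decomposition of Second-Order Tensors (2)

• Because a second-order tensor is represented by a matrix, decomposing a second-order tensor is equivalent to matrix decomposition. We therefore look at the singular value decomposition (SVD) as an example.

• SVD is one of the basic matrix decompositions and is used in LSI (Latent Semantic Indexing). LSI uses a matrix that represents the occurrence of words in each document.
• When word i occurs in document j, the (i, j) element of the matrix records that information.

• SVD transforms this matrix into a relation between terms and some set of concepts, and a relation between those concepts and the documents.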
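As a toy illustration of the word-by-document matrix used in LSI, the sketch below counts occurrences. The documents are made up, and using raw counts as "that information" is an assumption; the slides do not say which weighting is meant.

```python
# Toy term-document matrix; the three documents are invented for illustration.
docs = ["tensor decomposition", "matrix decomposition", "tensor tensor data"]

# Rows are words (sorted alphabetically), columns are documents.
vocab = sorted({word for doc in docs for word in doc.split()})

# X[i][j] = number of times word i occurs in document j (raw counts assumed).
X = [[doc.split().count(word) for doc in docs] for word in vocab]
```

An SVD of `X` would then relate the rows (words) and the columns (documents) through a shared set of latent concepts.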


SVD: Singular Value Decomposition

• The idea behind SVD is as follows. As an example, consider a matrix X that describes the occurrence of words in documents.











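SVD: Singular Value Decomposition

• SVD decomposes an (m, n) matrix A as $A = U \Sigma V^{\top}$, where r is the rank of A.

• U: an (m, r) matrix whose columns are the eigenvectors of $A A^{\top}$ with nonzero eigenvalues (the left singular vectors).

• V: an (n, r) matrix whose columns are the eigenvectors of $A^{\top} A$ with nonzero eigenvalues (the right singular vectors).

• Σ: an (r, r) diagonal matrix of singular values, which are the square roots of the r nonzero eigenvalues of $A^{\top} A$, in descending order.

• Keeping only the largest singular values discards the less important dimensions. The resulting truncated product of U, Σ, and $V^{\top}$ (not Σ alone) is the best low-rank approximation of the original matrix A in the least-squares sense.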

octave-3.4.0:4> A = [1 2 3; 4 5 6; 7 8 9];
octave-3.4.0:5> [U, S, V] = svd(A);   % A = U*S*V'; S holds the singular values
octave-3.4.0:6> U
  -0.21484   0.88723   0.40825
  -0.52059   0.24964  -0.81650
  -0.82634  -0.38794   0.40825
octave-3.4.0:7> S
   1.6848e+01            0            0
            0   1.0684e+00            0
            0            0   1.4728e-16
octave-3.4.0:8> V
  -0.479671  -0.776691   0.408248
  -0.572368  -0.075686  -0.816497
  -0.665064   0.625318   0.408248
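As a worked check of the output above: the third value on the diagonal (1.4728e-16) is zero up to rounding, so this A has rank 2 and the first two terms of the expansion already reproduce it. The symbols $\sigma_r$, $u_r$, $v_r$ are assumed names for the diagonal entries and for the columns of the first and last printed matrices.

```latex
A = \sum_{r} \sigma_r\, u_r v_r^{\top}
  \approx \sigma_1 u_1 v_1^{\top} + \sigma_2 u_2 v_2^{\top},
\qquad
a_{11} \approx 16.848 \cdot (-0.21484)(-0.479671)
             + 1.0684 \cdot (0.88723)(-0.776691)
       \approx 1.7362 - 0.7362 = 1.000 .
```

The same two-term sum gives $a_{12} \approx 2.0718 - 0.0717 = 2.000$, matching the original matrix.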



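• (Wikipedia) Roughly speaking, a tensor corresponds to a multidimensional array. A second-order tensor (a matrix) is a two-dimensional array, and a third-order tensor is a three-dimensional array.

• As in the second-order example, suppose that various kinds of data are collected as arrays with three or more dimensions. As with SVD, we then want to decompose such a tensor into several components and interpret them.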


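Tensor Notation (1)

• An Nth-order real tensor is written $\mathcal{X}^{I_1 \times I_2 \times \cdots \times I_N}$ and its elements $x_{i_1, i_2, \ldots, i_N}$. As a simple example, consider third-order tensors $\mathcal{A}$ and $\mathcal{B}$ of the same size, and let α be a real number.
• Scalar multiplication: $(\alpha \mathcal{A})_{i,j,k} = \alpha\, a_{i,j,k}$
• Tensor addition: $(\mathcal{A} + \mathcal{B})_{i,j,k} = a_{i,j,k} + b_{i,j,k}$
• Tensor inner product: $\langle \mathcal{A}, \mathcal{B} \rangle = \sum_{i,j,k} a_{i,j,k}\, b_{i,j,k}$
• Frobenius norm: $\|\mathcal{A}\|_F = \sqrt{\langle \mathcal{A}, \mathcal{A} \rangle}$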
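The four operations listed above can be sketched for third-order tensors stored as nested lists `X[i][j][k]`. The function names are illustrative assumptions, not notation from the paper.

```python
import math

def scale(alpha, X):
    """Scalar multiplication: multiply every element by alpha."""
    return [[[alpha * x for x in row] for row in mat] for mat in X]

def add(X, Y):
    """Elementwise sum of two tensors of the same size."""
    return [[[x + y for x, y in zip(rx, ry)] for rx, ry in zip(mx, my)]
            for mx, my in zip(X, Y)]

def inner(X, Y):
    """Inner product: sum over i, j, k of x_ijk * y_ijk."""
    return sum(x * y
               for mx, my in zip(X, Y)
               for rx, ry in zip(mx, my)
               for x, y in zip(rx, ry))

def frobenius(X):
    """Frobenius norm: square root of the inner product of X with itself."""
    return math.sqrt(inner(X, X))

X = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
Y = [[[1, 0], [0, 1]], [[1, 0], [0, 1]]]
```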


Illustration of Matricization

Overview wires.wiley.com/widm

FIGURE 1 | The matricizing operation on a third-order tensor of size 4 × 4 × 4.
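A minimal sketch of the matricizing operation, assuming the tensor is a nested list and that the columns run over the remaining modes with the earliest remaining mode varying fastest. That ordering is an assumption chosen to agree with the Kronecker-product index convention in Table 1; other orderings are also in use. The helper names `get` and `unfold` are illustrative.

```python
from itertools import product

def get(X, idx):
    """Read the element of a nested list X at the index tuple idx."""
    for i in idx:
        X = X[i]
    return X

def unfold(X, shape, n):
    """Mode-n matricization (n is 0-based): rows index mode n, columns run
    over the other modes with the earliest remaining mode varying fastest."""
    other = [m for m in range(len(shape)) if m != n]
    rows = []
    for i_n in range(shape[n]):
        row = []
        # product() varies its last argument fastest, so pass modes reversed.
        for rev in product(*(range(shape[m]) for m in reversed(other))):
            idx = [0] * len(shape)
            idx[n] = i_n
            for m, i in zip(reversed(other), rev):
                idx[m] = i
            row.append(get(X, idx))
        rows.append(row)
    return rows

shape = (2, 2, 2)
X = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]   # X[i][j][k]
X1 = unfold(X, shape, 0)   # 2 x 4 matrix, mode 1 along the rows
X2 = unfold(X, shape, 1)   # 2 x 4 matrix, mode 2 along the rows
```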

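whereas the Khatri–Rao product is defined as a column-wise Kronecker product

$$\mathbf{A}^{I \times J} \,|\otimes|\, \mathbf{B}^{K \times J} = \mathbf{A}^{I \times J} \odot \mathbf{B}^{K \times J} = \mathbf{C}^{IK \times J}, \quad \text{such that} \quad c_{k+K(i-1),\,j} = a_{ij}\, b_{kj}. \quad (10)$$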

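An important property when calculating the Moore–Penrose inverse (i.e., $\mathbf{A}^{\dagger} = (\mathbf{A}^{\top}\mathbf{A})^{-1}\mathbf{A}^{\top}$) of Kronecker and Khatri–Rao products are

$$(\mathbf{P} \otimes \mathbf{Q})^{\dagger} = (\mathbf{P}^{\dagger} \otimes \mathbf{Q}^{\dagger}) \quad (11)$$

$$(\mathbf{A} \odot \mathbf{B})^{\dagger} = [(\mathbf{A}^{\top}\mathbf{A}) \ast (\mathbf{B}^{\top}\mathbf{B})]^{-1} (\mathbf{A} \odot \mathbf{B})^{\top} \quad (12)$$

where $\ast$ denotes elementwise multiplication. This reduces the complexity from $O(J^3 L^3)$ to $O(\max\{IJ^2, KJ^2, J^3, L^3\})$ and $O(IKJ^2)$ to $O(\max\{IKJ, IJ^2, KJ^2, J^3\})$, respectively. For additional properties of these matrix products see also Ref 28. In Table 1, a summary of the operators described above can be found.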

THE TUCKER AND CANDECOMP/PARAFAC MODELS

The two most widely used tensor decomposition methods are the Tucker model [29] and Canonical Decomposition (CANDECOMP) [30], also known as Parallel Factor Analysis (PARAFAC) [31], jointly abbreviated CP. In the following section, we describe the models for

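TABLE 1 Summary of the Utilized Variables and Operations. $\mathcal{X}$, $\mathbf{X}$, $\mathbf{x}$, and $x$ are Used to Denote Tensors, Matrices, Vectors, and Scalars Respectively.

Operator | Name | Operation
$\langle \mathcal{A}, \mathcal{B} \rangle$ | Inner product | $\langle \mathcal{A}, \mathcal{B} \rangle = \sum_{i,j,k} a_{i,j,k}\, b_{i,j,k}$
$\|\mathcal{A}\|_F$ | Frobenius norm | $\|\mathcal{A}\|_F = \sqrt{\langle \mathcal{A}, \mathcal{A} \rangle}$
$\mathbf{X}_{(n)}$ | Matricizing | $\mathcal{X}^{I_1 \times I_2 \times \cdots \times I_N} \rightarrow \mathbf{X}_{(n)}^{I_n \times I_1 \cdot I_2 \cdots I_{n-1} \cdot I_{n+1} \cdots I_N}$
$\times_n$ or $\bullet_n$ | n-mode product | $\mathcal{X} \times_n \mathbf{M} = \mathcal{Z}$ where $\mathbf{Z}_{(n)} = \mathbf{M}\mathbf{X}_{(n)}$
$\circ$ | Outer product | $\mathbf{a} \circ \mathbf{b} = \mathbf{Z}$ where $z_{i,j} = a_i b_j$
$\otimes$ | Kronecker product | $\mathbf{A} \otimes \mathbf{B} = \mathbf{Z}$ where $z_{k+K(i-1),\,l+L(j-1)} = a_{ij}\, b_{kl}$
$\odot$ or $|\otimes|$ | Khatri–Rao product | $\mathbf{A} \odot \mathbf{B} = \mathbf{Z}$ where $z_{k+K(i-1),\,j} = a_{ij}\, b_{kj}$
$k_{\mathbf{A}}$ | k-rank | Maximal number of columns of $\mathbf{A}$ guaranteed to be linearly independent.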

26 Volume 1, January/February 2011 © 2011 John Wiley & Sons, Inc.





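Tensor Notation (2)

• n-mode product: the n-mode product of a tensor $\mathcal{X}$ and a matrix $\mathbf{M}$ is written $\mathcal{X} \times_n \mathbf{M}$. It is defined by $\mathcal{X} \times_n \mathbf{M} = \mathcal{Z}$ where $\mathbf{Z}_{(n)} = \mathbf{M}\mathbf{X}_{(n)}$.

Using the matricization operator, the SVD of a matrix, $\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{\top}$, can be written with n-mode products as $\mathbf{A} = \boldsymbol{\Sigma} \times_1 \mathbf{U} \times_2 \mathbf{V}$.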
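A minimal sketch of the n-mode product for third-order tensors stored as nested lists, written elementwise rather than through matricization. The function name `mode_n_product` and the 0-based mode index are assumptions made for this illustration.

```python
def mode_n_product(X, M, n):
    """n-mode product of a third-order tensor X[i][j][k] with a matrix M.

    Mode n (0, 1, or 2) of X is contracted with the columns of M, so the
    size of that mode changes from len(M[0]) to len(M).
    """
    dims = [len(X), len(X[0]), len(X[0][0])]
    out = dims[:]
    out[n] = len(M)
    Z = [[[0] * out[2] for _ in range(out[1])] for _ in range(out[0])]
    for i in range(out[0]):
        for j in range(out[1]):
            for k in range(out[2]):
                idx = [i, j, k]
                r = idx[n]                 # row of M selected by the output index
                total = 0
                for t in range(dims[n]):   # sum over the contracted mode
                    idx[n] = t
                    total += M[r][t] * X[idx[0]][idx[1]][idx[2]]
                Z[i][j][k] = total
    return Z

# A matrix stored as a tensor with a singleton third mode: U * S * V^T
# computed as S x_1 U x_2 V (modes 0 and 1 in 0-based indexing).
S = [[[2], [0]], [[0], [1]]]
U = [[0, 1], [1, 0]]
V = [[0, 1], [1, 0]]
A = mode_n_product(mode_n_product(S, U, 0), V, 1)
```

With these toy values the ordinary matrix product U·S·Vᵀ is [[1, 0], [0, 2]], which is what `A` holds in its first two modes.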


Tensor Notation (3)

• Khatri-Rao product (column-wise Kronecker product)
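The two products can be sketched for matrices stored as nested lists, following the index conventions of Table 1 shifted to 0-based indexing. The function names `kron` and `khatri_rao` are illustrative.

```python
def kron(A, B):
    """Kronecker product: Z[k + K*i][l + L*j] = A[i][j] * B[k][l]."""
    K, L = len(B), len(B[0])
    return [[A[i][j] * B[k][l] for j in range(len(A[0])) for l in range(L)]
            for i in range(len(A)) for k in range(K)]

def khatri_rao(A, B):
    """Column-wise Kronecker product: Z[k + K*i][j] = A[i][j] * B[k][j]."""
    if len(A[0]) != len(B[0]):
        raise ValueError("A and B need the same number of columns")
    return [[A[i][j] * B[k][j] for j in range(len(A[0]))]
            for i in range(len(A)) for k in range(len(B))]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
Z_kr = khatri_rao(A, B)   # 4 x 2
Z_kron = kron(A, B)       # 4 x 4
```

Each column of `Z_kr` is the Kronecker product of the matching columns of `A` and `B`, which is why it has only J columns instead of J·J.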




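• The Tucker model and the CP model are widely used tensor decomposition methods. The paper describes them for third-order tensors.

Tucker model    CP model


Tucker Model (1)

• The Tucker model decomposes a third-order tensor into a core array and loading matrices for the three modes.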

Definition via n-mode products

WIREs Data Mining and Knowledge Discovery    Applications of tensor (multiway array) factorizations and decompositions in data mining

FIGURE 2 | Illustration of the Tucker model of a third-order tensor $\mathcal{X}$. The model decomposes the tensor into loading matrices with a mode-specific number of components as well as a core array accounting for all multilinear interactions between the components of each mode. The Tucker model is particularly useful for compressing tensors into a reduced representation given by the smaller core array $\mathcal{G}$.

a third-order tensor but they trivially generalize to general Nth order arrays by introducing additional mode-specific loadings.

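Tucker Model

The Tucker model proposed in Ref 29 reads for a third-order tensor $\mathcal{X}^{I \times J \times K}$

$$\mathcal{X}^{I \times J \times K} \approx \sum_{lmn} g_{l,m,n}\, \mathbf{a}_l^{I} \circ \mathbf{b}_m^{J} \circ \mathbf{c}_n^{K}, \quad \text{such that} \quad x_{i,j,k} \approx \sum_{lmn} g_{l,m,n}\, a_{i,l}\, b_{j,m}\, c_{k,n},$$

where the so-called core array $\mathcal{G}^{L \times M \times N}$ with elements $g_{l,m,n}$ accounts for all possible linear interactions between the components of each mode. To indicate how many vectors pertain to each modality, it is customary also to denote the model a Tucker(L, M, N) model. Using the n-mode tensor product $\times_n$ [29,32], the model can be written as

$$\mathcal{X}^{I \times J \times K} \approx \mathcal{G}^{L \times M \times N} \times_1 \mathbf{A}^{I \times L} \times_2 \mathbf{B}^{J \times M} \times_3 \mathbf{C}^{K \times N}.$$

Each mode of the array is approximately spanned by given loading matrices for that mode such that the vectors of each modality interact with the vectors of all remaining modalities with strengths given by the core tensor $\mathcal{G}$, see also Figure 2.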
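The elementwise form of the Tucker model can be sketched directly as a triple sum. The function name `tucker_reconstruct` and the toy Tucker(1, 2, 1) factors below are made up for illustration; this only rebuilds a tensor from given factors and does not estimate them.

```python
def tucker_reconstruct(G, A, B, C):
    """x[i][j][k] = sum over l, m, n of G[l][m][n] * A[i][l] * B[j][m] * C[k][n]."""
    L, M, N = len(G), len(G[0]), len(G[0][0])
    return [[[sum(G[l][m][n] * A[i][l] * B[j][m] * C[k][n]
                  for l in range(L) for m in range(M) for n in range(N))
              for k in range(len(C))]
             for j in range(len(B))]
            for i in range(len(A))]

# Toy Tucker(1, 2, 1) model with invented numbers.
G = [[[2], [1]]]               # core array, size 1 x 2 x 1
A = [[1], [2]]                 # I x L = 2 x 1
B = [[1, 0], [0, 1], [1, 1]]   # J x M = 3 x 2
C = [[1], [3]]                 # K x N = 2 x 1
X = tucker_reconstruct(G, A, B, C)   # size 2 x 3 x 2
```

The 12 entries of `X` are described by the 2 core entries plus the loadings, which is the compression idea mentioned in the Figure 2 caption.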

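The Tucker model is not unique. As such, multiplying by invertible matrices $\mathbf{Q}^{L \times L}$, $\mathbf{R}^{M \times M}$, and $\mathbf{S}^{N \times N}$ gives an equivalent representation, i.e.,

$$\mathcal{X} \approx (\mathcal{G} \times_1 \mathbf{Q} \times_2 \mathbf{R} \times_3 \mathbf{S}) \times_1 (\mathbf{A}\mathbf{Q}^{-1}) \times_2 (\mathbf{B}\mathbf{R}^{-1}) \times_3 (\mathbf{C}\mathbf{S}^{-1}) = \tilde{\mathcal{G}} \times_1 \tilde{\mathbf{A}} \times_2 \tilde{\mathbf{B}} \times_3 \tilde{\mathbf{C}}.$$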

As a result, the factors of the unconstrained Tucker model can be constrained orthogonal or orthonormal (which is useful for compression) without hampering the reconstruction error. However, imposing orthogonality/orthonormality does not resolve the lack of uniqueness as the solution is still ambiguous to multiplication by orthogonal/orthonormal matrices $\mathbf{Q}$, $\mathbf{R}$, and $\mathbf{S}$. Using the n-mode matricizing and Kronecker product operation, the Tucker model can be written as

$$\mathbf{X}_{(1)} \approx \mathbf{A}\mathbf{G}_{(1)}(\mathbf{C} \otimes \mathbf{B})^{\top}$$

$$\mathbf{X}_{(2)} \approx \mathbf{B}\mathbf{G}_{(2)}(\mathbf{C} \otimes \mathbf{A})^{\top}$$

$$\mathbf{X}_{(3)} \approx \mathbf{C}\mathbf{G}_{(3)}(\mathbf{B} \otimes \mathbf{A})^{\top}.$$

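The above decomposition for a third-order tensor is also denoted a Tucker3 model; the Tucker2 and Tucker1 models are given by

$$\text{Tucker2:} \quad \mathcal{X} \approx \mathcal{G} \times_1 \mathbf{A} \times_2 \mathbf{B} \times_3 \mathbf{I},$$

$$\text{Tucker1:} \quad \mathcal{X} \approx \mathcal{G} \times_1 \mathbf{A} \times_2 \mathbf{I} \times_3 \mathbf{I},$$

where $\mathbf{I}$ is the identity matrix. Thus, the Tucker1 model is equivalent to regular matrix decomposition based on the representation $\mathbf{X}_{(1)} = \mathbf{A}\mathbf{G}_{(1)}$.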

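Model Estimation

Traditionally, the Tucker model has been estimated on the basis of updating the elements of each mode in turn, which for the least squares objective is commonly denoted ALS. By fitting the model using ALS, the estimation reduces to a sequence of regular matrix factorization problems. As a result, for least squares minimization, the solution of each mode can be solved by pseudoinverses, i.e.,

$$\mathbf{A} \leftarrow \mathbf{X}_{(1)}(\mathbf{G}_{(1)}(\mathbf{C} \otimes \mathbf{B})^{\top})^{\dagger}$$

$$\mathbf{B} \leftarrow \mathbf{X}_{(2)}(\mathbf{G}_{(2)}(\mathbf{C} \otimes \mathbf{A})^{\top})^{\dagger}$$

$$\mathbf{C} \leftarrow \mathbf{X}_{(3)}(\mathbf{G}_{(3)}(\mathbf{B} \otimes \mathbf{A})^{\top})^{\dagger}$$

$$\mathcal{G} \leftarrow \mathcal{X} \times_1 \mathbf{A}^{\dagger} \times_2 \mathbf{B}^{\dagger} \times_3 \mathbf{C}^{\dagger}.$$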

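The analysis simplifies when orthogonality is imposed [24] such that the estimation of the core can be omitted. Orthogonality can be imposed by estimating the loadings of each mode through the SVD forming the Higher-order Orthogonal Iteration (HOOI) [10,24],

$$\mathbf{A}\mathbf{S}^{(1)}\mathbf{V}^{(1)\top} = \mathbf{X}_{(1)}(\mathbf{C} \otimes \mathbf{B}),$$

$$\mathbf{B}\mathbf{S}^{(2)}\mathbf{V}^{(2)\top} = \mathbf{X}_{(2)}(\mathbf{C} \otimes \mathbf{A}),$$

$$\mathbf{C}\mathbf{S}^{(3)}\mathbf{V}^{(3)\top} = \mathbf{X}_{(3)}(\mathbf{B} \otimes \mathbf{A}),$$

such that $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$ are found as the first L, M, and N left singular vectors given by solving the right hand side by SVD. The core array is estimated upon convergence by $\mathcal{G} \leftarrow \mathcal{X} \times_1 \mathbf{A}^{\dagger} \times_2 \mathbf{B}^{\dagger} \times_3 \mathbf{C}^{\dagger}$. The above procedures are unfortunately not guaranteed to converge to the global optimum.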

A special case of the Tucker model is given by the HOSVD [29,32] where the loadings of each mode is


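Tucker Model (2)

• The Tucker model is not unique: applying invertible matrices to the core array and their inverses to the loading matrices gives an equivalent representation.


• Using the n-mode product, the matricization operator, and the Kronecker product, the Tucker model can be written in matricized form.

• Models that do not fully decompose the third-order tensor along all three modes are called the Tucker2 and Tucker1 models.

• The Tucker model is estimated by updating the components of each mode in turn. With a least-squares objective, this procedure is called ALS (alternating least squares).

Tucker Model (3)

• Orthogonality constraints can be imposed on the Tucker model; they simplify the analysis.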


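CP Model (1)

• The CP model can be regarded as a special case of the Tucker model in which the core array has the same size in every mode, L = M = N. The decomposition is defined as $\mathcal{X}^{I \times J \times K} \approx \sum_{d} \mathbf{a}_d \circ \mathbf{b}_d \circ \mathbf{c}_d$.


• Because of this restriction on the core, the CP model is, unlike the Tucker model, unique up to scaling and permutation of its components; arbitrary invertible matrices can no longer be applied.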
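A minimal sketch of the CP form as a sum of rank-one terms, together with the scaling ambiguity that remains: rescaling one column of one loading matrix and compensating in another leaves the tensor unchanged. The function name `cp_reconstruct` and the toy loadings are assumptions made for this illustration.

```python
def cp_reconstruct(A, B, C):
    """x[i][j][k] = sum over d of A[i][d] * B[j][d] * C[k][d]."""
    D = len(A[0])
    return [[[sum(A[i][d] * B[j][d] * C[k][d] for d in range(D))
              for k in range(len(C))]
             for j in range(len(B))]
            for i in range(len(A))]

# Toy loadings with D = 2 components (invented numbers).
A = [[1, 0], [0, 1]]
B = [[1, 2], [3, 4]]
C = [[1, 1], [2, 0]]
X = cp_reconstruct(A, B, C)

# Scale component 0 of A by 2 and component 0 of C by 1/2: same tensor.
A2 = [[2 * row[0], row[1]] for row in A]
C2 = [[0.5 * row[0], row[1]] for row in C]
X_scaled = cp_reconstruct(A2, B, C2)
```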


CP Model (2)

• The CP model is estimated mode by mode, with an extra step to remove the scaling ambiguity.
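The update equations on this slide did not survive extraction. As a hedged reconstruction, assuming the matricized form $\mathbf{X}_{(1)} \approx \mathbf{A}(\mathbf{C} \odot \mathbf{B})^{\top}$ and the Khatri–Rao pseudoinverse property (12) quoted earlier, the alternating least-squares updates would take the following form; this is a derivation from those two ingredients, not a transcription of the slide.

```latex
\begin{aligned}
\mathbf{A} &\leftarrow \mathbf{X}_{(1)} (\mathbf{C} \odot \mathbf{B})
   \left[ (\mathbf{C}^{\top}\mathbf{C}) \ast (\mathbf{B}^{\top}\mathbf{B}) \right]^{-1} \\
\mathbf{B} &\leftarrow \mathbf{X}_{(2)} (\mathbf{C} \odot \mathbf{A})
   \left[ (\mathbf{C}^{\top}\mathbf{C}) \ast (\mathbf{A}^{\top}\mathbf{A}) \right]^{-1} \\
\mathbf{C} &\leftarrow \mathbf{X}_{(3)} (\mathbf{B} \odot \mathbf{A})
   \left[ (\mathbf{B}^{\top}\mathbf{B}) \ast (\mathbf{A}^{\top}\mathbf{A}) \right]^{-1}
\end{aligned}
```

Each line follows from $\mathbf{A} = \mathbf{X}_{(1)}\big((\mathbf{C} \odot \mathbf{B})^{\top}\big)^{\dagger}$ and its analogues, with $\ast$ the elementwise product. One common way to handle the scaling ambiguity, assumed here rather than taken from the slide, is to normalize the columns of all but one loading matrix after each sweep.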





constraints. Fast prototyping and handling of sparse multiway arrays in Matlab is provided by the Tensor Toolbox [46]. For additional software, see also Refs 4, 24.

TENSOR FACTORIZATION FOR DATA MINING

The first applications of tensor decomposition were within the field of psychology in the 1970s, when the CP model was demonstrated to alleviate the rotational ambiguity in factor analysis, whereas the framework enabled higher-order interactions to be addressed. In 1981 Appellof and Davidson [3] pioneered the use of the CP model in chemistry for the analysis of fluorescence data, whereas Mocks [47] demonstrated in 1988 how the CP model was useful in the analysis of multisubject-evoked potentials of electroencephalography (EEG) data by reinventing the model under the name topographic component model. Since then tensor decompositions have found wide use in practically all fields of science ranging from signal processing, computer vision, and bioinformatics to web mining. In many of these studies, it has been proven that the use of tensor decomposition can explore relations and interactions between the modes of the data that are lost when resorting to traditional matrix approaches. In particular, tensor decomposition efficiently extracts the consistent patterns of activation while giving an intuitive account of how the measurements of each mode interact. However, tensor decomposition has not only proven useful for redundancy reduction (i.e., compression) but also, for many types of data, proven to account well for the underlying physics/dynamics of the system generating the data. In the following, some of the key applications of tensor factorization in data mining are given across a multitude of scientific fields, more or less in their historical order. This is in no way an exhaustive account of the many applications of tensor decomposition; however, the examples given will hopefully demonstrate some of the many benefits of multiway modeling for a variety of data and problem domains.

Psychology

The first applications of CP was within the field of psychometrics in 1970 pioneered by the work of Carroll and Chang [30] and Harshman [31]. Ref 30 introduced Canonical Decomposition in the context of analyzing multiple similarity or dissimilarity matrices from a variety of subjects. They applied the method to one data set on auditory tones from Bell Labs and to another data set of comparisons of countries based on the idea that simply averaging the data removed the different aspects present in the data. Ref 31 introduced PARAFAC because it eliminated the rotational ambiguity associated with two-dimensional PCA and thus has better uniqueness properties motivated by Cattell's principle of parallel proportional profiles [48]. PARAFAC was here applied to vowel-sound data where different individuals spoke different vowels and the formant (i.e., the pitch) was measured, i.e.,

$$\mathcal{X}^{\text{Subject} \times \text{Vowel} \times \text{Pitch}} \approx \sum_{d} \mathbf{a}_d^{\text{Subject}} \circ \mathbf{b}_d^{\text{Vowel}} \circ \mathbf{c}_d^{\text{Pitch}}. \quad (18)$$

FIGURE 4 | Example of a Tucker(2, 3, 2) analysis of the Chopin data $\mathcal{X}^{24\ \text{Preludes} \times 20\ \text{Scales} \times 38\ \text{Subjects}}$ described in Ref 49. The overall mean of the data has been subtracted prior to analysis. Black and white boxes indicate negative and positive variables, whereas the size of the boxes their absolute value. The model accounts for 40.42% of the variation in the data, whereas the model on the same data random permuted accounts for 2.41 ± 0.09% of the variation. As such, the data are very structured and compressible by the Tucker model.

Since these initial works both the CP as well as Tucker model, also referred to as N-mode PCA [2], have had a widespread application within social and behavioral sciences addressing questions such as 'Which group of subjects behave differently on which variables under which conditions?' [2] In Figure 4 is given a Tucker(2, 3, 2) analysis of 24 Chopin preludes based on 20 types of scoring scales evaluated by 38 judges/subjects [49], i.e.,

$$\mathcal{X}^{\text{Prelude} \times \text{Scales} \times \text{Subject}} \approx \sum_{lmn} g_{l,m,n}\, \mathbf{a}_l^{\text{Prelude}} \circ \mathbf{b}_m^{\text{Scales}} \circ \mathbf{c}_n^{\text{Subject}}. \quad (19)$$

The analysis extracts loadings that well span the dynamics of each mode, whereas the core array accounts



FIGURE 7 | Left panel: Tutorial dataset two of ERPWAVELAB [50] given by $\mathcal{X}^{64\ \text{Channels} \times 61\ \text{Frequency bins} \times 72\ \text{Time points} \times 11\ \text{Subjects} \times 2\ \text{Conditions}}$. Right panel: a three-component nonnegativity-constrained three-way CP decomposition of Channel × Time-Frequency × Subject-Condition and a three-component nonnegative matrix factorization of Channel × Time-Frequency-Subject-Condition. The two models account for 60% and 76% of the variation in the data, respectively. The matrix factorization assumes spatial consistency but individual time-frequency patterns of activation across the subjects and conditions, whereas the three-way CP analysis imposes consistency in the time-frequency patterns across the subjects and conditions. As such, these most consistent patterns of activations are identified by the model.

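down-weighted in the extracted estimates of the consistent event-related activations.

$$\mathcal{X}^{\text{Channel} \times \text{Time} \times \text{Trial}} \approx \sum_{d=1}^{D} \mathbf{a}_d^{\text{Channel}} \circ \mathbf{b}_d^{\text{Time}} \circ \mathbf{c}_d^{\text{Trial}} \quad (22)$$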

Unfortunately, violation of multilinearity in the data can cause degeneracy in the CP model, see also Figure 6. To avoid CP degeneracy, artificial restrictions in the form of orthogonality have been imposed or alternatively the signals have been analyzed via purely additive models based on analysis of amplitudes in a spectral representation, see also Ref 6 and references therein. In Ref 6, these ad-hoc measures were found unsatisfactory. Rather than restricting the CP model, a pseudo-multilinear model using the unambiguous CP model combined with a time-shift accounting for explicit delays based on the shiftCP representation was proposed. In Figure 6, it can be seen that accounting for shift can indeed alleviate CP degeneracy while the consistent pattern of activations are identified; for details on the shiftCP approach, see also Ref 6 and references therein.

Signal Processing

Multilinear algebra has recently gained a large interest within the signal processing community largely due to its applications in higher-order statistics (HOS) [9–11,52]. In the original work on independent component analysis (ICA) by Comon [9], it was demonstrated how the blind source separation problem

$$\mathbf{X} = \mathbf{A}\mathbf{S} + \mathbf{E} \quad (23)$$

such that $\mathbf{S}$ is statistically independent and $\mathbf{E}$ residual noise can be solved through the CP decomposition of some higher-order cumulants due to the important property that cumulants obey multilinearity [9,52]. The first-order cumulant corresponds to the mean and the second-order cumulant to the variance such that

$$E(\mathbf{X}) = \mathbf{A}E(\mathbf{S}) + E(\mathbf{E}) \quad (24)$$

$$\mathrm{Cov}(\mathbf{X}) = \mathbf{A}\,\mathrm{Cov}(\mathbf{S})\,\mathbf{A}^{\top} + \mathrm{Cov}(\mathbf{E}) \quad (25)$$

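where $E(\cdot)$ denotes expectation and Cov the covariance. For a general Nth-order cumulant, we have

$$\mathcal{K}^{(N)}_{X} = \mathcal{K}^{(N)}_{S} \times_1 \mathbf{A} \times_2 \mathbf{A} \times \cdots \times_N \mathbf{A} + \mathcal{K}^{(N)}_{E} \quad (26)$$

where $\mathcal{K}^{(n)}_{S}$ is a diagonal matrix for independent $\mathbf{S}$.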

The ICA problem can potentially be uniquely solved by identifying $\mathbf{A}$ in the symmetric CP decomposition of any cumulants of order N > 2, which for the third- or fourth-order cumulant is given by

$$\mathcal{K}^{(3)}_{X} \approx \mathcal{D} \times_1 \mathbf{A} \times_2 \mathbf{A} \times_3 \mathbf{A} \quad (27)$$

$$\mathcal{K}^{(4)}_{X} \approx \mathcal{D} \times_1 \mathbf{A} \times_2 \mathbf{A} \times_3 \mathbf{A} \times_4 \mathbf{A}, \quad (28)$$

where $\mathcal{D}$ is a diagonal tensor. Generally speaking, it becomes harder to estimate cumulants from sample data as the order increases, i.e., longer datasets are required to obtain the same accuracy. Hence, in practice, the use of higher-order statistics is usually restricted to third- and fourth-order cumulants and because the third-order cumulants for symmetric distributions are zero, fourth-order cumulants are often used [10].



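• Decomposition of second-order tensors (matrices) is already a powerful tool for understanding and analyzing data, so the analysis of Nth-order tensors of order three and higher is expected to become another important technique.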



