A Compressive Classification Framework for High-Dimensional Data

Muhammad Naveed Tabassum, Member, IEEE, and Esa Ollila, Member, IEEE

We propose a compressive classification framework for settings where the data dimensionality is significantly higher than the sample size. The proposed method, referred to as compressive regularized discriminant analysis (CRDA), is based on linear discriminant analysis and has the ability to select significant features by using joint-sparsity-promoting hard thresholding in the discriminant rule. Since the number of features is larger than the sample size, the method also uses state-of-the-art regularized sample covariance matrix estimators. Several analysis examples on real data sets, including image, speech signal and gene expression data, illustrate the promising improvements offered by the proposed CRDA classifier in practice. Overall, the proposed method gives fewer misclassification errors than its competitors, while at the same time achieving accurate feature selection results. The open-source R package and MATLAB toolbox of the proposed method (named compressiveRDA) are freely available.

Index Terms—Classification, covariance matrix estimation, discriminant analysis, feature selection, high-dimensional statistics, joint-sparse recovery

I. INTRODUCTION

High-dimensional (HD) classification is at the core of numerous contemporary statistical studies. An increasingly common occurrence is the collection of large amounts of information on each individual sample point, even though the number of sample points themselves may remain relatively small. Typical examples are gene expression and protein mass spectrometry data, and other areas of computational biology. Regularization and shrinkage are commonly used tools in many applications, such as regression or classification, to overcome the significant statistical challenges posed by huge-dimension, low-sample-size (HDLSS) data settings, in which the number of features, p, is often several orders of magnitude larger than the sample size, n (i.e., p ≫ n). In this paper, we propose a novel classification framework for such HD data.

In many applications, the ideal classifier of an HD input vector should have three characteristics: (i) the proportion of correctly classified observations is high, (ii) the method is computationally feasible, allowing the training of the classifier in real time on a personal computer, and (iii) the classifier selects only the significant features, thus yielding interpretable models. The last property is important, for example, in gene expression studies, where it is important to identify which genes influence the trait (e.g., a disease). This is obviously important for the biological understanding of the disease process, but it can also help in the development of clinical tests for early diagnosis [1]. We propose an HD classification method that takes into account all three of the aforementioned key aspects.

Some of the standard off-the-shelf classification methods either lack feature elimination or require that it is done beforehand. For example, the support vector machine (SVM) classifier uses all features. Some other classifiers, on the other hand, have a higher computational cost; e.g., the Logit-ASPLS approach (with R-package plsgenomics) [2] depends on three hyperparameters. This increases the computational complexity of the method, especially when finer grids of hyperparameters are used and tuned using cross-validation (CV).

The research was supported by the Academy of Finland grant no. 298118, which is gratefully acknowledged.

M.N. Tabassum and E. Ollila are with Aalto University, Dept. of Signal Processing and Acoustics, P.O. Box 15400, FI-00076 Aalto, Finland (e-mail: {muhammad.tabassum, esa.ollila}@aalto.fi).

In this paper, we propose a classification framework, referred to as the compressive regularized discriminant analysis (CRDA) method, which is computationally efficient and performs feature selection during the learning process. Our simulation studies and real data analysis examples illustrate that it often achieves superior classification performance compared to off-the-shelf classification approaches in HDLSS settings.

Many classification techniques assign a p-dimensional observation x to one of the G classes (groups or populations) based on the following rule:

$$ \mathbf{x} \in \text{group } \hat g = \arg\max_{g} \; d_g(\mathbf{x}), \qquad (1) $$

where d_g(x) is called the discriminant function for population g ∈ {1, . . . , G}. In linear discriminant analysis (LDA), d_g(x) is a linear function of x, d_g(x) = x^⊤ b_g + c_g, for some constant c_g ∈ R and vector b_g ∈ R^p, g = 1, . . . , G. The vector b_g = b_g(Σ) depends on the unknown covariance matrix Σ of the populations (via its inverse), which is commonly estimated by the pooled sample covariance matrix (SCM). In the HD setting, the SCM is no longer invertible, and therefore a regularized SCM (RSCM) Σ̂ is used for constructing an estimated discriminant function d̂_g(x). Such approaches are commonly referred to as regularized LDA methods, which we refer to shortly as RDA; see, e.g., [3], [1], [4], [5], [6].

Next, note that if the i-th entry of b_g is zero, then the i-th feature does not contribute to the classification of the g-th population. To eliminate unnecessary features, one can shrink b_g using element-wise soft-thresholding, as in the shrunken centroids RDA (SCRDA) method [3]. These methods are often difficult to tune because the shrinkage threshold parameter is the same across all groups, whereas different populations would often benefit from different shrinkage intensities. Consequently, they tend to yield higher false-positive rates (FPR).

Element-wise shrinkage does not achieve simultaneous feature elimination, as a feature eliminated from group i may still affect the discriminant function of group j. The proposed CRDA method instead uses joint sparsity via ℓ_{q,1}-norm based hard-thresholding.


Another advantage is that the resulting shrinkage parameter is much easier to tune and interpret: namely, the joint-sparsity level K ∈ {1, 2, . . . , p} instead of a shrinkage threshold Δ ∈ [0, ∞) as in SCRDA. CRDA also uses two recent state-of-the-art regularized covariance matrix estimators, proposed in [7] and [8], [9], respectively. The first estimator, called the Ell-RSCM estimator [7], is designed to attain the minimum mean squared error and is fast to compute due to automatic data-adaptive tuning of the involved regularization parameter. The second estimator is a penalized SCM (PSCM) proposed in [8], which we refer to as Rie-PSCM, that minimizes a penalized negative Gaussian likelihood function with a Riemannian distance penalty. Although a closed-form solution of this optimization problem is not available, the estimator can be computed using the fast iterative Newton-Raphson algorithm developed in [9].

We test the effectiveness of the proposed CRDA methods using partially synthetic genomic data as well as nine real data sets, and compare the obtained results against seven state-of-the-art HD classifiers that also perform feature selection. The results illustrate that, overall, the proposed CRDA method gives fewer misclassification errors than its competitors while achieving, at the same time, accurate feature selection with minimal computational effort. Preliminary results of the CRDA approach have appeared in [10].

The paper is structured as follows. Section II describes the regularized LDA procedure and the regularized covariance matrix estimators used. In Section III, CRDA is introduced as a compressive version of regularized LDA that performs feature selection simultaneously with classification. Classification results for three different types of real data sets (image, speech, genomics) are also presented. Section IV illustrates the promising performance of CRDA in feature selection for a partially synthetic data set, while Section V provides analysis results for six real gene expression data sets against a large number of benchmark classifiers. Section VI provides discussion and concludes the paper. MATLAB and R toolboxes are available at http://users.spa.aalto.fi/esollila/crda and https://github.com/mntabassm/compressiveRDA.

Notations: We use ‖·‖_F to denote the Frobenius (matrix) norm, defined by ‖A‖²_F = tr(A^⊤A) for all matrices A. The ℓ_q-norm of a vector a ∈ R^n is defined as

$$ \|\mathbf{a}\|_q = \begin{cases} \big(|a_1|^q + \cdots + |a_n|^q\big)^{1/q}, & \text{for } q \in [1, \infty), \\ \max\{|a_i| : i = 1, \ldots, n\}, & \text{for } q = \infty. \end{cases} $$

The soft-thresholding (ST) operator is defined as

$$ S_\Delta(b) = \operatorname{sign}(b)\,(|b| - \Delta)_+, $$

where (x)_+ = max(x, 0) for x ∈ R and Δ > 0 is a threshold constant. Finally, I(·) denotes the indicator function, which takes the value 1 when its argument is true and 0 otherwise, and E[·] denotes expectation.
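For reference, these two operations are straightforward to express in code; the following is an illustrative sketch (Python/NumPy; not part of the compressiveRDA toolboxes):

```python
import numpy as np

def lq_norm(a, q):
    """l_q-norm of a vector: (sum |a_i|^q)^(1/q) for q in [1, inf), max |a_i| for q = inf."""
    return np.max(np.abs(a)) if np.isinf(q) else np.sum(np.abs(a) ** q) ** (1.0 / q)

def soft_threshold(b, delta):
    """Element-wise soft-thresholding: sign(b) * max(|b| - delta, 0)."""
    return np.sign(b) * np.maximum(np.abs(b) - delta, 0.0)
```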

II. REGULARIZED LDA

We are given a p-variate random vector x which we need to classify into one of G groups or populations. In LDA, one assumes that the class populations are p-variate multivariate normal (MVN) with a common positive definite symmetric covariance matrix Σ shared by all classes, but with distinct class mean vectors μ_g ∈ R^p, g = 1, . . . , G. The problem is then to classify x to one of the MVN populations N_p(μ_g, Σ), g = 1, . . . , G. Sometimes prior knowledge is available on the proportions of each population, and we denote by p_g the prior probabilities of the classes (∑_{g=1}^G p_g = 1). LDA uses the rule (1) with discriminant function

$$ d_g(\mathbf{x}) = \mathbf{x}^\top \mathbf{b}_g - \frac{1}{2}\,\boldsymbol\mu_g^\top \mathbf{b}_g + \ln p_g, $$

where b_g = Σ^{-1}μ_g for g = 1, . . . , G.

The LDA rule involves a set of unknown parameters: the class mean vectors {μ_g}_{g=1}^G and the covariance matrix Σ. These are estimated from the training data set X = (x_1 ··· x_n), which consists of n_g observations from each of the classes (g = 1, . . . , G). Let y_i denote the class label associated with the i-th observation, so y_i ∈ {1, . . . , G}. Then n_g = ∑_{i=1}^n I(y_i = g) is the number of observations belonging to the g-th population, and we denote by π_g = n_g/n the relative sample proportions. We assume the observations in the training data set X are centered by the sample mean vectors of the classes,

$$ \hat{\boldsymbol\mu}_g = \bar{\mathbf{x}}_g = \frac{1}{n_g}\sum_{i:\, y_i = g} \mathbf{x}_i. \qquad (2) $$

Since the data matrix X is centered group-wise, the pooled (over groups) SCM can be written simply as

$$ \mathbf{S} = \frac{1}{n}\,\mathbf{X}\mathbf{X}^\top. \qquad (3) $$

In practice, an observation x is classified using an estimated discriminant function,

$$ \hat d_g(\mathbf{x}) = \mathbf{x}^\top \hat{\mathbf{b}}_g - \frac{1}{2}\,\hat{\boldsymbol\mu}_g^\top \hat{\mathbf{b}}_g + \ln \pi_g, \qquad (4) $$

where b̂_g = Σ̂^{-1}μ̂_g, g = 1, . . . , G, and Σ̂ is an estimator of Σ. Note that in (4) the prior probabilities p_g are replaced by their estimates, π_g. Commonly, the pooled SCM S in (3) is used as the estimator Σ̂ of the covariance matrix Σ.
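As a concrete illustration of the estimated rule (4), the following minimal sketch (Python/NumPy; the function and variable names are illustrative assumptions, not the compressiveRDA API) classifies a single observation given the estimated quantities:

```python
import numpy as np

def lda_predict(x, B_hat, M_hat, pi_hat):
    """Classify x by maximizing the estimated discriminant (4).

    B_hat  : (p, G) matrix with columns b_g = Sigma_hat^{-1} mu_hat_g
    M_hat  : (p, G) matrix of class sample means
    pi_hat : (G,) relative sample proportions n_g / n
    """
    d = x @ B_hat - 0.5 * np.sum(M_hat * B_hat, axis=0) + np.log(pi_hat)
    return int(np.argmax(d))   # index of the predicted class
```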

Since p ≫ n, the pooled SCM is no longer invertible. To avoid the singularity of the estimated covariance matrix, a commonly used approach (cf. [11]) is to use a regularized SCM (RSCM) that shrinks the estimator towards a target matrix. One such method is reviewed in subsection II-A. An alternative approach is the penalized likelihood approach, in which a penalty term is added to the likelihood function with the aim of penalizing covariance matrices that are far from the desired target matrix. One prominent estimator of this type is reviewed in subsection II-B. There are many other covariance matrix estimators for high-dimensional data, including Bayes estimators [12], structured covariance matrix estimators [13], [14], robust shrinkage estimators [15], [16], [17], [18], etc., but most of these are not applicable in the HDLSS setting due to their high computational cost.


A. Regularized SCM estimators

A commonly used RSCM estimator (see, e.g., [11], [19], [7]) is of the form

$$ \hat{\boldsymbol\Sigma}(\alpha) = \alpha\,\mathbf{S} + (1-\alpha)\,\frac{\operatorname{tr}(\mathbf{S})}{p}\,\mathbf{I}, \qquad (5) $$

where α ∈ [0, 1) denotes the shrinkage (or regularization) parameter and I denotes an identity matrix of appropriate size. We use the automated method for choosing the tuning parameter α proposed in [7]; the respective estimators are called Ell1-RSCM and Ell2-RSCM. These estimators are based on an estimate α̂ that is a consistent estimator of the value α_o minimizing the mean squared error,

$$ \alpha_o = \arg\min_{\alpha \in [0,1)} \; \mathrm{E}\big[\, \|\hat{\boldsymbol\Sigma}(\alpha) - \boldsymbol\Sigma\|_{\mathrm{F}}^2 \,\big]. \qquad (6) $$

See Appendix A for details.

Note that SCRDA [3] uses a different estimator, Σ̂ = αS + (1 − α)I, where the regularization parameter α is chosen using cross-validation (CV). The benefit of the Ell1-RSCM and Ell2-RSCM estimators is that they avoid the computationally intensive CV scheme and are fast to compute. In the HD settings, another computational burden is related to inverting the matrix Σ̂ in (5). The matrix inversion can be done efficiently using the SVD-trick detailed in Appendix B.
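A minimal sketch of the RSCM in (5) (Python/NumPy; the fixed shrinkage value in the usage comment is a placeholder, whereas the Ell1/Ell2-RSCM estimators choose α automatically as described in Appendix A):

```python
import numpy as np

def rscm(S, alpha):
    """Regularized SCM of eq. (5): alpha*S + (1-alpha)*(tr(S)/p)*I."""
    p = S.shape[0]
    return alpha * S + (1.0 - alpha) * (np.trace(S) / p) * np.eye(p)

# Usage with a placeholder shrinkage value (Ell-RSCM would estimate alpha from the data):
# S = pooled SCM from eq. (3);  Sigma_hat = rscm(S, alpha=0.3)
```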

B. Penalized SCM estimator using Riemannian distance

Assume that {x_i}_{i=1}^n is a random sample from an MVN distribution. Then the maximum likelihood estimator of Σ is found by minimizing −(2/n) times the Gaussian log-likelihood function,

$$ L(\boldsymbol\Sigma; \mathbf{S}) = \operatorname{tr}\big(\boldsymbol\Sigma^{-1}\mathbf{S}\big) + \log|\boldsymbol\Sigma|, $$

over the set of positive definite symmetric p × p matrices, denoted by S^p_{++}. When n > p and the SCM S is non-singular, the minimum is attained at Σ̂ = S. Since p ≫ n and S is singular, it is common to add a penalty function Π(Σ) to the likelihood function to enforce a certain structure on the covariance matrix and to guarantee that the combined objective function has a unique (positive definite) solution. One such penalized SCM (PSCM), proposed in [8], uses the Riemannian distance of Σ from a scaled identity matrix, Π(Σ; m) = ‖log(Σ) − log(m) I‖²_F, where m > 0 is a shrinkage constant. One then solves the penalized likelihood problem

$$ \hat{\boldsymbol\Sigma} = \arg\min_{\boldsymbol\Sigma \in \mathbb{S}^{p}_{++}} \; L(\boldsymbol\Sigma;\mathbf{S}) + \eta\,\|\log(\boldsymbol\Sigma) - \log(m)\,\mathbf{I}\|_{\mathrm{F}}^2, \qquad (7) $$

where η > 0 is the penalty parameter. As in [9], we choose m as the mean of the eigenvalues (d_1 ≥ ··· ≥ d_p ≥ 0) of the SCM, i.e.,

$$ m = \frac{\operatorname{tr}(\mathbf{S})}{p} = \frac{1}{p}\sum_{i=1}^{p} d_i. $$

The eigenvalues of Σ̂ are then shrunk towards m, and Σ̂ → mI in the limit as η → ∞ [8]. The solution to (7), which we refer to as the Rie-PSCM estimator, does not have a closed form, but a relatively efficient Newton-Raphson algorithm [9] can be used to compute it. We choose the associated penalty parameter η using the cross-validation procedure proposed in [9].
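For intuition, the following sketch (Python/NumPy; illustrative only) evaluates the penalized objective in (7) for a candidate covariance matrix; the actual Rie-PSCM estimate is computed with the Newton-Raphson algorithm of [9], which is not reproduced here:

```python
import numpy as np

def sym_logm(A):
    """Matrix logarithm of a symmetric positive definite matrix via eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * np.log(w)) @ V.T

def rie_pscm_objective(Sigma, S, eta):
    """Objective of eq. (7): tr(Sigma^{-1} S) + log|Sigma| + eta * ||log(Sigma) - log(m) I||_F^2,
    with m = tr(S)/p (only the objective is sketched, not the solver)."""
    p = S.shape[0]
    m = np.trace(S) / p
    _, logdet = np.linalg.slogdet(Sigma)
    loglik = np.trace(np.linalg.solve(Sigma, S)) + logdet
    penalty = np.linalg.norm(sym_logm(Sigma) - np.log(m) * np.eye(p)) ** 2
    return loglik + eta * penalty
```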

III. COMPRESSIVE RDA

In order to explain the proposed compressive RDA (CRDA) approach, we first write the discriminant rule in (4) in vector form as

$$ \hat{\mathbf{d}}(\mathbf{x}) = \big(\hat d_1(\mathbf{x}), \ldots, \hat d_G(\mathbf{x})\big) = \mathbf{x}^\top \hat{\mathbf{B}} - \frac{1}{2}\operatorname{diag}\big(\hat{\mathbf{M}}^\top \hat{\mathbf{B}}\big) + \ln\boldsymbol\pi, \qquad (8) $$

where ln π = (ln π_1, . . . , ln π_G), M̂ = (μ̂_1 ··· μ̂_G), and the coefficient matrix B̂ is defined as

$$ \hat{\mathbf{B}} = \big(\hat{\mathbf{b}}_1 \;\cdots\; \hat{\mathbf{b}}_G\big) = \hat{\boldsymbol\Sigma}^{-1}\hat{\mathbf{M}}. \qquad (9) $$

The discriminant function in (8) is linear in x with coefficient matrix B̂ ∈ R^{p×G}. Hence, if the i-th row of the coefficient matrix B̂ is a zero vector 0, then the i-th feature does not contribute to the classification rule and hence can be eliminated. If the coefficient matrix B̂ is row-sparse, then the discriminant rule achieves simultaneous feature elimination.

To achieve row-sparsity, we apply a hard-thresholding (HT) operator H_K(B̂, φ) on the coefficient matrix B̂, defined as a transform that retains only the elements of the K rows of B̂ that attain the largest values of an HT-selector function φ : R^G → [0, ∞), and sets the elements of all other rows to zero. See Figure 1 for an illustration. The compressive RDA discriminant function is then defined as

$$ \bar{\mathbf{d}}(\mathbf{x}) = \big(\bar d_1(\mathbf{x}), \ldots, \bar d_G(\mathbf{x})\big) = \mathbf{x}^\top \bar{\mathbf{B}} - \frac{1}{2}\operatorname{diag}\big(\hat{\mathbf{M}}^\top \bar{\mathbf{B}}\big) + \ln\boldsymbol\pi, \qquad (10) $$

where

$$ \bar{\mathbf{B}} = H_K(\hat{\mathbf{B}}, \varphi). \qquad (11) $$

By its construction, the CRDA classifier uses only K features and thus performs feature selection simultaneously with classification. The tuning parameter K determines how many features are selected. As the HT-selector function φ, we use the ℓ_q-norm,

$$ \varphi_q(\mathbf{b}) = \|\mathbf{b}\|_q, \quad \text{for } q \in \{1, 2, \infty\}, $$

or the sample variance,

$$ \varphi_0(\mathbf{b}) = s^2(\mathbf{b}) = \frac{1}{G-1}\sum_{i=1}^{G} |b_i - \bar b|^2, $$

where b̄ = (1/G)∑_{i=1}^G b_i. A two-dimensional CV search, described in subsection III-A, is used to choose an appropriate joint-sparsity level K and the best HT-selector function.
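A minimal sketch of the row-wise hard-thresholding step (11) (Python/NumPy; the function name and string arguments are illustrative assumptions, not the compressiveRDA API):

```python
import numpy as np

def hard_threshold_rows(B, K, phi="inf"):
    """Keep the K rows of B with the largest HT-selector value, zero out the rest (eq. (11)).

    phi: '1', '2', 'inf' for row-wise l_q norms, or 'var' for the row sample variance.
    """
    if phi == "var":
        score = np.var(B, axis=1, ddof=1)            # sample variance of each row
    else:
        q = np.inf if phi == "inf" else float(phi)
        score = np.linalg.norm(B, ord=q, axis=1)     # row-wise l_q norms
    keep = np.argsort(score)[-K:]                    # indices of the K largest scores
    B_bar = np.zeros_like(B)
    B_bar[keep] = B[keep]
    return B_bar, keep
```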

Shrunken Centroids Regularized Discriminant Analysis (SCRDA) [3] uses the estimator Σ̂ = αS + (1 − α)I and element-wise soft shrinkage. SCRDA can be expressed in the form (10) with

$$ \bar{\mathbf{B}} = S_\Delta(\hat{\mathbf{B}}), \qquad (12) $$

where the soft-thresholding function is applied element-wise to B̂. A disadvantage of SCRDA is that the parameter Δ ∈ [0, ∞) is the same across all groups, whereas different populations would often benefit from different shrinkage intensities. Another drawback is that the upper bound for the shrinkage threshold Δ is data dependent. In SCRDA, a two-dimensional CV search is also used to find the tuning parameter pair (α, Δ).


Fig. 1: Computing the K-row-sparse coefficient matrix B̄ in (11): the p rows of B̂ = Σ̂⁻¹M̂ = (b̂_1 ··· b̂_G) are scored by the row-wise ℓ_q-norm or sample variance ("length"), sorted in descending order, and the K best-scoring rows are retained while the remaining p − K rows are set to zero.

A. Cross-validation scheme

The joint-sparsity order K ∈ {1, 2, . . . , p} and the HT-selector function φ ∈ [φ] = {φ_0, φ_1, φ_2, φ_∞} are chosen using a cross-validation (CV) scheme. For K we use a logarithmically spaced grid of values, whose largest element is bounded as follows. For each φ, we calculate

$$ K_{\mathrm{UB}}(\varphi) = \sum_{i=1}^{p} I\Big(\varphi(\hat{\mathbf{b}}_{i:}) \ge \operatorname{ave}_j\{\varphi(\hat{\mathbf{b}}_{j:})\}\Big), \qquad (13) $$

where b̂_{i:} denotes the i-th row vector of B̂. Thus K_UB(φ_1), for example, counts the number of row vectors of B̂ whose ℓ_1-norm is larger than the average of the ℓ_1-norms of the row vectors of B̂. As the final upper bound K_UB we use the minimum value, K_UB = min_{φ∈[φ]} K_UB(φ). We then use a logarithmically spaced grid of J = 10 values [K] = {K_1, K_2, . . . , K_J}, with 1 < K_1 < K_2 < ··· < K_J = K_UB, starting from K_1 = ⌊0.05 · p⌋.

Apart from the way the lower and upper bounds for the grid of K values are chosen, the CV scheme is conventional. We split the training data set X into Q folds and use Q − 1 folds as a training set and one fold as the validation set. We fit the model (the CRDA classifier as defined by (10) and (11)) on the training set for each φ ∈ [φ] and K ∈ [K] and compute the misclassification error rate on the validation set. This procedure is repeated so that each of the Q folds is left out in turn as a validation set, and the misclassification rates are accumulated. We then choose the pair in the grid [φ] × [K] that had the smallest misclassification error rate. In case of ties, we choose the one having the smaller value of K, i.e., we prefer the model with the smaller number of features.
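A sketch of the grid construction (Python/NumPy; the names and the rounding of the logarithmically spaced grid to integers are assumptions of this sketch, not necessarily the exact implementation in the toolboxes):

```python
import numpy as np

def k_grid(B, J=10, selectors=("var", "1", "2", "inf")):
    """Build the logarithmically spaced joint-sparsity grid [K] used in the CV search.

    K_UB(phi) counts rows of B whose selector value is at least the average
    selector value over all rows, cf. eq. (13); K_UB is the minimum over phi.
    """
    p = B.shape[0]
    k_ubs = []
    for phi in selectors:
        if phi == "var":
            score = np.var(B, axis=1, ddof=1)
        else:
            q = np.inf if phi == "inf" else float(phi)
            score = np.linalg.norm(B, ord=q, axis=1)
        k_ubs.append(int(np.sum(score >= score.mean())))
    K_ub = min(k_ubs)
    K1 = int(np.floor(0.05 * p))
    grid = np.unique(np.round(np.logspace(np.log10(K1), np.log10(K_ub), J)).astype(int))
    return grid, K_ub
```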

B. CRDA variants

We consider the following three CRDA variants:

• CRDA1 uses the Ell1-RSCM estimator to compute the coefficient matrix B̂. The tuning parameters, K and φ, are chosen using the CV procedure of subsection III-A.

• CRDA2 is as CRDA1 except that B̂ is computed using the Ell2-RSCM estimator.

• CRDA3 uses Rie-PSCM to compute the coefficient matrix B̂; φ_∞(·) = ‖·‖_∞ is used as the HT-selector and K = K_UB(φ_∞) as the joint-sparsity level.

Uniform priors (π_g = 1/G, g = 1, . . . , G) are used in all analysis results.

C. Analysis of real data sets

Next, we explore the performance of the CRDA variants in the classification of three different types of data: speech data (data set #1), image data (data set #2) and gene expression data (data set #3), described in Table II. The first data set is constructed for classification of five English vowel letters spoken twice by thirty different subjects, while the second data set consists of modified observations from the MNIST data set of handwritten digits (namely, digits 4 and 9) used in the NIPS 2003 feature selection challenge. These data sets are available at the UC Irvine machine learning repository [20] under the names Isolet and Gisette, respectively. The third data set is from Khan et al. [21] and consists of small round blue cell tumors (SRBCTs) found in children, classified as BL (Burkitt lymphoma), EWS (Ewing's sarcoma), NB (neuroblastoma) and RMS (rhabdomyosarcoma). The available data is split randomly into training and test sets of fixed sizes, n and n_t, respectively, such that the relative proportions of observations in the classes are preserved as accurately as possible. Then L = 10 such random splits are performed and averages of the error rates are reported.

The basic metrics used are the test error rate (TER), the feature selection rate (FSR) and the average computational time (ACT) in seconds. TER is the misclassification rate calculated by classifying the observations of the test set using the classification rule estimated from the training set. FSR is defined as FSR = S/p, where S denotes the number of features used by the classifier; it is a measure of the interpretability of the chosen model. The FSR of CRDA3 will naturally always be higher than the FSR of CRDA1 or CRDA2.

The results of the study, reported in Table I, are divided into three groups. The first group consists of the proposed CRDA classification methods, and the middle group consists of five state-of-the-art classifiers (cf. Table III). The third group consists of three standard off-the-shelf classifiers commonly used in machine learning, which do not perform feature selection. The methods in the third group are RF, which refers to a random forest using decision trees (where the maximum number of splits per tree was 9 and the number of learning cycles was 150); LogitBoost, which refers to the LogitBoost method [22] using stumps (i.e., two-terminal-node decision trees) as base learners and 120 boosting iterations; and LinSVM, which refers to a linear support vector machine (SVM) using a one-versus-one (OVO) coding design and a linear kernel. Note that since p ≫ n, one can always find a separating hyperplane, and hence a linear kernel suffices for the SVM.

We first discuss the results of the CRDA variants when compared against one another.


In the case of data set #2, CRDA1 and CRDA3 obtained notably better misclassification rates than CRDA2. In the case of data set #3, CRDA1 and CRDA2 obtained perfect 0% misclassification results while using only 5% of all p = 2308 features. This is a remarkable achievement, especially when compared to the TER/FSR rates of the competing estimators. In the case of data set #1, CRDA3 had better performance than CRDA1 and CRDA2, but it attained this result using more features than CRDA1 and CRDA2.

Next, we compare CRDA against the methods in the middle group. We notice from Table I that one of the CRDA variants is always the best performing method in terms of both accuracy (TER rates) and feature selection rates (FSR). Only in the case of data set #2 is SCRDA able to achieve an equally good error rate (6.3%) as CRDA3.

Finally, we compare the CRDA methods against the methods in the third group. First, we notice that the linear SVM is the best performing method in the third group, and for data set #2 it achieves the best error rate among all methods. However, it chooses all features.

As noted earlier, for data set #3 (Khan et al.), CRDA1 and CRDA2 attained perfect classification (0% error rate) while choosing only 5% of the 2308 genes. The heat map of 115 randomly selected genes is shown in Figure 2a, while Figure 2b shows the heat map of the 115 genes picked by CRDA1. While the former shows random patterns, clearly visible patterns can be seen in the latter heat map. Besides prognostic classification, the set of genes found important by CRDA1 can be used to develop an understanding of the genes that influence SRBCTs.

Fig. 2: For data set #3 (Khan et al.), the figure displays the heat map of (a) 115 randomly selected genes and (b) the 115 genes selected by the CRDA1 classifier, with samples grouped by class (BL, EWS, NB, RMS). We note that the CRDA1 classifier obtained a 0% error rate on all 10 Monte Carlo splits of the data into training and test sets, and outperformed all other classifiers.

IV. FEATURE SELECTION PERFORMANCE

Next, we create a partially synthetic gene expression data set using the third data set in Table II to study how well the different classifiers are able to detect differentially expressed (DE) genes.

TABLE I: Classification results of the CRDA variants and their competitors (see Table III) for data sets #1–#3 in terms of test error rate (TER) and feature selection rate (FSR), reported in percentages.

               #1 Isolet-vowels     #2 Gisette          #3 Khan et al.
               TER      FSR         TER      FSR        TER      FSR
  CRDA1        2.8      21.8        7.0      18.8       0        5.0
  CRDA2        2.7      16.3        14.4     8.2        0        5.0
  CRDA3        1.3      45.1        6.3      36.7       1.2      35.5
  PLDA         3.6      100         40.5     50.8       18.8     82.0
  SCRDA        1.6      88.2        6.3      87.0       2.8      81.5
  varSelRF     3.0      28.9        7.9      18.8       2.4      78.8
  Logit-ASPLS  26.4     98.6        27.8     90.0       43.2     100
  SPCALDA      19.1     73.7        6.4      47.4       3.2      46.9
  SVM          1.3      100.0       6.0      100.0      0.8      100.0
  LogitBoost   3.0      100.0       7.7      100.0      10.4     100.0
  RF           1.8      100.0       9.5      100.0      1.2      100.0

Data set #3 consists of expression levels of p = 2308 genes. We randomly kept p_1 = 115 (i.e., ≈ 5% of p) of the gene expression covariates as significant, i.e., differentially expressed, features, and the remaining p_0 = 2193 irrelevant (non-DE) features are generated from N_n(0, 0.01 × I).

Since we now have a ground truth of differentially expressed genes available, we may evaluate the estimated models using the false-positive rate (FPR) and the false-negative rate (FNR). A false positive is a feature that is picked by the classifier but is not DE in any of the classes, in the sense that μ_{g,j} = 0 for all g = 1, . . . , G. A false negative is a feature that is DE in the true model, in the sense that μ_{g,j} ≠ 0 for all g = 1, . . . , G, but is not picked by the classifier. Formally,

$$ \mathrm{FPR} = \frac{F}{p_0} \quad\text{and}\quad \mathrm{FNR} = \frac{p_1 - T}{p_1}, $$

where p_1 is the number of truly positive (DE) features, p_0 is the number of truly negative (non-DE) features, F is the number of false positives and T is the number of true positives; that is, S = F + T gives the total number of features picked by the classifier. Again, recall that in the case of the CRDA methods the joint-sparsity level K used in the hard-thresholding operator equals S. Note that both FPR and FNR should be as small as possible for exact and accurate selection (and elimination) of genes.

First, we compare the gene selection performance of CRDA with the conventional gene selection approach [29], which performs a t-test and computes the p-values via permutation analysis using 10,000 permutations. A gene is considered to be differentially expressed when it shows both statistical and biological significance. Statistical significance is attested when the obtained p-value of the test statistic is below the significance level (false alarm rate), say p_FA = 0.05. Biological significance is attested when the log-ratio of the fold change of mean gene expressions is outside the interval [−log2(c), log2(c)] for some cutoff limit c, say c = 1.2. Figure 3(a) displays such a volcano plot, which shows the −log10 of the p-values of the t-tests for each gene against the log2 fold change.


TABLE II: Summary of the real data sets used in this paper.

  data set                   p      n    n_t   {n_1, ..., n_G}                                 G    Description
  #1 Isolet-vowels [20]      617    100  200   {20, 20, 20, 20, 20}                            5    Spoken English vowel letters
  #2 Gisette [20]            5000   350  6650  {175, 175}                                      2    OCR digit recognition ('4' vs '9')
  #3 Khan et al. [21]        2308   38   25    {5, 14, 7, 12}                                  4    SRBCTs
  #4 Su et al. [23]          5565   62   40    {15, 16, 17, 14}                                4    Multiple mammalian tissues
  #5 Gordon et al. [24]      12533  109  72    {90, 19}                                        2    Lung cancer
  #6 Burczynski et al. [25]  22283  76   51    {35, 25, 16}                                    3    Inflammatory bowel diseases (IBD)
  #7 Yeoh et al. [26]        12625  148  100   {9, 16, 38, 12, 26, 47}                         6    Acute lymphoblastic leukemia (ALL)
  #8 Tirosh et al. [27]      13056  114  77    {23, 22, 23, 46}                                4    Yeast species
  #9 Ramaswamy et al. [28]   16063  117  73    {7, 6, 7, 7, 13, 7, 6, 6, 18, 7, 7, 7, 7, 12}   14   Cancer

The statistical significance level used is shown as the dotted horizontal line and the biological significance bounds as dotted vertical lines.

The plot illustrates that there are 19 out of the 115 differentially expressed genes which are both statistically and biologically significant (genes in the upper left and upper right corners of the plot), along with a few which are biologically significant but not statistically significant (lower left and lower right corners). When each class is considered in turn against the rest, 60 out of a total of 115 DE genes are picked as being DE. The results for CRDA2-based gene selection are shown in Figure 3(b) for the same data set, where we have plotted the gene index, i, against the sample variance, s_i, of the i-th row vector of the coefficient matrix B̂. Out of the 115 DE genes, CRDA2 has selected 107.

Fig. 3: Feature selection results for the partially synthetic genomic data. (a) Volcano plot (−log10(p-value) versus log2 fold change, with the fold-change threshold 1.2 and the p-value cutoff indicated) illustrating that 19 (9 up-regulated and 10 down-regulated) genes are selected when comparing the third class against the others. When each class is compared in turn against the others, 60 out of the 115 DE genes are picked. (b) Plot of gene index, i, versus the log10 of the length of the i-th row vector of the coefficient matrix B̂ produced by the HT-selector φ_0, i.e., length = φ_0(b̂_{i:}); all genes, the differentially expressed genes and the selected genes are indicated. CRDA2 has selected 107 differentially expressed genes out of a total of 115.

As a baseline, we further compute the test error rate of the naive classifier that assigns all observations in the test set to the class that has the most observations in the training set. Figure 4(a) displays the box plots of FSR and TER for the CRDA methods and the competing state-of-the-art classifiers. Again the CRDA classifiers obtained the smallest average TER while choosing the smallest number of features. Figure 4(b) provides the box plots of FPR and FNR for all methods. The proposed CRDA methods offer the best feature selection accuracy. On the contrary, the SCRDA and varSelRF methods produce large false-positive rates, meaning that they have chosen a large set of genes as DE while they are not. PLDA performs equally well as CRDA in terms of FPR/FNR rates, but its misclassification rates are considerably larger. Furthermore, the CRDA methods exhibit the smallest amount of variation (i.e., standard error) in all measures (TER/FSR/FPR/FNR). We also notice that varSelRF does not perform very well compared to CRDA or other regularized LDA techniques such as SCRDA.

Fig. 4: Classification accuracy results for the partially synthetic gene expression data, for CRDA1, CRDA2, CRDA3, PLDA, SCRDA, varSelRF, SPCALDA and logit-ASPLS. (a) Boxplots of test error rates (TER) and feature selection rates (FSR), with the TER of the naive classifier shown for reference. (b) Boxplots of false negative rates (FNR) and false positive rates (FPR), reported in percentages. Results are from a simulation study of 10 random splits of the data into training and test sets.

V. ANALYSIS OF REAL GENE EXPRESSION DATA

We compare the performance of the proposed CRDA methods to seven state-of-the-art classifiers on a diverse set of real gene expression (microarray) data sets. These are data sets #4–#9 in Table II, which constitute a representative set of gene expression data sets available in the open literature, with varying numbers of features (p), sample sizes (n_g) and classes (G). As earlier, in each Monte Carlo trial the data is divided randomly into training and test sets of fixed sizes, n and n_t, respectively, such that the relative proportions of observations in the classes are preserved as accurately as possible. We do not include CRDA3 in this study due to the high computational cost of computing the Rie-PSCM covariance matrix estimator.


TABLE III: State-of-the-art methods included in our comparative study. All of the included methods are designed with the aim of performing feature selection during the learning process. Tuning parameters are chosen using the default method implemented in the R-packages, including the grids for parameter tuning. When CV is used, a five-fold CV is selected.

  Method        Reference                                                              R-package or other   Version
  PLDA          Penalized LDA [4]                                                      penalizedLDA         1.1
  SCRDA         Shrunken centroids RDA [3]                                             rda                  1.0.2.2
  varSelRF      Random forest (RF) with variable selection [30]                        varSelRF             0.7.8
  Logit-ASPLS   Adaptive sparse partial least squares based logistic regression [2]    plsgenomics          1.5.2
  SPCALDA       Supervised principal component analysis based LDA [31]                 SPCALDA              1.0
  Lasso         Regularized multinomial regression with l1 (Lasso) penalty [32]        GLMNet               3.0-2
  EN            Regularized multinomial regression with Elastic Net (EN) penalty [32]  GLMNet               3.0-2


The Gene Expression Omnibus (GEO) data repository contains data sets #6 and #8 with accession numbers GDS1615 and GDS2910, respectively. The ninth data set (also known as the global cancer map (GCM) data set) is available online [33]. The rest of the data sets, namely data sets #4, #5 and #7, are included in the R-package datamicroarray1.

Figure 5 displays the average TER and FSR values. A CRDA method obtained the lowest error rate for 5 out of the 6 data sets considered; only for data set #5 (Gordon) did Logit-ASPLS have better performance than CRDA. Furthermore, CRDA achieved this top performance while picking a greatly reduced number of genes: the proportion of genes selected ranged from 5.1% to 23.4%. Only Lasso and EN selected fewer genes than the CRDA methods, but their error rates were much higher in all cases except for data set #5 (Gordon).

For some data sets, such as the seventh (Yeoh) and eighth (Tirosh), SCRDA and CRDA obtained comparable TER values, but the former fails to eliminate as many genes. Again we observe that many of the competing classifiers, such as varSelRF, PLDA or SPCALDA, perform rather poorly for some data sets; this is especially the case for the sixth (Burczynski) and ninth (Ramaswamy) data sets.

Data set #9 (Ramaswamy) is a very difficult data set for any classification method: it contains G = 14 classes, p = 16063 genes and only 117 observations in the training set. The results obtained by CRDA clearly illustrate its top performance compared to the others in very demanding HDLSS data scenarios. The CRDA classifiers outperformed most of the methods by a significant margin: CRDA gives the smallest error rate (25.2%) while at the same time reducing the number of features (genes) used in the classification procedure. The next best method is varSelRF, but it is still far behind (32.6%). We note that the Logit-ASPLS method failed for this data set due to computational and memory issues and is hence excluded.

Finally, the bottom panel of Figure 5 displays the ACT values (in seconds) of the methods. Variation in ACT between different data sets is due to the different numbers of features (p) and observations (n) of the data sets. In terms of computational complexity, CRDA loses to Lasso, EN and LinSVM. However, compared to other methods that are designed to perform feature selection, CRDA is among the most efficient methods.

1https://github.com/ramhiser/datamicroarray

Particularly, one can note that Logit-ASPLS and varSelRF are computationally very intensive.

VI. DISCUSSIONS AND CONCLUSIONS

We proposed a novel compressive regularized discriminant analysis (CRDA) method for high-dimensional data sets, where the data dimensionality p is much larger than the sample size n (i.e., p ≫ n, where p is very large, often several thousands).

The method showcases accurate classification results and feature selection ability while being fast to compute in real time. Feature selection is important for post-selection inference in various problems, as it provides insights into which features (e.g., genes) are particularly important in expressing a particular trait (e.g., cancer type).

Our simulation studies and real data analysis examples illustrate that CRDA offers top performance in all three considered facets: classification accuracy, feature selection and computational complexity. Software toolboxes (both MATLAB and R) to reproduce the results are available; see Section I for links to the packages.

APPENDIX A
TUNING PARAMETER SELECTION OF THE ELL1- AND ELL2-RSCM ESTIMATORS

The Ell1-RSCM and Ell2-RSCM estimators [7] compute an estimate α̂ of the optimal shrinkage parameter α_o in [7] as

$$ \hat\alpha = \frac{\hat\gamma - 1}{(\hat\gamma - 1) + \hat\kappa(2\hat\gamma + p)/n + (\hat\gamma + p)/(n-1)}, \qquad (14) $$

where γ̂ is an estimator of the sphericity parameter γ = p tr(Σ²)/tr(Σ)² and κ̂ is an estimate of the elliptical kurtosis parameter κ = (1/3) kurt(x_i), calculated as the average sample kurtosis of the marginal variables scaled by 1/3. The formula above is derived under the assumption that the data is a random sample from an unspecified elliptically symmetric distribution with mean vector μ and covariance matrix Σ.

The Ell1-RSCM and Ell2-RSCM estimators differ only in the estimate γ̂ of the sphericity parameter used in (14). The estimate of sphericity used by Ell1-RSCM is

$$ \hat\gamma = \min\!\left(p,\; \max\!\left(1,\; \frac{n}{n-1}\Big(p\,\operatorname{tr}(\hat{\mathbf{S}}^2) - \frac{p}{n}\Big)\right)\right), \qquad (15) $$


Fig. 5: Test error rates (TER) and feature selection rates (FSR) in percentages, and average computational times (ACT) in seconds, reported for the last six real data sets given in Table II (Su et al., Gordon et al., Burczynski et al., Yeoh et al., Tirosh et al., and Ramaswamy et al.). The compared methods are CRDA1, CRDA2, EN, Lasso, LinSVM, Logit-ASPLS, PLDA, SCRDA, SPCALDA and varSelRF; the TER of the naive classifier is shown for reference. Results are averages over L = 10 random splits of the data into training and test sets.

where Ŝ is the sample spatial sign covariance matrix, defined as

$$ \hat{\mathbf{S}} = \frac{1}{n}\sum_{i=1}^{n} \frac{(\mathbf{x}_i - \hat{\boldsymbol\mu})(\mathbf{x}_i - \hat{\boldsymbol\mu})^\top}{\|\mathbf{x}_i - \hat{\boldsymbol\mu}\|^2}, $$

where μ̂ = arg min_μ ∑_{i=1}^n ‖x_i − μ‖ is the spatial median.

In Ell2-RSCM, the estimate of sphericity γ̂ is computed as

$$ \hat\gamma = \min\!\left(p,\; \max\!\left(1,\; b_n\Big(\frac{p\,\operatorname{tr}(\mathbf{S}^2)}{\operatorname{tr}(\mathbf{S})^2} - a_n\frac{p}{n}\Big)\right)\right), \qquad (16) $$

where

$$ a_n = \left(\frac{n}{n+\hat\kappa}\right)\left(\frac{n}{n-1}+\hat\kappa\right), \qquad b_n = \frac{(\hat\kappa + n)(n-1)^2}{(n-2)\big(3\hat\kappa(n-1) + n(n+1)\big)}, $$

and S is the pooled SCM in (3).
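The following sketch mirrors the Ell2-type computation of α̂ via (14) and (16) as written above (Python/NumPy; the kurtosis estimate κ̂ is assumed to be computed separately, and all names are illustrative, not the regscm/compressiveRDA API):

```python
import numpy as np

def ell2_alpha(S, n, kappa):
    """Shrinkage parameter alpha_hat of eq. (14) with the Ell2 sphericity estimate (16).

    S     : pooled SCM (p x p), eq. (3)
    n     : sample size
    kappa : estimate of the elliptical kurtosis parameter (1/3 of the average
            excess kurtosis of the marginal variables)
    """
    p = S.shape[0]
    a_n = (n / (n + kappa)) * (n / (n - 1) + kappa)
    b_n = (kappa + n) * (n - 1) ** 2 / ((n - 2) * (3 * kappa * (n - 1) + n * (n + 1)))
    gamma = b_n * (p * np.trace(S @ S) / np.trace(S) ** 2 - a_n * p / n)
    gamma = min(p, max(1.0, gamma))          # clip to [1, p], cf. (16)
    alpha = (gamma - 1) / ((gamma - 1) + kappa * (2 * gamma + p) / n + (gamma + p) / (n - 1))
    return alpha, gamma
```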

APPENDIX B
ON COMPUTING Σ̂⁻¹

The matrix inversion can be performed efficiently using the SVD-trick [3]. The SVD of X is

$$ \mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^\top, $$

where U ∈ R^{p×m}, D ∈ R^{m×m}, V ∈ R^{n×m} and m = rank(X). Direct computation of this SVD is time consuming, and the trick is that V and D can be computed first from the SVD of X^⊤X = ṼD̃²Ṽ^⊤, which is only an n×n matrix. Here Ṽ is an orthogonal n×n matrix whose first m columns are V, and D̃ is an n×n diagonal matrix whose upper-left m×m corner is D. After we compute V and D from the SVD of X^⊤X, we may compute U from X as U = XVD^{-1}. Then, using the SVD representation of the SCM, S = (1/n)UD²U^⊤, and simple algebra, one obtains a simple formula for the inverse of the regularized SCM in (5):

$$ \hat{\boldsymbol\Sigma}^{-1} = \mathbf{U}\left[\Big(\frac{\alpha}{n}\mathbf{D}^2 + (1-\alpha)\eta\,\mathbf{I}_m\Big)^{-1} - \frac{1}{(1-\alpha)\eta}\,\mathbf{I}_m\right]\mathbf{U}^\top + \frac{1}{(1-\alpha)\eta}\,\mathbf{I}_p, $$

where η = tr(S)/p = tr(D²)/(np). This reduces the complexity from O(p³) to O(pn²), which is a significant saving in the p ≫ n case.
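A sketch of this inversion (Python/NumPy; illustrative names, with the data matrix stored with observations as rows), including a brute-force check against the direct inverse:

```python
import numpy as np

def rscm_inverse_svd(X, alpha):
    """Inverse of the RSCM (5) via the SVD-trick of Appendix B (O(p n^2) instead of O(p^3)).

    X holds the group-wise centered training data with rows as observations (n x p),
    so that the pooled SCM is S = X^T X / n and the small matrix is X X^T (n x n).
    """
    n, p = X.shape
    w, Vfull = np.linalg.eigh(X @ X.T)       # eigendecomposition of the n x n matrix
    keep = w > 1e-12
    d2, V = w[keep], Vfull[:, keep]          # squared singular values D^2 and V (n x m)
    U = X.T @ V / np.sqrt(d2)                # U = X^T V D^{-1}  (p x m)
    eta = d2.sum() / (n * p)                 # eta = tr(S)/p = tr(D^2)/(n p)
    c = (1.0 - alpha) * eta
    inner = 1.0 / (alpha * d2 / n + c) - 1.0 / c     # diagonal of the m x m bracket
    return (U * inner) @ U.T + np.eye(p) / c

# Check against the direct inverse on a small example:
# S = X.T @ X / n;  Sigma = alpha * S + (1 - alpha) * (np.trace(S) / p) * np.eye(p)
# np.allclose(rscm_inverse_svd(X, alpha), np.linalg.inv(Sigma))   # -> True
```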

ACKNOWLEDGMENT

The research was partially supported by the Academy of Finland grant no. 298118, which is gratefully acknowledged.

REFERENCES

[1] R. Tibshirani, T. Hastie, B. Narasimhan, and G. Chu, “Class prediction by nearest shrunken centroids, with applications to DNA microarrays,” Statistical Science, pp. 104–117, 2003.


[2] G. Durif, L. Modolo, J. Michaelsson, J. E. Mold, S. Lambert-Lacroix, and F. Picard, “High dimensional classification with combined adaptive sparse PLS and logistic regression,” Bioinformatics, vol. 34, no. 3, pp. 485–493, 2018.

[3] Y. Guo, T. Hastie, and R. Tibshirani, “Regularized linear discriminant analysis and its application in microarrays,” Biostatistics, vol. 8, no. 1, pp. 86–100, 2006.

[4] D. M. Witten and R. Tibshirani, “Penalized classification using Fisher’s linear discriminant,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 73, no. 5, pp. 753–772, 2011.

[5] A. Sharma, K. K. Paliwal, S. Imoto, and S. Miyano, “A feature selection method using improved regularized linear discriminant analysis,” Machine Vision and Applications, vol. 25, no. 3, pp. 775–786, 2014.

[6] E. Neto, F. Biessmann, H. Aurlien, H. Nordby, and T. Eichele, “Regularized linear discriminant analysis of EEG features in dementia patients,” Frontiers in Aging Neuroscience, vol. 8, 2016.

[7] E. Ollila and E. Raninen, “Optimal shrinkage covariance matrix estimation under random sampling from elliptical distributions,” IEEE Transactions on Signal Processing, vol. 67, no. 10, pp. 2707–2719, May 2019.

[8] P. L. Yu, X. Wang, and Y. Zhu, “High dimensional covariance matrix estimation by penalizing the matrix-logarithm transformed likelihood,” Computational Statistics & Data Analysis, vol. 114, pp. 12–25, 2017.

[9] D. E. Tyler and M. Yi, “Shrinking the sample covariance matrix using convex penalties on the matrix-log transformation,” arXiv preprint arXiv:1903.08281, 2019.

[10] M. N. Tabassum and E. Ollila, “Compressive regularized discriminant analysis of high-dimensional data with applications to microarray studies,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 4204–4208.

[11] O. Ledoit and M. Wolf, “A well-conditioned estimator for large-dimensional covariance matrices,” Journal of Multivariate Analysis, vol. 88, no. 2, pp. 365–411, 2004.

[12] A. Coluccia, “Regularized covariance matrix estimation via empirical Bayes,” IEEE Signal Processing Letters, vol. 22, no. 11, pp. 2127–2131, 2015.

[13] B. Meriaux, C. Ren, M. N. El Korso, A. Breloy, and P. Forster, “Robust estimation of structured scatter matrices in (mis)matched models,” Signal Processing, vol. 165, pp. 163–174, 2019.

[14] A. Wiesel, T. Zhang et al., “Structured robust covariance estimation,” Foundations and Trends in Signal Processing, vol. 8, no. 3, pp. 127–216, 2015.

[15] E. Ollila and D. E. Tyler, “Regularized M-estimators of scatter matrix,” IEEE Transactions on Signal Processing, vol. 62, no. 22, pp. 6059–6070, 2014.

[16] F. Pascal, Y. Chitour, and Y. Quek, “Generalized robust shrinkage estimator and its application to STAP detection problem,” IEEE Transactions on Signal Processing, vol. 62, no. 21, pp. 5640–5651, 2014.

[17] Y. Sun, P. Babu, and D. P. Palomar, “Regularized Tyler’s scatter estimator: Existence, uniqueness, and algorithms,” IEEE Transactions on Signal Processing, vol. 62, no. 19, pp. 5143–5156, 2014.

[18] Y. Chen, A. Wiesel, Y. C. Eldar, and A. O. Hero, “Shrinkage algorithms for MMSE covariance estimation,” IEEE Transactions on Signal Processing, vol. 58, no. 10, pp. 5016–5029, 2010.

[19] E. Ollila, “Optimal high-dimensional shrinkage covariance estimation for elliptical distributions,” in Proc. European Signal Processing Conference (EUSIPCO 2017), Kos, Greece, 2017, pp. 1689–1693.

[20] D. Dua and C. Graff, “UCI machine learning repository,” 2017. [Online]. Available: http://archive.ics.uci.edu/ml

[21] J. Khan, J. S. Wei, M. Ringner, L. H. Saal, M. Ladanyi, F. Westermann, F. Berthold, M. Schwab, C. R. Antonescu, C. Peterson et al., “Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks,” Nature Medicine, vol. 7, no. 6, p. 673, 2001.

[22] J. Friedman, T. Hastie, R. Tibshirani et al., “Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors),” The Annals of Statistics, vol. 28, no. 2, pp. 337–407, 2000.

[23] A. I. Su, M. P. Cooke, K. A. Ching, Y. Hakak, J. R. Walker, T. Wiltshire, A. P. Orth, R. G. Vega, L. M. Sapinoso, A. Moqrich, A. Patapoutian, G. M. Hampton, P. G. Schultz, and J. B. Hogenesch, “Large-scale analysis of the human and mouse transcriptomes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 7, pp. 4465–4470, Apr. 2002.

[24] G. J. Gordon, R. V. Jensen, L.-L. Hsiao, S. R. Gullans, J. E. Blumenstock, S. Ramaswamy, W. G. Richards, D. J. Sugarbaker, and R. Bueno, “Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma,” Cancer Research, vol. 62, no. 17, pp. 4963–4967, 2002.

[25] M. E. Burczynski, R. L. Peterson, N. C. Twine, K. A. Zuberek, B. J. Brodeur, L. Casciotti, V. Maganti, P. S. Reddy, A. Strahs, F. Immermann, W. Spinelli, U. Schwertschlag, A. M. Slager, M. M. Cotreau, and A. J. Dorner, “Molecular classification of Crohn’s disease and ulcerative colitis patients using transcriptional profiles in peripheral blood mononuclear cells,” The Journal of Molecular Diagnostics, vol. 8, no. 1, pp. 51–61, 2006.

[26] E.-J. Yeoh, M. E. Ross, S. A. Shurtleff, W. K. Williams, D. Patel, R. Mahfouz, F. G. Behm, S. C. Raimondi, M. V. Relling, A. Patel et al., “Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling,” Cancer Cell, vol. 1, no. 2, pp. 133–143, 2002.

[27] I. Tirosh, A. Weinberger, M. Carmi, and N. Barkai, “A genetic signature of interspecies variations in gene expression,” Nature Genetics, vol. 38, no. 7, p. 830, 2006.

[28] S. Ramaswamy, P. Tamayo, R. Rifkin, S. Mukherjee, C.-H. Yeang, M. Angelo, C. Ladd, M. Reich, E. Latulippe, J. P. Mesirov et al., “Multiclass cancer diagnosis using tumor gene expression signatures,” Proceedings of the National Academy of Sciences, vol. 98, no. 26, pp. 15149–15154, 2001.

[29] X. Cui and G. A. Churchill, “Statistical tests for differential expression in cDNA microarray experiments,” Genome Biology, vol. 4, no. 4, p. 210, 2003.

[30] R. Diaz-Uriarte, “GeneSrF and varSelRF: a web-based tool and R package for gene selection and classification using random forest,” BMC Bioinformatics, vol. 8, no. 1, p. 328, Sep. 2007.

[31] Y. Niu, N. Hao, and B. Dong, “A new reduced-rank linear discriminant analysis method and its applications,” Statistica Sinica, vol. 28, no. 1, pp. 189–202, 2018.

[32] J. Friedman, T. Hastie, and R. Tibshirani, “Regularization paths for generalized linear models via coordinate descent,” Journal of Statistical Software, vol. 33, no. 1, p. 1, 2010.

[33] Cancer Program Legacy Publication Resources - Broad Institute, “Global cancer map dataset,” 2001, http://portals.broadinstitute.org/cgi-bin/cancer/publications/view/61, last accessed 2018-09-18.

MUHAMMAD NAVEED TABASSUM (SM’14) expects to complete the D.Sc. (Tech.) degree in signal processing and data science at Aalto University, Espoo, Finland, in 2020. He received an M.Sc. degree in Electrical Engineering from King Saud University, Riyadh, Saudi Arabia, in 2015, and worked in multidisciplinary teams on funded research projects at the NPST and RFTONICS research centers. Earlier, he developed iOS apps while working in agile software development teams at Coeus Solutions GmbH, Lahore Branch, Pakistan. His general career interests involve the research and development of software solutions to problems in signal processing, data science, electromagnetic radiation and mobile communications. He is particularly interested in omics data analysis, optimization algorithms, and statistical, machine and deep learning approaches.


Esa Ollila (M’03) received the M.Sc. degree in mathematics from the University of Oulu, Oulu, Finland, in 1998, the Ph.D. degree in statistics with honours from the University of Jyväskylä, Jyväskylä, Finland, in 2002, and the D.Sc. (Tech.) degree with honours in signal processing from Aalto University, Aalto, Finland, in 2010. From 2004 to 2007, he was a Postdoctoral Fellow and from August 2010 to May 2015 an Academy Research Fellow of the Academy of Finland. He has also been a Senior Lecturer with the University of Oulu. Since June 2015, he has been an Associate Professor of Signal Processing at Aalto University, Finland. He is also an Adjunct Professor (Statistics) of Oulu University. During the fall term of 2001, he was a Visiting Researcher with the Department of Statistics, Pennsylvania State University, State College, PA, while he spent the academic year 2010–2011 as a Visiting Postdoctoral Research Associate with the Department of Electrical Engineering, Princeton University, Princeton, NJ, USA. He is a member of the EURASIP SAT in Theoretical and Methodological Trends in Signal Processing and a co-author of the recent textbook Robust Statistics for Signal Processing, published by Cambridge University Press in 2018. Recently, he organized a special session on Robust Statistics in Signal Processing at CAMSAP 2019 and a special issue on Statistical Signal Processing Solutions and Advances for Data Science in Signal Processing (Elsevier). His research interests lie at the intersection of statistical signal processing, high-dimensional statistics, bioinformatics and machine learning, with an emphasis on robust statistical methods and modeling.