

2013 AMALTHEA REU Program

Analysis of Medical Data Using Dimensionality Reduction Techniques
Robert E. Colgan a, David E. Gutierrez b, Jugesh Sundram c, Gnana Bhaskar Tenali d

a Computer Science, Columbia University, b Computer Science, Florida State University, c Mechanical Engineering, Florida Institute of Technology, d Mathematical Sciences, Florida Institute of Technology

This material is based upon work/research supported in part by the

National Science Foundation under Grant No. 1263011.

We studied a dataset from the African American Study of Kidney Disease and Hypertension (AASK), consisting of 116 instances of 5251 features (courtesy: Dr. M. Lipkowitz of Georgetown University Hospital and Dr. M. Subasi of FIT). All patients in the dataset suffer from Chronic Kidney Disease (CKD). Patients are classified as either slow or fast progressors. Features correspond to serum proteomic levels (peaks extracted from raw SELDI-TOF mass spectrometry data).

Applying k-nearest neighbors (k = 3) to the unreduced data yields an overall accuracy of 0.58, with specificity 0.551 and sensitivity 0.623. The table at right shows overall accuracy (Acc), sensitivity (Se), and specificity (Sp) for k-nearest neighbors on DM-reduced data using several kernel functions (timesteps = 2).
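As a concrete sketch of the evaluation above, the snippet below computes overall accuracy, sensitivity, and specificity for leave-one-out k-nearest neighbors (k = 3) in plain NumPy. The AASK data are not public, so a random 116 x 5251 matrix stands in for the proteomic features; the labels, seed, and helper name `knn_predict` are illustrative assumptions, not the study's code.

```python
# Sketch only: random stand-in data, since the AASK dataset is not public.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(116, 5251))   # stand-in for serum proteomic peaks
y = rng.integers(0, 2, size=116)   # stand-in labels: 1 = fast, 0 = slow progressor

def knn_predict(X, y, k=3):
    """Leave-one-out k-NN: classify each point by majority vote of its k
    nearest neighbors (Euclidean distance), excluding the point itself."""
    sq = np.sum(X * X, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)   # pairwise squared distances
    np.fill_diagonal(d2, np.inf)                       # a point may not vote for itself
    nearest = np.argsort(d2, axis=1)[:, :k]
    return (y[nearest].mean(axis=1) > 0.5).astype(int)

pred = knn_predict(X, y, k=3)
tp = np.sum((pred == 1) & (y == 1)); fn = np.sum((pred == 0) & (y == 1))
tn = np.sum((pred == 0) & (y == 0)); fp = np.sum((pred == 1) & (y == 0))

accuracy = (tp + tn) / len(y)
sensitivity = tp / (tp + fn)   # proportion of fast progressors correctly identified
specificity = tn / (tn + fp)   # proportion of slow progressors correctly identified
```

On random labels all three numbers should hover near 0.5, which is also a useful baseline to keep in mind when reading the kernel-comparison table.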

Kidney Dataset

NLDR techniques such as DM can effectively transform high-dimensional medical data to as few as one dimension while maintaining its important characteristics, allowing us to classify data points with accuracy similar to that on the raw data. This saves computational time and space.

DR methods involve feature transformation. Therefore, in most cases an interpretation of the reduced data may not be possible in terms of the original features. However, certain important biomarkers could be devised based on the reduced data.

Classification of certain data can be complicated due to its intrinsic features. DR methods can improve the results of classification algorithms.

Future work: study DR methods on kidney datasets to develop suitable biomarkers.

Conclusions

High-dimensional data can be difficult to analyze, almost impossible to visualize, and expensive to process and store due to the so-called "curse of dimensionality." In many cases, the high-dimensional data points may all lie on or close to a much lower-dimensional surface, or manifold, so that the intrinsic dimensionality of the data is much lower. In that case, the data could be described with fewer dimensions, allowing us to mitigate the curse of dimensionality. Transforming the high-dimensional representation of the data to a lower dimension without losing important information is the central problem of dimensionality reduction (DR). Many DR methods have been developed, including classical (linear) Principal Component Analysis (PCA) and newer (nonlinear) methods such as Diffusion Maps (DM). Most methods perform well on some types of data but poorly on others. We applied different methods to medical data, including breast tissue tumor data and kidney proteomics data. To evaluate the performance of a reduction method, we attempted to classify the data in the reduced dimension and evaluated the accuracy.

Abstract

We illustrate the use of DR methods through the following example (Shlens 2005): consider the motion of a ball attached to a spring. We set up three cameras to capture the motion of the ball. Each camera captures an x and y coordinate every frame for N frames. Thus we obtain N data points, each with 6 dimensions.

By applying a dimensionality reduction technique, we can uncover the one-dimensional representation of the ball's motion, as shown above. This reduced data better reflects the motion of the ball and is also easier to store and process.
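The camera example lends itself to a short numerical check. The sketch below is a hedged reconstruction, not the poster's code: it simulates the 1-D spring motion, projects it through three made-up camera matrices into 6-D, and runs PCA by eigendecomposing the covariance. The first principal component should capture nearly all of the variance.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 500
t = np.linspace(0.0, 20.0, N)
s = np.cos(2.0 * t)                       # the true 1-D displacement

# The ball moves along the x-axis in 3-D; each camera projects the 3-D
# position onto its own (made-up) 2-D image plane, giving 6-D observations.
position = np.column_stack([s, np.zeros(N), np.zeros(N)])
views = []
for _ in range(3):
    P = rng.normal(size=(3, 2))           # an arbitrary camera projection
    views.append(position @ P)
X = np.hstack(views) + 0.01 * rng.normal(size=(N, 6))   # small sensor noise

# PCA: eigendecompose the covariance of the centered data.
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / (N - 1))
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]      # descending order

explained = eigvals[0] / eigvals.sum()    # variance captured by PC 1
reduced = Xc @ eigvecs[:, :1]             # the recovered 1-D motion
```

With small sensor noise, `explained` comes out close to 1 and `reduced` matches the original displacement up to sign and scale.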

Reduce high-dimensional data to a lower-dimensional space, while maintaining the most important characteristics, in order to understand its underlying structure and visualize and process it efficiently.

Objective

Dimensionality Reduction

Figure: Cameras A, B, and C each record N frames of the ball's 2-D image coordinates; dimensionality reduction recovers the one-dimensional motion.

The Wisconsin Diagnostic Breast Cancer (WDBC) dataset consists of 569 data points classified as malignant or benign. Each instance contains 30 features describing characteristics such as radius, perimeter, texture, and smoothness of the cell nuclei, taken from images of fine-needle aspirates of breast tissue. We tested the classification accuracy of k-nearest neighbors on the original data and the data after reducing it to 3-D, 2-D, and 1-D with diffusion maps.

The table to the right shows the overall classification accuracy, sensitivity (proportion of malignant tumors correctly identified), and specificity (proportion of benign tumors correctly identified), with k = 11.

Breast Cancer Dataset

Dimensions Accuracy Sensitivity Specificity

30 (Original) 0.933 0.874 0.968

30 to 3 0.928 0.852 0.973

30 to 2 0.913 0.837 0.959

30 to 1 0.908 0.835 0.952

In cluster analysis, most clustering algorithms, like k-means, depend on the number of clusters k as input. To determine the optimal k, a "cluster counting" algorithm such as the gap statistic or X-means is employed.
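To make the cluster-counting idea concrete, here is a hedged sketch of the gap statistic with a tiny Lloyd's k-means: compare the log within-cluster dispersion on the data against uniform reference data and pick the k with the largest gap. This is not the RSC algorithm; the farthest-point initialization and the argmax selection rule are simplifications chosen for this illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def farthest_point_init(X, k):
    """Deterministic seeding: start at the first point, then repeatedly add
    the point farthest from all centers chosen so far."""
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    return np.array(centers)

def kmeans_dispersion(X, k, iters=50):
    """Lloyd's algorithm; returns the within-cluster sum of squares W_k."""
    centers = farthest_point_init(X, k)
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return ((X - centers[labels]) ** 2).sum()

def gap_best_k(X, k_max=6, n_ref=5):
    """Gap(k) = mean log W_k on uniform reference data minus log W_k on the
    data; return the k with the largest gap (a simplified selection rule)."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps = []
    for k in range(1, k_max + 1):
        ref = np.mean([np.log(kmeans_dispersion(rng.uniform(lo, hi, X.shape), k))
                       for _ in range(n_ref)])
        gaps.append(ref - np.log(kmeans_dispersion(X, k)))
    return int(np.argmax(gaps)) + 1

# Three well-separated blobs: the gap statistic should report k = 3.
X = np.vstack([rng.normal(c, 0.05, size=(60, 2))
               for c in ([0, 0], [6, 0], [0, 6])])
best_k = gap_best_k(X)
```

The within-cluster dispersion drops sharply once k reaches the true cluster count and only slowly after, while the uniform reference declines smoothly, so the gap peaks at the true k.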

A novel clustering algorithm, Redundant Seed Clustering (RSC), is proposed, which returns k, the best count of clusters, and simultaneously partitions the data into k clusters by returning the cluster centroids and class labels.

Redundant Seed Clustering

A few of the DR techniques we studied are shown below.

DR Methods

Dimensionality reduction
  Linear methods: Principal Component Analysis (PCA), Independent Component Analysis (ICA)
  Nonlinear methods: Locally Linear Embedding (LLE), Diffusion Maps (DM), Kernel PCA (KPCA)

The intrinsic geometry of data points lying on a nonlinear manifold is better captured by applying nonlinear dimensionality reduction (NLDR) methods than linear methods. This point is highlighted by applying PCA (a linear DR method) and LLE (an NLDR method) to the "Swiss roll" dataset.

Nonlinear DR

Figure: the original Swiss roll in 3-D; the Swiss roll reduced to 2-D by PCA; the Swiss roll reduced to 2-D by LLE.

NLDR Methods

Diffusion Maps attempts to discover the underlying structure of the data by considering random walks along the surface of the data manifold. It maps points into a "diffusion space" where distance is based on the probability of paths.
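The random-walk construction can be sketched in a few lines. The following is a minimal, hedged version in plain NumPy, without the density normalization or other refinements a full implementation may apply: Gaussian affinities are row-normalized into a Markov transition matrix, and the leading nontrivial eigenvectors, scaled by their eigenvalues raised to the number of timesteps, give the embedding.

```python
import numpy as np

def diffusion_map(X, n_dims=2, sigma=1.0, timesteps=2):
    sq = np.sum(X * X, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)   # pairwise squared distances
    K = np.exp(-d2 / (2.0 * sigma ** 2))               # Gaussian affinities
    P = K / K.sum(axis=1, keepdims=True)               # row-stochastic Markov matrix
    eigvals, eigvecs = np.linalg.eig(P)
    eigvals, eigvecs = eigvals.real, eigvecs.real      # spectrum is real: P is similar to a symmetric matrix
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Skip the trivial constant eigenvector (eigenvalue 1); weight the rest
    # by eigenvalue^timesteps so slowly decaying diffusion directions dominate.
    return eigvecs[:, 1:n_dims + 1] * eigvals[1:n_dims + 1] ** timesteps

# Toy usage on random data; sigma and timesteps here are arbitrary choices.
rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))
Y = diffusion_map(X, n_dims=2, sigma=2.0, timesteps=2)
```

Raising the eigenvalues to a power is what the `timesteps` parameter in the tables and captions controls: larger powers damp fast-mixing directions more aggressively.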

Locally Linear Embedding reconstructs each data point from a linear combination of its k nearest neighbors. Although the data may lie on a nonlinear manifold, LLE assumes that each point and its k nearest neighbors lie on or close to a linear manifold at a local level.
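The two LLE steps described here translate directly into code. The sketch below is a from-scratch illustration (the parameter values and regularization constant are assumptions, not any toolbox's defaults): solve one small least-squares problem per point for reconstruction weights that sum to one, then take the bottom nontrivial eigenvectors of (I - W)^T (I - W).

```python
import numpy as np

def lle(X, n_dims=2, k=12, reg=1e-3):
    n = len(X)
    sq = np.sum(X * X, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.inf)
    nbrs = np.argsort(d2, axis=1)[:, :k]

    # Step 1: reconstruction weights for each point from its k nearest neighbors.
    W = np.zeros((n, n))
    for i in range(n):
        Z = X[nbrs[i]] - X[i]                    # neighbors centered on x_i
        C = Z @ Z.T                              # local k x k Gram matrix
        C = C + reg * np.trace(C) * np.eye(k)    # regularize (C is singular if k > dim)
        w = np.linalg.solve(C, np.ones(k))
        W[i, nbrs[i]] = w / w.sum()              # constrain weights to sum to one

    # Step 2: embed so the same weights still reconstruct each point:
    # bottom eigenvectors of M = (I - W)^T (I - W), skipping the constant one.
    I_W = np.eye(n) - W
    eigvals, eigvecs = np.linalg.eigh(I_W.T @ I_W)
    return eigvecs[:, 1:n_dims + 1]

# Example: a small Swiss roll; k and the roll parameters are arbitrary choices.
rng = np.random.default_rng(4)
t = 1.5 * np.pi * (1 + 2 * rng.random(400))
X = np.column_stack([t * np.cos(t), 10 * rng.random(400), t * np.sin(t)])
Y = lle(X, n_dims=2, k=12)
```

The quality of the unrolling is sensitive to k: too few neighbors fragments the manifold, too many bridges across the roll's gap.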

To understand the performance of NLDR methods, we tested LLE (left) and DM (right) on 950 frames of varying facial expressions of 2120 (40 x 53) pixels. After reduction, points close to each other in the reduced space correspond to similar facial expressions. The green and red paths correspond to the green and red sequences of frames.

AASK data reduced to 3-D with DM (Gaussian kernel; left: timesteps = 2, σ = 6; right: timesteps = 2, σ = 20)

WDBC data reduced to 2-D (left) and 1-D (right) with DM (Gaussian kernel; timesteps = 3, σ = 1)

References

Coifman, Ronald R., and Stéphane Lafon. "Diffusion maps." Applied and Computational Harmonic Analysis 21.1 (2006): 5-30.

Isaacs, J. C. "Diffusion map kernel analysis for target classification." OCEANS 2009, MTS/IEEE Biloxi - Marine Technology for Our Future: Global and Local Challenges, Oct. 2009, pp. 1-7.

Roweis, Sam T., and Lawrence K. Saul. "Nonlinear dimensionality reduction by locally linear embedding." Science 290.5500 (2000): 2323-2326.

Shlens, Jonathon. "A tutorial on principal component analysis." Systems Neurobiology Laboratory, University of California at San Diego (2005).

Van der Maaten, Laurens. Matlab Toolbox for Dimensionality Reduction, vers. 0.8.1. Delft University of Technology, Mar. 2013. Web. 16 July 2013.

Figure: Redundant Seed Clustering: randomly initialized seeds (left); surviving seeds counting k (right).

Dimension   | Gaussian: Acc / Se / Sp | Laplacian: Acc / Se / Sp | Polynomial: Acc / Se / Sp
5251 to 2   | 0.519 / 0.492 / 0.553   | 0.517 / 0.522 / 0.523    | 0.560 / 0.535 / 0.598
5251 to 3   | 0.525 / 0.468 / 0.592   | 0.527 / 0.537 / 0.527    | 0.492 / 0.486 / 0.509
5251 to 4   | 0.529 / 0.476 / 0.590   | 0.562 / 0.567 / 0.571    | 0.526 / 0.570 / 0.495
5251 to 5   | 0.508 / 0.479 / 0.548   | 0.530 / 0.558 / 0.514    | 0.518 / 0.532 / 0.516
5251 to 6   | 0.537 / 0.460 / 0.620   | 0.527 / 0.542 / 0.524    | 0.549 / 0.549 / 0.564
5251 to 7   | 0.534 / 0.469 / 0.606   | 0.542 / 0.577 / 0.521    | 0.534 / 0.569 / 0.513
5251 to 8   | 0.545 / 0.496 / 0.603   | 0.520 / 0.579 / 0.473    | 0.519 / 0.602 / 0.451