2013 AMALTHEA REU Program
Analysis of Medical Data Using Dimensionality Reduction Techniques
Robert E. Colgan (a), David E. Gutierrez (b), Jugesh Sundram (c), Gnana Bhaskar Tenali (d)
(a) Computer Science, Columbia University; (b) Computer Science, Florida State University; (c) Mechanical Engineering, Florida Institute of Technology; (d) Mathematical Sciences, Florida Institute of Technology
This material is based upon work/research supported in part by the
National Science Foundation under Grant No. 1263011.
We studied a dataset from the African American Study of Kidney Disease and Hypertension (AASK), consisting of 116 instances with 5251 features (courtesy of Dr. M. Lipkowitz of Georgetown University Hospital and Dr. M. Subasi of FIT). All patients in the dataset suffer from Chronic Kidney Disease (CKD) and are classified as either slow or fast progressors. Features correspond to serum proteomic levels (peaks extracted from raw SELDI-TOF mass spectrometry data).
Applying k-nearest neighbors (k = 3) to the unreduced data yields an overall accuracy of 0.58, with specificity 0.551 and sensitivity 0.623. The table at right shows overall accuracy (Acc), sensitivity (Se), and specificity (Sp) for k-nearest neighbors on DM-reduced data using several kernel functions (timesteps = 2).
Kidney Dataset
NLDR techniques such as DM can transform high-dimensional medical data to as few as one dimension while preserving its important characteristics, allowing us to classify data points with accuracy comparable to that on the raw data. This saves computational time and storage.

DR methods involve feature transformation, so in most cases the reduced data cannot be interpreted in terms of the original features. However, important biomarkers could still be derived from the reduced data.

Classification of certain data can be complicated by its intrinsic features; DR methods can improve the results of classification algorithms.

Future work: study DR methods on kidney datasets to develop suitable biomarkers.
Conclusions
High-dimensional data can be difficult to analyze, almost impossible to visualize, and expensive to process and store due to the so-called "curse of dimensionality." In many cases, the high-dimensional data points all lie on or close to a much lower-dimensional surface, or manifold, so the intrinsic dimensionality of the data may be much lower. In that case the data can be described with fewer dimensions, mitigating the curse of dimensionality. Transforming the high-dimensional representation of the data to a lower dimension without losing important information is the central problem of dimensionality reduction (DR). Many DR methods have been developed, including classical (linear) Principal Component Analysis (PCA) and newer (nonlinear) methods such as Diffusion Maps (DM). Most methods perform well on some types of data but poorly on others. We applied different methods to medical data, including breast tissue tumor data and kidney proteomics data. To evaluate each reduction method, we classified the data in the reduced dimension and measured the accuracy.
Abstract
We illustrate the use of DR methods through the following example (Shlens 2005): consider the motion of a ball attached to a spring. We set up three cameras to capture the motion of the ball. Each camera records an x and y coordinate every frame for N frames, so we obtain N data points, each with 6 dimensions.

By applying a dimensionality reduction technique, we can uncover the one-dimensional representation of the ball's motion, as shown in the figure. This reduced data better reflects the true motion of the ball and is easier to store and process.
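This toy example can be sketched numerically with PCA; the camera geometry and noise level below are assumptions for illustration, not values from Shlens (2005):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate N frames of a ball oscillating along one axis (spring motion).
N = 500
t = np.linspace(0, 20, N)
motion = np.cos(t)  # true 1-D coordinate along the spring axis

# Each of three "cameras" records a 2-D projection of that 1-D motion
# from its own (random) viewing angle, plus small measurement noise.
X = np.empty((N, 6))
for cam in range(3):
    direction = rng.normal(size=2)  # spring axis in this camera's image plane
    X[:, 2 * cam:2 * cam + 2] = np.outer(motion, direction)
X += 0.01 * rng.normal(size=X.shape)

# PCA via SVD of the mean-centered data.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)

print(explained.round(3))  # almost all variance lies on the first component
reduced = Xc @ Vt[0]       # the recovered 1-D motion
```

Because the six measured coordinates are all (noisy) linear functions of one underlying coordinate, the first principal component alone captures essentially the entire signal.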
Reduce high-dimensional data to a lower-dimensional space, while maintaining its most important characteristics, in order to understand its underlying structure and to visualize and process it efficiently.
Objective
Dimensionality Reduction
[Figure: three cameras (A, B, C) each record 2-D positions of the ball over N frames; dimensionality reduction collapses the 6-D recording to the 1-D spring motion.]
The Wisconsin Diagnostic Breast Cancer (WDBC) dataset consists of 569 data points classified as malignant or benign. Each instance contains 30 features describing characteristics such as radius, perimeter, texture, and smoothness of the cell nuclei, taken from images of fine-needle aspirates of breast tissue. We tested the classification accuracy of k-nearest neighbors on the original data and on the data after reducing it to 3-D, 2-D, and 1-D with diffusion maps.

The table to the right shows the overall classification accuracy, sensitivity (proportion of malignant tumors correctly identified), and specificity (proportion of benign tumors correctly identified), with k = 11.
Breast Cancer Dataset
Dimensions Accuracy Sensitivity Specificity
30 (Original) 0.933 0.874 0.968
30 to 3 0.928 0.852 0.973
30 to 2 0.913 0.837 0.959
30 to 1 0.908 0.835 0.952
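The evaluation pipeline (kNN with k = 11, scored by overall accuracy, sensitivity, and specificity) can be sketched in pure NumPy. The two-class Gaussian data below is a synthetic stand-in, not the actual WDBC dataset:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for a labeled dataset: two Gaussian classes
# (0 = "benign", 1 = "malignant") in a handful of features.
n_per_class, d = 200, 5
X = np.vstack([rng.normal(0.0, 1.0, (n_per_class, d)),
               rng.normal(1.5, 1.0, (n_per_class, d))])
y = np.repeat([0, 1], n_per_class)

def knn_predict(X_train, y_train, X_test, k=11):
    """Majority vote over the k nearest training points (Euclidean)."""
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    nearest = np.argsort(d2, axis=1)[:, :k]
    return (y_train[nearest].mean(axis=1) > 0.5).astype(int)

# Simple holdout split.
idx = rng.permutation(len(y))
train, test = idx[:300], idx[300:]
pred = knn_predict(X[train], y[train], X[test], k=11)

acc = (pred == y[test]).mean()
se = (pred[y[test] == 1] == 1).mean()   # sensitivity: class-1 recall
sp = (pred[y[test] == 0] == 0).mean()   # specificity: class-0 recall
print(f"Acc {acc:.3f}  Se {se:.3f}  Sp {sp:.3f}")
```

Reporting sensitivity and specificity separately matters here: on imbalanced medical data, overall accuracy alone can hide a classifier that rarely detects the minority (diseased) class.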
In cluster analysis, most clustering algorithms, such as k-means, require the number of clusters k as input. To determine the optimal k, a "cluster counting" algorithm such as the gap statistic or X-means is employed.

We propose a novel clustering algorithm, Redundant Seed Clustering (RSC), which returns k, the best count of clusters, and simultaneously partitions the data into k clusters by returning the cluster centroids and class labels.
Redundant Seed Clustering
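The poster does not spell out RSC's internals, so as a point of comparison, here is a minimal sketch of the baseline it addresses: plain k-means combined with a simple elbow-style cluster-counting heuristic. The dataset, restart count, and selection rule are all illustrative assumptions, not part of RSC:

```python
import numpy as np

rng = np.random.default_rng(2)

# Three well-separated 2-D clusters; the "right" k here is 3.
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(0, 0.5, (60, 2)) for c in centers])

def kmeans(X, k, iters=50, restarts=10):
    """Plain k-means with random-point initialization; keep the best restart."""
    best_inertia, best_labels = np.inf, None
    for _ in range(restarts):
        centroids = X[rng.choice(len(X), k, replace=False)]
        for _ in range(iters):
            d2 = ((X[:, None] - centroids[None]) ** 2).sum(-1)
            labels = d2.argmin(1)
            for j in range(k):
                if np.any(labels == j):
                    centroids[j] = X[labels == j].mean(0)
        inertia = ((X - centroids[labels]) ** 2).sum()
        if inertia < best_inertia:
            best_inertia, best_labels = inertia, labels
    return best_inertia, best_labels

# Elbow-style count: pick the k whose addition gives the largest
# relative drop in within-cluster scatter.
inertias = {k: kmeans(X, k)[0] for k in range(1, 7)}
best_k = max(range(2, 7), key=lambda k: inertias[k - 1] / inertias[k])
print("estimated k =", best_k)
```

Note the two-stage structure: one run per candidate k, then a separate selection step. RSC's stated goal is to avoid this by returning the count, centroids, and labels in a single procedure.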
A few of the DR techniques we studied are shown below.
DR Methods
Dimensionality reduction
  Linear methods: Principal Component Analysis (PCA), Independent Component Analysis (ICA)
  Nonlinear methods: Locally-Linear Embedding (LLE), Diffusion Maps (DM), Kernel PCA (KPCA)
The intrinsic geometry of data points lying on a nonlinear manifold is better captured by nonlinear dimensionality reduction (NLDR) methods than by linear methods. This point is highlighted by applying PCA (a linear DR method) and LLE (an NLDR method) to the "Swiss roll" dataset.
Nonlinear DR
[Figure: original Swiss roll in 3-D; Swiss roll reduced to 2-D by PCA; Swiss roll reduced to 2-D by LLE.]
NLDR Methods

Diffusion Maps attempts to discover the underlying structure of the data by considering random walks along the surface of the data manifold. It maps points into a "diffusion space," in which the distance between two points reflects the probability of random-walk paths connecting them.
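A minimal diffusion-map sketch in NumPy, assuming a Gaussian kernel and a noisy circle as test data (an illustration of the construction, not the toolbox implementation used for the experiments):

```python
import numpy as np

rng = np.random.default_rng(3)

# Noisy circle: a 1-D manifold embedded in 2-D.
theta = np.sort(rng.uniform(0, 2 * np.pi, 200))
X = np.c_[np.cos(theta), np.sin(theta)] + 0.01 * rng.normal(size=(200, 2))

def diffusion_map(X, sigma=0.5, t=2, n_components=2):
    """Minimal diffusion map: Gaussian kernel -> row-normalized Markov
    matrix -> leading eigenvectors scaled by eigenvalues**t."""
    d2 = ((X[:, None] - X[None]) ** 2).sum(-1)
    K = np.exp(-d2 / (2 * sigma**2))
    P = K / K.sum(axis=1, keepdims=True)   # random-walk transition matrix
    evals, evecs = np.linalg.eig(P)
    order = np.argsort(-evals.real)
    evals, evecs = evals.real[order], evecs.real[:, order]
    # Skip the trivial constant eigenvector (eigenvalue 1).
    return evecs[:, 1:n_components + 1] * evals[1:n_components + 1] ** t

Y = diffusion_map(X, sigma=0.5, t=2)
print(Y.shape)  # (200, 2)
```

The timestep parameter t reported in the tables corresponds to the power applied to the eigenvalues: larger t runs the random walk longer, damping small-scale structure.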
Locally Linear Embedding reconstructs each data point from a linear combination of its k nearest neighbors. Although the data may lie on a nonlinear manifold, LLE assumes that each point and its k nearest neighbors lie on or close to a locally linear patch of that manifold.
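A minimal LLE sketch under the same caveat: the dataset (an arc, i.e. a 1-D curve in 2-D), neighborhood size k, and regularization constant are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)

# Points on a 1-D curve (an arc) embedded in 2-D.
t = np.sort(rng.uniform(0, np.pi, 150))
X = np.c_[np.cos(t), np.sin(t)]

def lle(X, k=8, n_components=1, reg=1e-3):
    """Minimal LLE: reconstruct each point from its k nearest neighbors,
    then embed via the bottom eigenvectors of (I - W)^T (I - W)."""
    n = len(X)
    d2 = ((X[:, None] - X[None]) ** 2).sum(-1)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]           # skip the point itself
        Z = X[nbrs] - X[i]                          # neighbors in local coordinates
        G = Z @ Z.T + reg * np.trace(Z @ Z.T) * np.eye(k)  # regularized Gram matrix
        w = np.linalg.solve(G, np.ones(k))
        W[i, nbrs] = w / w.sum()                    # weights sum to 1
    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    evals, evecs = np.linalg.eigh(M)
    # Discard the constant eigenvector (eigenvalue ~0); keep the next ones.
    return evecs[:, 1:n_components + 1]

Y = lle(X, k=8, n_components=1)
print(Y.shape)  # (150, 1)
```

The reconstruction weights are invariant to rotation, scaling, and translation of each neighborhood, which is why they transfer from the high-dimensional data to the low-dimensional embedding.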
To understand the performance of NLDR methods, we tested LLE (left) and DM (right) on 950 frames of varying facial expressions of 2120 (40 x 53) pixels each. After reduction, points close to each other in the reduced space correspond to similar facial expressions. The green and red paths correspond to the green and red sequences of frames.
AASK data reduced to 3-D with DM (Gaussian kernel; left: timesteps = 2, σ = 6; right: timesteps = 2, σ = 20)
WDBC data reduced to 2-D (left) and 1-D (right) with DM (Gaussian kernel; timesteps = 3, σ = 1)
References

Coifman, Ronald R., and Stéphane Lafon. "Diffusion maps." Applied and Computational Harmonic Analysis 21.1 (2006): 5-30.
Isaacs, J. C. "Diffusion map kernel analysis for target classification." OCEANS 2009, MTS/IEEE Biloxi - Marine Technology for Our Future: Global and Local Challenges, pp. 1-7, 26-29 Oct. 2009.
Roweis, Sam T., and Lawrence K. Saul. "Nonlinear dimensionality reduction by locally linear embedding." Science 290.5500 (2000): 2323-2326.
Shlens, Jonathon. "A tutorial on principal component analysis." Systems Neurobiology Laboratory, University of California at San Diego (2005).
Van der Maaten, Laurens. Matlab Toolbox for Dimensionality Reduction, vers. 0.8.1. Delft University of Technology, Mar. 2013. Web. 16 July 2013.

[RSC figure: randomly initialized seeds; surviving seeds counting k.]
Dimension    Metric  Gaussian  Laplacian  Polynomial
5251 to 2    Acc     0.519     0.517      0.560
             Se      0.492     0.522      0.535
             Sp      0.553     0.523      0.598
5251 to 3    Acc     0.525     0.527      0.492
             Se      0.468     0.537      0.486
             Sp      0.592     0.527      0.509
5251 to 4    Acc     0.529     0.562      0.526
             Se      0.476     0.567      0.570
             Sp      0.590     0.571      0.495
5251 to 5    Acc     0.508     0.530      0.518
             Se      0.479     0.558      0.532
             Sp      0.548     0.514      0.516
5251 to 6    Acc     0.537     0.527      0.549
             Se      0.460     0.542      0.549
             Sp      0.620     0.524      0.564
5251 to 7    Acc     0.534     0.542      0.534
             Se      0.469     0.577      0.569
             Sp      0.606     0.521      0.513
5251 to 8    Acc     0.545     0.520      0.519
             Se      0.496     0.579      0.602
             Sp      0.603     0.473      0.451