

2013 AMALTHEA REU Program

Analysis of Medical Data Using Dimensionality Reduction Techniques
Robert E. Colgan a, David E. Gutierrez b, Jugesh Sundram c, Gnana Bhaskar Tenali d

a Computer Science, Columbia University, b Computer Science, Florida State University, c Mechanical Engineering, Florida Institute of Technology, d Mathematical Sciences, Florida Institute of Technology

This material is based upon work/research supported in part by the

National Science Foundation under Grant No. 1263011.

We studied a dataset from the African American Study of Kidney Disease and Hypertension (AASK), consisting of 116 instances of 5251 features (courtesy: Dr. M. Lipkowitz of Georgetown University Hospital and Dr. M. Subasi of FIT). All patients in the dataset suffer from Chronic Kidney Disease (CKD). Patients are classified as either slow or fast progressors. Features correspond to serum proteomic levels (peaks extracted from raw SELDI-TOF mass spectrometry data).

Applying k-nearest neighbors (k = 3) to the unreduced data yields an overall accuracy of 0.58, with specificity 0.551 and sensitivity 0.623. The table at right shows overall accuracy (Acc), sensitivity (Se), and specificity (Sp) for k-nearest neighbors on DM-reduced data using several kernel functions (timesteps = 2).
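As a concrete sketch of the evaluation above, the snippet below computes overall accuracy, sensitivity, and specificity for leave-one-out k-nearest neighbors (k = 3) in plain NumPy. The AASK data are not public, so a random 116 x 5251 matrix stands in for the proteomic features; the labels, seed, and helper name `knn_predict` are illustrative assumptions, not the study's code.

```python
# Sketch only: random stand-in data, since the AASK dataset is not public.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(116, 5251))   # stand-in for serum proteomic peaks
y = rng.integers(0, 2, size=116)   # stand-in labels: 1 = fast, 0 = slow progressor

def knn_predict(X, y, k=3):
    """Leave-one-out k-NN: classify each point by majority vote of its k
    nearest neighbors (Euclidean distance), excluding the point itself."""
    sq = np.sum(X * X, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)   # pairwise squared distances
    np.fill_diagonal(d2, np.inf)                       # a point may not vote for itself
    nearest = np.argsort(d2, axis=1)[:, :k]
    return (y[nearest].mean(axis=1) > 0.5).astype(int)

pred = knn_predict(X, y, k=3)
tp = np.sum((pred == 1) & (y == 1)); fn = np.sum((pred == 0) & (y == 1))
tn = np.sum((pred == 0) & (y == 0)); fp = np.sum((pred == 1) & (y == 0))

accuracy = (tp + tn) / len(y)
sensitivity = tp / (tp + fn)   # proportion of fast progressors correctly identified
specificity = tn / (tn + fp)   # proportion of slow progressors correctly identified
```

On random labels all three numbers should hover near 0.5, which is also a useful baseline to keep in mind when reading the kernel-comparison table.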

Kidney Dataset

NLDR techniques such as DM can effectively transform high-dimensional medical data to as few as one dimension while maintaining its important characteristics, allowing us to classify data points with accuracy similar to that on the raw data. This saves computational time and space.

DR methods involve feature transformation. Therefore, in most cases an interpretation of the reduced data may not be possible in terms of the original features. However, certain important biomarkers could be devised based on the reduced data.

Classification of certain data can be complicated due to its intrinsic features. DR methods can improve the results of classification algorithms.

Future work: study DR methods on kidney datasets to develop suitable biomarkers.

Conclusions

High-dimensional data can be difficult to analyze, almost impossible to visualize, and expensive to process and store due to the so-called "curse of dimensionality." In many cases, the high-dimensional data points may all lie on or close to a much lower-dimensional surface, or manifold, so that the intrinsic dimensionality of the data is much lower. In that case, the data could be described with fewer dimensions, allowing us to mitigate the curse of dimensionality. Transforming the high-dimensional representation of the data to a lower dimension without losing important information is the central problem of dimensionality reduction (DR). Many DR methods have been developed, including classical (linear) Principal Component Analysis (PCA) and newer (nonlinear) methods such as Diffusion Maps (DM). Most methods perform well on some types of data but poorly on others. We applied different methods to medical data, including breast tissue tumor data and kidney proteomics data. To evaluate the performance of a reduction method, we attempted to classify the data in the reduced dimension and evaluated the accuracy.

Abstract

We illustrate the use of DR methods through the following example (Shlens 2005): consider the motion of a ball attached to a spring. We set up three cameras to capture the motion of the ball. Each camera captures an x and y coordinate every frame for N frames. Thus we obtain N data points, each with 6 dimensions.

By applying a dimensionality reduction technique, we can uncover the one-dimensional representation of the ball's motion, as shown above. This reduced data better reflects the motion of the ball and is also easier to store and process.
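The camera example lends itself to a short numerical check. The sketch below is a hedged reconstruction, not the poster's code: it simulates the 1-D spring motion, projects it through three made-up camera matrices into 6-D, and runs PCA by eigendecomposing the covariance. The first principal component should capture nearly all of the variance.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 500
t = np.linspace(0.0, 20.0, N)
s = np.cos(2.0 * t)                       # the true 1-D displacement

# The ball moves along the x-axis in 3-D; each camera projects the 3-D
# position onto its own (made-up) 2-D image plane, giving 6-D observations.
position = np.column_stack([s, np.zeros(N), np.zeros(N)])
views = []
for _ in range(3):
    P = rng.normal(size=(3, 2))           # an arbitrary camera projection
    views.append(position @ P)
X = np.hstack(views) + 0.01 * rng.normal(size=(N, 6))   # small sensor noise

# PCA: eigendecompose the covariance of the centered data.
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / (N - 1))
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]      # descending order

explained = eigvals[0] / eigvals.sum()    # variance captured by PC 1
reduced = Xc @ eigvecs[:, :1]             # the recovered 1-D motion
```

With small sensor noise, `explained` comes out close to 1 and `reduced` matches the original displacement up to sign and scale.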

Reduce high-dimensional data to a lower-dimensional space, while maintaining the most important characteristics, in order to understand its underlying structure and visualize and process it efficiently.

Objective

Dimensionality Reduction

Figure: Cameras A, B, and C each record N frames of the ball's 2-D image coordinates; dimensionality reduction recovers the one-dimensional motion.

The Wisconsin Diagnostic Breast Cancer (WDBC) dataset consists of 569 data points classified as malignant or benign. Each instance contains 30 features describing characteristics such as radius, perimeter, texture, and smoothness of the cell nuclei, taken from images of fine-needle aspirates of breast tissue. We tested the classification accuracy of k-nearest neighbors on the original data and the data after reducing it to 3-D, 2-D, and 1-D with diffusion maps.

The table to the right shows the overall classification accuracy, sensitivity (proportion of malignant tumors correctly identified), and specificity (proportion of benign tumors correctly identified), with k = 11.

Breast Cancer Dataset

Dimensions Accuracy Sensitivity Specificity

30 (Original) 0.933 0.874 0.968

30 to 3 0.928 0.852 0.973

30 to 2 0.913 0.837 0.959

30 to 1 0.908 0.835 0.952

In cluster analysis, most clustering algorithms, like k-means, depend on the number of clusters k as input. To determine the optimal k, a "cluster counting" algorithm such as the gap statistic or X-means is employed.
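To make the cluster-counting idea concrete, here is a hedged sketch of the gap statistic with a tiny Lloyd's k-means: compare the log within-cluster dispersion on the data against uniform reference data and pick the k with the largest gap. This is not the RSC algorithm; the farthest-point initialization and the argmax selection rule are simplifications chosen for this illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def farthest_point_init(X, k):
    """Deterministic seeding: start at the first point, then repeatedly add
    the point farthest from all centers chosen so far."""
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    return np.array(centers)

def kmeans_dispersion(X, k, iters=50):
    """Lloyd's algorithm; returns the within-cluster sum of squares W_k."""
    centers = farthest_point_init(X, k)
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return ((X - centers[labels]) ** 2).sum()

def gap_best_k(X, k_max=6, n_ref=5):
    """Gap(k) = mean log W_k on uniform reference data minus log W_k on the
    data; return the k with the largest gap (a simplified selection rule)."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps = []
    for k in range(1, k_max + 1):
        ref = np.mean([np.log(kmeans_dispersion(rng.uniform(lo, hi, X.shape), k))
                       for _ in range(n_ref)])
        gaps.append(ref - np.log(kmeans_dispersion(X, k)))
    return int(np.argmax(gaps)) + 1

# Three well-separated blobs: the gap statistic should report k = 3.
X = np.vstack([rng.normal(c, 0.05, size=(60, 2))
               for c in ([0, 0], [6, 0], [0, 6])])
best_k = gap_best_k(X)
```

The within-cluster dispersion drops sharply once k reaches the true cluster count and only slowly after, while the uniform reference declines smoothly, so the gap peaks at the true k.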

A novel clustering algorithm, Redundant Seed Clustering (RSC), is proposed, which returns k, the best count of clusters, and simultaneously partitions the data into k clusters by returning the cluster centroids and class labels.

Redundant Seed Clustering

A few of the DR techniques we studied are shown below.

DR Methods

Dimensionality reduction
  Linear methods: Principal Component Analysis (PCA), Independent Component Analysis (ICA)
  Nonlinear methods: Locally Linear Embedding (LLE), Diffusion Maps (DM), Kernel PCA (KPCA)

The intrinsic geometry of data points lying on a nonlinear manifold is better captured by applying nonlinear dimensionality reduction (NLDR) methods than linear methods. This point is highlighted by applying PCA (a linear DR method) and LLE (an NLDR method) to the "Swiss roll" dataset.

Nonlinear DR

Figure: the original Swiss roll in 3-D; the Swiss roll reduced to 2-D by PCA; the Swiss roll reduced to 2-D by LLE.

NLDR Methods

Diffusion Maps attempts to discover the underlying structure of the data by considering random walks along the surface of the data manifold. It maps points into a "diffusion space" where distance is based on the probability of paths.
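The random-walk construction can be sketched in a few lines. The following is a minimal, hedged version in plain NumPy, without the density normalization or other refinements a full implementation may apply: Gaussian affinities are row-normalized into a Markov transition matrix, and the leading nontrivial eigenvectors, scaled by their eigenvalues raised to the number of timesteps, give the embedding.

```python
import numpy as np

def diffusion_map(X, n_dims=2, sigma=1.0, timesteps=2):
    sq = np.sum(X * X, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)   # pairwise squared distances
    K = np.exp(-d2 / (2.0 * sigma ** 2))               # Gaussian affinities
    P = K / K.sum(axis=1, keepdims=True)               # row-stochastic Markov matrix
    eigvals, eigvecs = np.linalg.eig(P)
    eigvals, eigvecs = eigvals.real, eigvecs.real      # spectrum is real: P is similar to a symmetric matrix
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Skip the trivial constant eigenvector (eigenvalue 1); weight the rest
    # by eigenvalue^timesteps so slowly decaying diffusion directions dominate.
    return eigvecs[:, 1:n_dims + 1] * eigvals[1:n_dims + 1] ** timesteps

# Toy usage on random data; sigma and timesteps here are arbitrary choices.
rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))
Y = diffusion_map(X, n_dims=2, sigma=2.0, timesteps=2)
```

Raising the eigenvalues to a power is what the `timesteps` parameter in the tables and captions controls: larger powers damp fast-mixing directions more aggressively.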

Locally Linear Embedding reconstructs each data point from a linear combination of its k nearest neighbors. Although the data may lie on a nonlinear manifold, LLE assumes that each point and its k nearest neighbors lie on or close to a linear manifold at a local level.
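The two LLE steps described here translate directly into code. The sketch below is a from-scratch illustration (the parameter values and regularization constant are assumptions, not any toolbox's defaults): solve one small least-squares problem per point for reconstruction weights that sum to one, then take the bottom nontrivial eigenvectors of (I - W)^T (I - W).

```python
import numpy as np

def lle(X, n_dims=2, k=12, reg=1e-3):
    n = len(X)
    sq = np.sum(X * X, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.inf)
    nbrs = np.argsort(d2, axis=1)[:, :k]

    # Step 1: reconstruction weights for each point from its k nearest neighbors.
    W = np.zeros((n, n))
    for i in range(n):
        Z = X[nbrs[i]] - X[i]                    # neighbors centered on x_i
        C = Z @ Z.T                              # local k x k Gram matrix
        C = C + reg * np.trace(C) * np.eye(k)    # regularize (C is singular if k > dim)
        w = np.linalg.solve(C, np.ones(k))
        W[i, nbrs[i]] = w / w.sum()              # constrain weights to sum to one

    # Step 2: embed so the same weights still reconstruct each point:
    # bottom eigenvectors of M = (I - W)^T (I - W), skipping the constant one.
    I_W = np.eye(n) - W
    eigvals, eigvecs = np.linalg.eigh(I_W.T @ I_W)
    return eigvecs[:, 1:n_dims + 1]

# Example: a small Swiss roll; k and the roll parameters are arbitrary choices.
rng = np.random.default_rng(4)
t = 1.5 * np.pi * (1 + 2 * rng.random(400))
X = np.column_stack([t * np.cos(t), 10 * rng.random(400), t * np.sin(t)])
Y = lle(X, n_dims=2, k=12)
```

The quality of the unrolling is sensitive to k: too few neighbors fragments the manifold, too many bridges across the roll's gap.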

To understand the performance of NLDR methods, we tested LLE (left) and DM (right) on 950 frames of varying facial expressions of 2120 (40 x 53) pixels. After reduction, points close to each other in the reduced space correspond to similar facial expressions. The green and red paths correspond to the green and red sequences of frames.

AASK data reduced to 3-D with DM (Gaussian kernel; left: timesteps = 2, σ = 6; right: timesteps = 2, σ = 20)

WDBC data reduced to 2-D (left) and 1-D (right) with DM (Gaussian kernel; timesteps = 3, σ = 1)

References

Coifman, Ronald R., and Stéphane Lafon. "Diffusion maps." Applied and Computational Harmonic Analysis 21.1 (2006): 5-30.

Isaacs, J. C. "Diffusion map kernel analysis for target classification." OCEANS 2009, MTS/IEEE Biloxi - Marine Technology for Our Future: Global and Local Challenges, Oct. 2009, pp. 1-7.

Roweis, Sam T., and Lawrence K. Saul. "Nonlinear dimensionality reduction by locally linear embedding." Science 290.5500 (2000): 2323-2326.

Shlens, Jonathon. "A tutorial on principal component analysis." Systems Neurobiology Laboratory, University of California at San Diego (2005).

Van der Maaten, Laurens. Matlab Toolbox for Dimensionality Reduction, vers. 0.8.1. Delft University of Technology, Mar. 2013. Web. 16 July 2013.

Figure: Redundant Seed Clustering: randomly initialized seeds (left); surviving seeds counting k (right).

Dimension   | Gaussian: Acc / Se / Sp | Laplacian: Acc / Se / Sp | Polynomial: Acc / Se / Sp
5251 to 2   | 0.519 / 0.492 / 0.553   | 0.517 / 0.522 / 0.523    | 0.560 / 0.535 / 0.598
5251 to 3   | 0.525 / 0.468 / 0.592   | 0.527 / 0.537 / 0.527    | 0.492 / 0.486 / 0.509
5251 to 4   | 0.529 / 0.476 / 0.590   | 0.562 / 0.567 / 0.571    | 0.526 / 0.570 / 0.495
5251 to 5   | 0.508 / 0.479 / 0.548   | 0.530 / 0.558 / 0.514    | 0.518 / 0.532 / 0.516
5251 to 6   | 0.537 / 0.460 / 0.620   | 0.527 / 0.542 / 0.524    | 0.549 / 0.549 / 0.564
5251 to 7   | 0.534 / 0.469 / 0.606   | 0.542 / 0.577 / 0.521    | 0.534 / 0.569 / 0.513
5251 to 8   | 0.545 / 0.496 / 0.603   | 0.520 / 0.579 / 0.473    | 0.519 / 0.602 / 0.451