
kluster: An Efficient Scalable Procedure for Approximating the Number of Clusters in

Unsupervised Learning Hossein Estiri,1,2,3 Behzad A Omran,4 Shawn N Murphy1,2,3

1Harvard Medical School; 2Massachusetts General Hospital; 3Partners Healthcare, Boston, MA; 4Construction System Management, The Ohio State University, Columbus, OH

Corresponding Author: Hossein Estiri

E-mail: hestiri at mgh dot harvard dot edu

PRE-PRINT. Download the final publication from:

https://doi.org/10.1016/j.bdr.2018.05.003

Abstract

The majority of the clinical observation data stored in large-scale Electronic Health

Record (EHR) research data networks are unlabeled. Unsupervised clustering can provide

invaluable tools for studying patient sub-groups in these data. Many of the popular

unsupervised clustering algorithms are dependent on identifying the number of

clusters. Multiple statistical methods are available to approximate the number of

clusters in a dataset. However, available methods are computationally inefficient

when applied to large amounts of data. Scalable analytical procedures are needed to

extract knowledge from large clinical datasets. Using both simulated, clinical, and

public data, we developed and tested the kluster procedure for approximating the

number of clusters in a large clinical dataset. The kluster procedure iteratively

applies four statistical cluster number approximation methods to small subsets of

data that were drawn randomly with replacements and recommends the most frequent and

mean number of clusters resulted from the iterations as the potential optimum number

of clusters. Our results showed that the kluster’s most frequent product that

iteratively applies a model-based clustering strategy using Bayesian Information

Criterion (BIC) to samples of 200-500 data points, through 100 iterations, offers a

reliable and scalable solution for approximating the number of clusters in

unsupervised clustering. We provide the kluster procedure as an R package.

1. Introduction

The high throughput of Electronic Health Records (EHR) from multi-site clinical data

repositories provides numerous opportunities for novel data-driven healthcare


discovery. EHR data contain unlabeled clinical observations (e.g., laboratory result

values) that can be used to characterize patients with similar phenotypic

characteristics, using unsupervised learning. In healthcare research, unsupervised

learning has been applied for clustering and/or dimensionality reduction. In

unsupervised learning, the machine develops a formal framework to build

representations of the input data to facilitate further prediction and/or decision

making [1]. The goal in unsupervised clustering is to partition data points into

clusters with high intra-class similarities and low inter-class similarities [2,3].

Unsupervised learning is widely used for applications in computer vision, and in

particular for image segmentation. In healthcare research, unsupervised clustering

has been applied to tasks such as image/tissue segmentation, disease/tumor subtype clustering, and dimensionality reduction, but most commonly it has been used in genomics for gene/cell expression and RNA sequencing analyses.

Many of the popular unsupervised clustering algorithms (e.g., k-means) are dependent

on setting initial parameters, most importantly the number of clusters, k. Initial

parameters play a key role in determining intra-cluster cohesion (compactness) and

inter-cluster separation (isolation) of an unsupervised clustering algorithm.

Initializing the number of clusters for the unsupervised clustering algorithm to begin

with is a challenging problem, for which available solutions are often ad hoc or based

on expert judgment [4–6]. Over the past few decades, the statistics literature has

presented different solutions that apply different quantitative indices to this

problem. Some of the notable statistical solutions are the Calinski and Harabasz index

[7], silhouette statistic [8,9], gap statistic [10], and the model-based approach

using approximate Bayes factor [6,11]. In addition, iterative clustering algorithms

such as the Affinity Propagation algorithm [12], PAM (Partitioning Around Medoids)

[8], and Gaussian-means (G-means) [5] have also been used in the Machine Learning

community for identifying number of clusters in a dataset.

These statistical approaches primarily compare the result of clustering with different

cluster numbers and recommend the best number of clusters for a dataset. Further,

these techniques were mostly developed for conventional statistical analysis, where

the number of data points does not often exceed a few thousand. As a result, available

statistical approaches either involve making strong parametric assumptions, are

computation-intensive, or both [4]. Especially when dealing with large amounts of data,

available statistical solutions are computationally inefficient. Although this is a

general issue across the board, intensive computing requirements to conduct

unsupervised clustering becomes a more prominent issue in clinical research

settings (e.g., research data networks and academic institutions), where

computational capacities are often limited. Applying unsupervised clustering to


large amounts of clinical observations data requires scalable methods for identifying

the number of clusters. In this work, we test and present an efficient scalable

procedure, kluster, for approximating the number of clusters in unsupervised

clustering. We have made kluster available as an R package.

2. Material and methods

Selection of the number of clusters, k, directly impacts the clustering “accuracy”

criteria, including intra-cluster cohesion and inter-cluster separation. Intuitively,

increasing the initial number of clusters should decrease the clustering error. The

highest error is obtained when the data is clustered into only one partition (i.e.,

maximum compression, minimum accuracy) and the lowest error is when k equals the

number of data points (i.e., minimum compression, maximum accuracy). When prior expert

judgement is unavailable, the optimal choice for the number of clusters can be obtained

at a balanced representation of the data between the minimum and maximum compression

[13]. Multiple statistical approaches have been developed to approximate the number

of clusters. Almost all of the available methods use different statistics for

evaluating clustering performances iteratively over different cluster numbers.

The silhouette coefficient is a measure of cluster assignment accuracy based on comparing “tightness” (how far a point is from other points in the same cluster) and

“separation” (how close a point is to its neighboring clusters) [14]. Through

iterative clustering over a range of cluster numbers, an optimal k should maximize

the average silhouette coefficient. [9] The Elbow method is another iterative approach

for identifying the optimal k. In the Elbow method, the total within-cluster sum of

square (WSS) is calculated for each k through iterative clustering with different

cluster numbers. The optimum number of clusters is identified by plotting the WSS

versus the cluster numbers and finding the location of a bend (knee) in the plot. Two

main problems with the Elbow method (other than being computationally intensive) are

that it still requires expert judgement, and that even when expert judgement is available, sometimes there is no clear elbow. Furthermore, it is often genuinely ambiguous to identify the number of clusters, even visually. The gap statistic method was developed

to standardize comparisons in the Elbow method. The gap statistic compares the normalized within-cluster sum of squares for each k against its expected value under a null reference distribution with no obvious clustering. The optimal k is where the observed WSS falls farthest below the null distribution curve [10].
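The silhouette and Elbow searches described above can be sketched in a few lines. The following snippet uses Python and scikit-learn as a stand-in for the R tooling used in the paper; the toy blob data and the candidate range of k are our own illustrative assumptions:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data: four synthetic blobs (illustrative only, not the paper's data).
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

wss, sil = {}, {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss[k] = km.inertia_                      # total within-cluster sum of squares (Elbow)
    sil[k] = silhouette_score(X, km.labels_)  # average silhouette coefficient

# Silhouette rule: pick the k maximizing the average coefficient.
# Elbow rule: inspect wss for a bend, which, as noted above, needs judgement.
best_k = max(sil, key=sil.get)
```

Note that wss decreases monotonically with k, which is exactly why the Elbow method needs a visual bend rather than a simple optimum.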


The Bayesian Information Criterion (BIC) and the Calinski and Harabasz index (CAL)

methods are also popular index-based methods, meaning that they aim to maximize

indices computed through iterative clustering. BIC can be applied iteratively to approximate the number of clusters [6,11]. This method is part of a

comprehensive model-based clustering strategy, in which a maximum number of clusters

and an initial set of mixture models are applied to hierarchical agglomeration and

expectation–maximization algorithms. The BIC from the resulting models are computed

and an optimal number of clusters, k, is identified from the model with a decisive

maximum of the BIC [6]. We used the implementation of BIC algorithm in R package

‘mclust’ [15]. The Calinski and Harabasz index [7], which is also known as variance

ratio criterion, approximates the optimal number of clusters by maximizing CH from Equation 1.

Equation 1: CH_k = (BGSS / (k − 1)) / (WGSS / (N − k))

where k is the number of clusters, N is the total number of data points, BGSS is

the overall between-group dispersion and WGSS is the sum of within-cluster dispersions

for all the clusters. We used the implementation of the CAL algorithm in the R package ‘vegan’

[16].
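Both index-based selections can be sketched compactly. The snippet below uses scikit-learn's GaussianMixture.bic and calinski_harabasz_score as Python stand-ins for the ‘mclust’ and ‘vegan’ implementations; the synthetic data and the candidate range of k are our assumptions, and note that scikit-learn's BIC is minimized while mclust's sign convention is maximized:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score
from sklearn.mixture import GaussianMixture

# Three tight, well-separated 2-D clusters (synthetic, illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, size=(100, 2)) for c in [(0, 0), (6, 6), (0, 6)]])

# Model-based selection: fit a Gaussian mixture per candidate k, keep the best BIC.
bic = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
       for k in range(1, 7)}
k_bic = min(bic, key=bic.get)

# Variance-ratio (Calinski and Harabasz) selection over k-means partitions.
ch = {k: calinski_harabasz_score(X, KMeans(n_clusters=k, n_init=10,
                                           random_state=0).fit_predict(X))
      for k in range(2, 7)}
k_cal = max(ch, key=ch.get)
```

On such well-separated data both criteria agree on three clusters; on messier data they can and do diverge, which motivates comparing several methods as this paper does.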

In addition to the index-based methods that can be iteratively computed and optimized,

there are iterative clustering algorithms that do not rely on the initial

approximation of the optimal number of clusters. These algorithms can be used to

identify k for use in other clustering algorithms. Partitioning Around Medoids (PAM)

[9] is a clustering algorithm that can self-identify k, and thus, can be used to

identify the optimal number of clusters. The PAM algorithm searches for a sequence of

centroids for clusters (called medoids) to reduce the effect of outliers. Each

observation is then assigned to its nearest medoids to generate k number of clusters

and a dissimilarity matrix is computed as the basis to re-adjust the medoids. The

process is iterated until there is no change in the medoids [9]. We used the

implementation of PAM algorithm in R package ‘fpc’ [17]. The Affinity Propagation

(AP) algorithm uses measures of similarity between pairs of data points in search of

a “high-quality set of exemplar” data points, by iteratively exchanging real-valued

messages between them. In the AP algorithm, all the data points are considered

potential exemplars and viewed as nodes in the networks. These nodes exchange messages

with each other to generate a better cluster [12]. We used the implementation of the AP algorithm in the R package ‘apcluster’ [18].
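Because AP does not need k up front, the number of exemplars it settles on can itself serve as the cluster number estimate. A scikit-learn sketch (a Python stand-in for the R package; the toy data are our assumption):

```python
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

# Every point starts as a candidate exemplar; message passing selects a subset.
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=2)
ap = AffinityPropagation(random_state=0).fit(X)
k_ap = len(ap.cluster_centers_indices_)  # number of exemplars = estimated clusters
```

The number of clusters AP finds is sensitive to its preference parameter (by default the median pairwise similarity), so the estimate should be read as one vote among several methods rather than ground truth.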

Regardless of the accuracy of the k optimized by these methods, we argue that they

may not scale to large datasets in their original form. Electronic Health Records

(EHR) on clinical observations for a single patient can add up to hundreds of rows of


data. For an average provider, clinical observation data often reach tens (or even thousands) of millions of rows. A recommended solution for this issue is to apply

the method to a random sample of the data (as a training set) and use discriminant

analysis to make inferences about the full population [6]. In this paper we develop,

test, and present kluster, a procedure that uses iterative sampling with replacement

to approximate k with application to clinical data.

3. kluster procedure

The Bayesian Information Criterion (BIC), Calinski and Harabasz index (CAL),

Partitioning Around Medoids (PAM), and Affinity Propagation (AP) are popular methods

for approximating the number of clusters, k, which also have well-maintained

implementations in R. We use these four methods as representatives of the available

statistical methods for the purpose of testing the principal hypothesis of this study

and employ them as baseline algorithms to develop and test our proposed procedure.

We argue that applying cluster number approximation methods to an entire dataset

is computationally inefficient and, more importantly, does not scale up to large

datasets. As a result, it is computationally expensive (or currently

impossible) to incorporate such algorithms within recurring unsupervised

learning pipelines in most clinical research institutions. It is also possible

that applying these methods to the entirety of data points will increase the

likelihood of overfitting, and therefore impact the precision of the

recommended clusters number approximation. We hypothesize that employing a

sampling strategy can scale up the cluster number approximation processes

without significantly diminishing performance. To evaluate this hypothesis, we

conducted experimental analyses using the BIC, PAM, AP, and CAL methods. To

conduct simulations for the experimental analyses, we developed functions in

R statistical language, in which we also implemented a procedure of each method

based on iterative sampling. We call this package kluster.

Through kluster, we relax the computational requirements by applying a cluster number

approximation method in iterations to samples of data that were drawn at random and

with replacement. Suppose a population parameter k, the optimal number of clusters, is sought. The kluster procedure produces an estimate of k as follows:

1. Collect a random sample of size n with replacement from the database, which yields the data (X_1, X_2, …, X_n).

Drawing random samples of n data points from the data makes the X_i's i.i.d. samples from the distribution of a random variable X, which we hypothesize will diminish the chance of over-fitting.


2. Apply a cluster number approximation algorithm ω to the sampled data (X_1, X_2, …, X_n) to identify the number of clusters, k̂, in the sample. Currently, ω ∈ {BIC, AP, CAL, PAM}.

3. Repeat steps 1 and 2 i times to produce a vector of estimates (k̂_1, k̂_2, …, k̂_i).

4. Calculate the mean and the most frequent value (mode) of (k̂_1, k̂_2, …, k̂_i).

After the kluster procedure is completed, it provides two products for each of the four cluster number approximation methods: (1) the most frequent approximated number of clusters, and (2) the mean approximated number of clusters. We refer to these as kluster’s most frequent and mean products, respectively.

Equation 1: kluster’s mean product on ω = mean(k̂_1, k̂_2, …, k̂_i)

Equation 2: kluster’s most frequent product on ω = mode(k̂_1, k̂_2, …, k̂_i)
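The four steps above can be sketched in a few lines. The following Python sketch is ours (the published implementation is an R package; the function and argument names here are illustrative, not kluster's actual API), with the approximation method ω passed in as a function:

```python
import random
from statistics import mean, mode

def kluster(data, approx_k, n=100, i=100, seed=0):
    """Sketch of the kluster procedure (illustrative, not the R package's API).

    data     : sequence of data points
    approx_k : cluster number approximation method (the role of omega: BIC/AP/CAL/PAM)
    n        : sample size per iteration, drawn with replacement
    i        : number of iterations
    Returns (most frequent product, mean product).
    """
    rng = random.Random(seed)
    ks = []
    for _ in range(i):
        sample = rng.choices(data, k=n)   # step 1: random sample with replacement
        ks.append(approx_k(sample))       # step 2: approximate k on the sample
    # steps 3-4: collect the i estimates and report their mode and mean
    return mode(ks), mean(ks)
```

In practice approx_k would wrap a call to one of the four baseline methods; plugging in a method that always returns 3 trivially yields both products equal to 3.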

For example, when kluster applies the BIC method to samples of a user-defined size over a user-defined number of iterations, it will produce a most frequent product

on BIC and a mean product on BIC. Through the next sections, we describe results of

comparing the approximated number of clusters by each of the four methods and their

corresponding kluster’s products on simulated, clinical, and public datasets.

3.1. Data

We argued that applying the original cluster number approximation methods on the

entire database is computationally inefficient, and therefore, does not scale up to

large amounts of data. Our hypothesis was that employing a sampling strategy

would scale up cluster number approximation without significantly diminishing

performance. We evaluated this hypothesis and developed an efficient scalable

procedure, kluster, to optimize cluster number approximation for unsupervised

clustering. We generated two sets of simulated datasets (first set contains small

datasets and second set contains large datasets) with different cluster compositions

– i.e., different numbers of clusters and separation values – using the clusterGeneration package in R [19], which provides functions for generating random clusters with specific

degrees of separation (value for separation index between any cluster and its nearest

neighboring cluster) and numbers of clusters. Each set of simulation datasets consists

of 91 datasets in comma separated values (csv) format (total of 182 csv files) with

3-15 clusters and 0.1 to 0.7 separation values. Separation values can range between

(−0.999, 0.999), where a higher separation value indicates cluster structure with

more separable clusters (Figure 1). Both the simulated datasets and results are

provided as supplementary files and on the Harvard Dataverse Network and Mendeley

Data (links will be provided after the peer review).
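A rough Python analogue of this simulation grid can be assembled with scikit-learn's make_blobs; note that cluster_std is only a crude inverse proxy for clusterGeneration's separation index, not the same measure, and the sample sizes and spread levels below are our own choices:

```python
from sklearn.datasets import make_blobs

datasets = {}
for n_clusters in range(3, 16):      # 3-15 clusters, mirroring the simulation grid
    for std in (0.5, 1.0, 2.0):      # three illustrative spread levels
        X, y = make_blobs(n_samples=200 * n_clusters, centers=n_clusters,
                          cluster_std=std, random_state=42)
        datasets[(n_clusters, std)] = (X, y)
```

Keeping the true labels y alongside each X is what later allows error ratios against the known cluster count to be computed.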


Figure 1: Seven example simulated datasets with five clusters and different

separation values.

We then tested our proposed procedure on clinical data from Partners HealthCare’s

Research Patient Data Registry (RPDR) [20], as well as four public datasets. In

the following sections, we first describe the results from experimental analyses and

then proceed to the results from application of the kluster procedure to clinical and

public data.

4. Results

4.1. Experimental Results on First Set of Simulation Data

The first set of simulation data contained small datasets, with numbers of rows ranging from 600 to 3,000. We used these datasets to evaluate the performance of the four

cluster number approximation methods as well as their corresponding kluster

implementation.

4.1.1. Processing time for original algorithms

To examine computational intensiveness of running statistical methods for

approximating the optimal number of clusters in data, we stored the processing time

requirement for applying the four methods to the first set of datasets. We used the

results to estimate processing time requirement for running each algorithm on datasets

of up to 100,000 data points, using a third degree regression model (Equation 3).

Equation 3: y = β0 + β1·n + β2·n² + β3·n³ + ε

where y is the processing time in minutes and n is the number of data points in the dataset (Figure 2). As the figure shows, the processing time drastically increased

for three of the four examined algorithms as the size of the database increased. The

BIC, AP, and PAM methods respectively required the longest time to approximate the

cluster numbers on the simulated data. According to our estimates, even if enough

memory was available, it would take about 400 minutes for BIC method, and 200 minutes

for the AP and PAM methods to approximate the optimal number of clusters in a 2-

dimensional dataset with 100,000 rows. Nevertheless, the BIC algorithm cannot handle

datasets larger than 50,000 data points. The CAL algorithm was less sensitive to the size of the data, although it would still take more than 30 minutes to apply the CAL


method to the 100k dataset – we later confirmed this processing time by applying the

CAL method to a dataset with 90,000 data points.
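The Equation 3 extrapolation amounts to an ordinary least-squares fit of a third-degree polynomial to (dataset size, processing time) pairs. A numpy sketch, using made-up timing values rather than the paper's measurements:

```python
import numpy as np

sizes = np.array([600.0, 1000, 1500, 2000, 2500, 3000])  # rows per dataset
minutes = np.array([0.1, 0.4, 1.2, 2.8, 5.5, 9.6])       # hypothetical timings

coeffs = np.polyfit(sizes, minutes, deg=3)  # [b3, b2, b1, b0] of Equation 3
model = np.poly1d(coeffs)
est_100k = model(100_000)                   # extrapolated minutes at 100,000 rows
```

With roughly cubic growth, extrapolating two orders of magnitude beyond the fitted range yields estimates in the hundreds of thousands of minutes, which is exactly why such extrapolations should only be read as order-of-magnitude indicators.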

Figure 2: Estimated processing time for running the four cluster number

approximation methods over the size of the database.

4.1.2. Accuracy of cluster number approximation

To evaluate our hypothesis and test our recommended kluster procedure, we conducted

further analyses on the first set of simulated datasets. We conveniently specified the kluster procedure parameters to use samples of 100 data points (n = 100) and 100 iterations (i = 100). We iterated the entire process 25 more times, which resulted in a total of 25,000 simulation iterations (25 × 100). For evaluating the performance

of each method and the kluster procedure across the 91 simulation datasets, we first

looked at the distribution of a normalized index for estimation error. We created the

normalized index by taking the difference between the estimated cluster number and

the actual cluster number and dividing it by the actual cluster number (Equation 4).

Equation 4: err_ij = (η_ij − N_j) / N_j

where err_ij is the ratio of the error to the actual number of clusters for algorithm i on dataset j, η_ij is the estimated number of clusters by algorithm i on dataset j, and N_j is the actual number of clusters in dataset j. For example, if a method approximates 12 clusters in a 10-cluster dataset, the error ratio (err) will be 0.2.
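Equation 4 reduces to a one-liner; this sketch (ours) reproduces the worked example:

```python
def err_ratio(estimated, actual):
    """Equation 4: signed error relative to the true number of clusters."""
    return (estimated - actual) / actual

err = err_ratio(12, 10)  # the example above: 12 estimated in a 10-cluster dataset
```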

Figure 3 demonstrates the density function for the error ratio of the estimated to the actual number of clusters (err). Vertical lines on the plots show the boundary for 95 percent accuracy. Among the original methods (plots on the left), the CAL and BIC

algorithms had the highest probability of estimating the number of clusters with


utmost accuracy. Distribution of the error ratio results from the kluster procedures

(except for kluster’s mean product on CAL) also peaked at 𝑒𝑟𝑟 = 0.

* dotted vertical lines delineate better-than-95% approximation of the number of

clusters

Figure 3: Density functions for the ratio of error to the actual number of clusters (err) by method.

To further evaluate the error ratios, we created a heatmap of the frequency of results with better than 95 percent estimation of the number of clusters across

number of clusters and separation values (Figure 4). Results showed that the

kluster’s most frequent product on BIC provided the best approximation of the number

of clusters across the datasets with different numbers of clusters and separation

values – methods on Figure 4 are ordered based on performance – i.e., frequency of

better-than-95% approximation.


* The heatmap on the left shows the ratio of the cluster number approximation accuracies that were better than 95% over datasets with a known cluster number and different separation values – the ratio denominator for each cell is 7. The heatmap on the right shows the ratio of more-than-90% accuracy over datasets with a given separation value and different cluster numbers – the ratio denominator for each cell equals 13.

Figure 4: Goodness of cluster number approximation by method, cluster number,

and separation values.

The heatmap plot across separation values in Figure 4 (plot on the right) shows

that the kluster’s most frequent product on BIC also held the best overall performance in approximating the number of clusters in datasets with different cluster separation values. Nevertheless, to statistically evaluate the difference

in performances obtained from each algorithm, we performed non-parametric

hypothesis testing.

4.1.3. Non-parametric hypothesis testing

To compare the results obtained from implementing the methods on simulation

datasets, we applied non-parametric and post-hoc tests based on the machine learning

experimental scenarios presented by García et al. (2009 and 2010) [21,22] and Santafe

et al. (2015) [23]. We used the ‘scmamp’ package [24] in R to perform hypothesis

testing. The goal is to evaluate whether the error indices (𝑒𝑟𝑟) we obtained in the evaluation process of each algorithm would provide enough statistical evidence that

the algorithms have different performances.

We first applied the Iman and Davenport omnibus test to analyze all the pair-wise

comparisons in order to detect whether at least one of the algorithms performed

differently than the others. The test resulted in a corrected Friedman's chi-

squared of 39.514 and a p-value < 0.0001, indicating that at least one algorithm

performed differently. We therefore proceeded with the post-hoc analysis of the


results. Second, we applied the Wilcoxon signed-ranks test [25], with Holm p-value

adjustment method [26], for pair-wise comparison of the kluster procedure results,

and the Friedman post-hoc test [27], with Bergmann and Hommel’s correction [28],

for comparing and ranking all algorithms [24,29].
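The shape of this testing pipeline can be sketched with scipy; note that scipy provides only the plain Friedman and Wilcoxon signed-ranks tests, not the Iman-Davenport correction or Bergmann-Hommel post-hoc available through R's ‘scmamp’, and the accuracy values below are made up for illustration:

```python
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

# Hypothetical accuracy indices for three methods over 91 datasets (made up).
rng = np.random.default_rng(0)
acc_a = rng.uniform(0.85, 1.00, 91)
acc_b = acc_a - rng.uniform(0.00, 0.05, 91)  # consistently slightly worse than a
acc_c = rng.uniform(0.50, 0.80, 91)          # clearly worse

stat, p_omnibus = friedmanchisquare(acc_a, acc_b, acc_c)  # any method different?
w_stat, p_ac = wilcoxon(acc_a, acc_c)                     # pairwise follow-up
```

As in the paper, the omnibus test gates the analysis: only when it rejects the null of equal performance does the pairwise post-hoc comparison proceed.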

Applying the Wilcoxon signed-ranks test to the normalized error index (err) showed that kluster’s mean and most frequent products are statistically indistinguishable for all four algorithms. On BIC and PAM, we found that kluster’s most frequent products were also statistically indistinguishable from their corresponding original algorithms’ results – at p-value < 0.01 for the BIC method with Holm adjustment. Wilcoxon signed-ranks test results are provided in the Appendix table.

For ranking the algorithms, we computed an absolute accuracy index (acc) by normalizing the error ratios (err) to the range 0 to 1 (Equation 5) – performance improves as the accuracy index approaches 1:

Equation 5: acc_i = 1 − (|err_i| − min(|err|)) / (max(|err|) − min(|err|))

where err = (err_1, …, err_n) and acc_i is the i-th absolute accuracy index.
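Equation 5's min-max normalization can be sketched as follows (ours; it assumes the |err| values are not all identical, so that max ≠ min):

```python
import numpy as np

def acc_index(errs):
    """Equation 5: flip min-max-normalized |err| so that 1 is the best score."""
    abs_err = np.abs(np.asarray(errs, dtype=float))
    lo, hi = abs_err.min(), abs_err.max()
    return 1.0 - (abs_err - lo) / (hi - lo)

acc = acc_index([0.0, 0.2, -0.5, 1.0])  # perfect estimate scores 1, worst scores 0
```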

Before performing the Friedman test, we ran Nemenyi’s test on the accuracy index

𝑎𝑐𝑐 to perform an initial ranking, identify the Critical Difference (CD), and create a ranking diagram. CD diagrams effectively summarize algorithm ranking, magnitude

of difference between them, and the significance of observed differences. Any two algorithms whose average performance rankings differ by more than the CD are significantly different [29]. Figure 5 presents the CD diagram of the

algorithms. We obtained a Critical Difference of 1.7507 for the average performance

rankings between the algorithms.

* CD: Critical Difference

Figure 5: CD diagram of the average algorithm performance rankings.

On the CD diagram, each algorithm is placed on an axis according to its average

performance ranking. Those algorithms that exhibit insignificant differences in

their average performance ranking are grouped together using a horizontal line.


According to Figure 5, although Calinski (CAL) algorithm has the best average

performance, its performance is not statistically different from kluster’s most

frequent product on BIC and BIC, which respectively had the second and third best

average performances. Although Nemenyi’s test is simple, it is less powerful than

its alternatives, such as the Friedman test, and hence is not often recommended in

practice [24]. We used this test for visualization purposes and for filtering out the top algorithms for the Friedman test. The Friedman test’s implementation in the ‘scmamp’ package only takes nine variables at a time, so we used the Nemenyi test results to select the top-performing kluster variants for the Friedman test.

The Friedman test (a.k.a., Friedman two-way analysis of variances by ranks) is a

widely-used non-parametric testing procedure for comparing and ranking observations

obtained from more than two related samples [22]. We applied the Friedman post-hoc test with

Bergmann and Hommel’s correction to the acc accuracy index obtained for the top six algorithms, plus the original PAM and AP algorithms. The summary of the average performance ranking of each algorithm over all the datasets is presented in Table 1.

Table 1: Average performance ranking from the Friedman post-hoc test

Rank  Algorithm             Average performance ranking
1     BIC kluster frequent  3.835
2     CAL                   3.879
3     BIC                   4.005
4     PAM kluster frequent  4.225
5     BIC kluster mean      4.296
6     PAM kluster mean      4.351
7     PAM                   5.197
8     AP                    6.208

The Friedman post-hoc test showed that kluster’s most frequent product on BIC has the best average performance, and it confirms the rankings obtained from the CD diagram for the remainder of the eight selected algorithms. To further evaluate the significance of

the average performance ranking differences, we studied the Bergmann and Hommel’s

corrected p-values of the Friedman test (Table 2).

Table 2: Pair-wise Bergmann and Hommel's corrected p-values from the Friedman test.

                      CAL     BIC     BIC k-f  BIC k-m  PAM k-f  PAM k-m  PAM
BIC                   1.0000
BIC kluster frequent  1.0000  1.0000
BIC kluster mean      1.0000  1.0000  1.0000
PAM kluster frequent  1.0000  1.0000  1.0000   1.0000
PAM kluster mean      1.0000  1.0000  1.0000   1.0000   1.0000
PAM                   0.0042  0.0113  0.0037   0.1439   0.0859   0.2178
AP                    0.0000  0.0000  0.0000   0.0000   0.0000   0.0000   0.0859

* k-f = kluster frequent; k-m = kluster mean. Insignificant p-values (pairwise differences) are highlighted in bold.

The p-values from the Friedman test also confirm that the average performance rankings for the top six algorithms are not significantly different from the top-performing algorithm, or from each other, at p-value < 0.05. These results support the second segment of our hypothesis, that applying a sampling strategy does not diminish the performance of cluster number approximation methods – indeed, we found that it can improve cluster number approximation.
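The ranking workflow above (average ranks plus a Friedman test across related datasets) can be sketched outside R. This is a minimal sketch using scipy in place of the 'scmamp' package; the accuracy values are hypothetical stand-ins for the acc index, not the paper's measurements.

```python
# Sketch of the non-parametric ranking workflow: rows = datasets,
# columns = algorithms, cells = hypothetical acc accuracy values.
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

acc = np.array([
    [0.95, 0.93, 0.90, 0.70],
    [0.92, 0.94, 0.88, 0.65],
    [0.97, 0.96, 0.91, 0.72],
    [0.90, 0.89, 0.85, 0.60],
])

# Friedman test: do the algorithms differ across the related datasets?
stat, p = friedmanchisquare(*acc.T)

# Average performance ranking per algorithm (rank 1 = best on a dataset)
avg_rank = rankdata(-acc, axis=1).mean(axis=0)
print(round(stat, 2), round(p, 4), avg_rank)
```

With real results, a post-hoc procedure (e.g., Nemenyi, or pairwise tests with Bergmann and Hommel's correction) would then compare the algorithms pairwise, as done in the paper.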

4.2. Experimental Results on the Second Set of Simulation Data

To evaluate the first segment of our hypothesis, that applying a sampling strategy can scale up cluster number approximation, we applied the kluster procedure to large datasets. The second set of simulation data contained large datasets with the number of rows ranging from 90,000 to 2,250,000. We used these datasets to exclusively re-evaluate the performance of the kluster procedure. We performed non-parametric hypothesis testing to rank kluster's performance and evaluate its sensitivity to sample size.

4.2.1. Re-evaluating the kluster procedure

We used the results from applying the kluster procedures to the large datasets to re-evaluate their performance. Following the non-parametric hypothesis testing procedure from the previous sub-section, we first performed Nemenyi's test on the accuracy index acc to produce an initial ranking, identify the Critical Difference (CD), and create a ranking diagram. Figure 6 presents the CD diagram of the kluster procedures. We obtained a Critical Difference of 1.1039 for the average performance rankings between the kluster procedures.
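For context, the Critical Difference in Nemenyi's test has a closed form, CD = q_α √(k(k+1)/(6N)). The sketch below computes it for k = 8 procedures over 91 datasets; the q value is the standard α = 0.05 studentized-range entry for k = 8 (an assumed input, since the paper's exact α and dataset count for this diagram may differ slightly).

```python
import math

def nemenyi_cd(q_alpha: float, k: int, n_datasets: int) -> float:
    """Critical Difference for the Nemenyi post-hoc test: two algorithms
    differ significantly if their average ranks differ by more than this."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n_datasets))

# q_alpha = 3.031 is the standard alpha = 0.05 value for k = 8 (assumed)
cd = nemenyi_cd(q_alpha=3.031, k=8, n_datasets=91)
print(round(cd, 3))  # close to the ~1.10 reported for 8 procedures
```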

Page 14: kluster: An Efficient Scalable Procedure for Approximating the Number of Clusters in Unsupervised … · representations of the input data to facilitate further prediction and/or

pre-print https://doi.org/10.1016/j.bdr.2018.05.003

* CD: Critical Difference

Figure 6: CD diagram of the average kluster procedure performance rankings.

Similar to the results we obtained from the first set of datasets, kluster's most frequent products on BIC and PAM, and their corresponding mean products, were respectively the top four cluster number approximation procedures – none were significantly different from each other. We then applied the Friedman post-hoc test with Bergmann and Hommel's correction to the acc accuracy index obtained for the kluster procedures (Table 3).

Table 3: Average performance ranking of kluster procedures from Friedman post-hoc test

Rank  Algorithm             Average performance ranking
1     BIC kluster frequent  3.230
2     PAM kluster frequent  3.368
3     BIC kluster mean      3.813
4     PAM kluster mean      3.884
5     CAL kluster frequent  4.065
6     CAL kluster mean      5.587
7     AP kluster frequent   5.637
8     AP kluster mean       6.412

The Friedman test verified the performance rankings produced by Nemenyi's test. The kluster's most frequent product on BIC still holds the best performance, although the same product on PAM is not significantly worse.

4.2.2. Processing time for the kluster procedure

As we expected, the kluster procedures were fast. For example, on a 2,250k-row dataset, the kluster procedure on BIC took from 36.99 seconds (with 100 samples), to 176.44 seconds (with 500 samples), to 444.6 seconds (with 1,000 samples). We evaluated the processing time for the two kluster procedures with the best accuracy performance, on the BIC and PAM algorithms. Because we are comparing two algorithms across multiple datasets, we first ran the Wilcoxon signed-ranks test [25] with Holm p-value adjustment on the processing times recorded from each procedure across the 91 large datasets. A p-value < 0.0001 suggests that the two procedures are significantly different from each other. On 66 datasets (out of 91, i.e., 72.527 percent) the kluster procedure on BIC was faster than the kluster procedure on PAM. The adjusted p-value (<0.0001) suggests a significant difference in processing time when the kluster procedure on BIC is faster than its equivalent on PAM. On 25 datasets (27.472 percent of the large datasets), the kluster procedure on PAM was faster than the kluster procedure on BIC; this difference was also significant at p-value < 0.0001. Overall, the kluster procedure on BIC is slightly faster than the kluster procedure on PAM.
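The timing comparison above can be sketched as a paired Wilcoxon signed-ranks test. The timings below are hypothetical stand-ins for two procedures measured on the same eight datasets, not the paper's measurements.

```python
# Sketch of the paired processing-time comparison: a non-parametric test
# on per-dataset timing differences (seconds; hypothetical values).
import numpy as np
from scipy.stats import wilcoxon

time_bic = np.array([36.9, 50.2, 41.0, 60.5, 33.3, 45.8, 70.1, 38.4])
time_pam = np.array([44.0, 48.9, 55.2, 75.0, 40.1, 59.3, 88.8, 41.2])

# Paired Wilcoxon signed-ranks test on the timing differences
stat, p = wilcoxon(time_bic, time_pam)

# Fraction of datasets on which the BIC-based procedure was faster
frac_bic_faster = float(np.mean(time_bic < time_pam))
print(round(p, 4), frac_bic_faster)
```

With several such pairwise comparisons, the p-values would then be adjusted (e.g., with Holm's method) as in the paper.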

4.2.3. Sensitivity of the kluster procedure to sample size

Results of our experiments on the first set of datasets showed that kluster procedures on BIC and PAM performed better than, or as well as, their corresponding methods applied to the entire dataset. These results were based on 25,000 iterations (100 sampling iterations × 25 simulation iterations) of samples of 100 data points drawn with replacement. The sensitivity of the kluster procedure to the size of the samples taken from the data is important for setting up a generalizable specification for the kluster procedure. To test the sensitivity of the procedure to sample size, we ran the kluster procedures with samples of 100, 200, 300, 400, 500, and 1,000 data points on the second set of datasets.

We focused on the kluster procedure on BIC, as our most efficient and recommended implementation of the kluster procedure. Similar to the hypothesis testing procedures that we followed in the previous sub-sections, we began by running Nemenyi's test on the accuracy index acc. Figure 7 presents the CD diagram of the kluster procedure on BIC across different sample sizes. We obtained a Critical Difference of 0.793 for the average performance rankings between the kluster procedures on different sample sizes.

* CD: Critical Difference

Figure 7: CD diagram of the average performance rankings of the kluster procedure on BIC over sample sizes.


As the CD diagram shows, there is essentially no difference in the performance ranking of the kluster procedure on BIC across samples of sizes 200 to 500. The sample size of 100, which we used in the experimental analyses on the first set of datasets, had the lowest performance, although the difference was still insignificant. The Friedman test also confirmed these results. With respect to these results, we recommend kluster's most frequent product with a sample size between 200 and 500 for approximating the number of clusters in large datasets.
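A minimal sketch of this sensitivity check, under stated assumptions: scikit-learn's GaussianMixture with BIC stands in for the paper's mclust-based BIC method, and the data are synthetic with three well-separated clusters.

```python
# Sketch of the sample-size sensitivity experiment: estimate k from single
# random samples of several sizes and compare. estimate_k is a stand-in
# that picks the k minimizing BIC over Gaussian mixtures.
import numpy as np
from sklearn.mixture import GaussianMixture

def estimate_k(sample: np.ndarray, k_max: int = 6) -> int:
    bics = [GaussianMixture(n_components=k, random_state=0)
            .fit(sample).bic(sample) for k in range(1, k_max + 1)]
    return int(np.argmin(bics)) + 1

rng = np.random.default_rng(2)
# Three well-separated 2-D clusters (true k = 3)
data = np.vstack([rng.normal(c, 1.0, size=(20000, 2)) for c in (0, 10, 20)])

results = {size: estimate_k(data[rng.integers(0, len(data), size)])
           for size in (100, 200, 500)}
print(results)
```

In the paper's procedure, each sample size is additionally repeated over many sampling iterations before taking the most frequent k.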

4.3. Results on Clinical Data

Our results on simulated data suggested that kluster's most frequent product on BIC was promising for application to large amounts of data. To test the utility of kluster's products for unsupervised clustering of clinical data from EHRs, we applied the kluster procedure (100 to 500 random samples with 100 sampling iterations) to over 320 million rows of data representing 25 clinical observations extracted from Partners HealthCare's Research Patient Data Registry (RPDR) [20]. Table 4 presents the results of kluster's most frequent product on BIC, along with the processing time and number of rows for each observation. The number of rows for each group of observations ranged from 1,226 to 34,341,494. Twenty out of the 25 observations had more than 1,000,000 rows of data (24 had more than 100,000 rows), which made running the cluster number identification algorithms on the entire dataset virtually impossible. Even before considering the accuracy of the results, we were able to complete the procedure and arrive at an approximated number of clusters for datasets with over 30 million rows of data in less than two minutes.

Table 4: kluster results (most frequent product on BIC) on RPDR observations data.

Observation                                kluster*  processing time (seconds)  rows of data
Human serum albumin**                      2         42.347                     15,079,716
Calcium                                    2         54.701                     26,975,428
Bicarbonate (HCO3)                         2         20.207                     740,440
Carbon dioxide, total                      2         38.327                     29,547,864
Chloride                                   2         30.91                      29,412,938
HDL cholesterol                            2         38.735                     152,706
LDL cholesterol                            3         23.84                      108,075
Total cholesterol                          2         38.557                     5,822,564
Potassium                                  2         47.937                     30,351,592
Albumin                                    2         102.727                    124,714
Sodium                                     2         22.712                     2,421,092
Hemoglobin                                 2         54.458                     1,257,063
Basophils [#/Volume]                       2         12.07                      5,312,086
Basophils/100 Leukocytes                   3         28.006                     5,327,076
Hemoglobin                                 2         68.058                     34,341,494
Lymphocytes/100 Leukocytes                 4         45.786                     2,100,438
Platelet count                             2         7.121                      1,226
Mean corpuscular hemoglobin concentration  1         65.273                     30,651,862
Mean cell volume (MCV)                     2         39.978                     30,651,826
Red blood cells (RBC)                      1         92.181                     30,652,302
BMI                                        2         41.861                     1,255,943
Diastolic Blood Pressure                   1         26.354                     12,699,441
eGFR                                       15        15.627                     4,374,748
Systolic Blood Pressure                    2         31.576                     12,699,441
Weight                                     3         44.092                     11,909,932

* kluster's most frequent product on BIC
** Plots for bolded observations are provided in Figure 8.

The processing times are even more remarkable given that the resulting cluster number approximations also passed the eye test. Unlike the simulated data, we did not have a preset gold standard number of clusters for EHR observations from which to calculate acceptable boundaries. Figure 8 illustrates the (mirrored) probability distribution of eight of the observations from Table 4. Horizontal axes in the Figure 8 plots were transformed to the square root of observation values for better visualization of the often-skewed distributions. It appears that clinical observation data often have a uniform distribution. Relating these distributions to the simulated datasets, clusters in clinical observation data were often tight – i.e., separation values were small. We found that kluster's most frequent product on BIC approximated two clusters in the majority of the observations. These observations often consist of a main body and a few outliers (e.g., RBC, Human serum albumin, and HCO3). The largest number of approximated clusters belonged to eGFR, which had a long distribution across the horizontal axis.


* Horizontal axes represent the square root of observation values. Vertical axes are probabilities mirrored around 0 for better visualization.

Figure 8: Density plots of 8 selected observations from RPDR Electronic Health Records (Basophils/100 Leukocytes, Lymphocytes/100 Leukocytes, Weight, eGFR, Hemoglobin, Bicarbonate (HCO3), Human serum albumin, Red blood cells (RBC)).

4.4. Results on Public Data

We applied the kluster procedure to four public datasets. Due to the relatively small size of these datasets (between 150 and 3,168 data samples), we were able to apply the original methods, as well as their kluster procedures, to each of these datasets. Due to the small sample sizes, we used the same specification for kluster as we used on our small-size simulation data (i.e., sample size = 100, iterations = 100).

The first dataset was Breast Cancer Wisconsin (Diagnostic) [30]. Features of this dataset are computed from a digitized image of a fine needle aspirate (FNA), describing characteristics of the cell nuclei of a breast mass present in the image. Figure 9 shows a scatter plot of mean area versus mean texture (standard deviation of gray-scale values), classified by diagnosis (M = malignant, B = benign). Although the two clusters of malignant and benign diagnoses are not well-separated, we can distinguish two clusters in this dataset when organized by mean area and texture.


Figure 9. Scatter plot of mean area versus mean texture by diagnosis (M = malignant, B = benign) in Breast Cancer Wisconsin (Diagnostic) dataset.

We applied the kluster procedure with 100 random samples taken from the data in 100 iterations; the small sample size was due to the small dataset. Results are presented in Table 5. Kluster's products on PAM were the fastest, with a perfect approximation. Kluster's products on BIC were also perfect in approximating k, but were slower than kluster's products on PAM. Among the original methods, applying the PAM and BIC methods to the entire dataset produced results comparable to their respective kluster's products. However, the processing time to obtain the same results was 63 times longer for the PAM method and more than 12 times longer for the BIC method.

Table 5: Results of applying the four cluster number approximation methods and the kluster procedure on the Breast Cancer Wisconsin dataset.

Method                                       Approximated k  Processing Time  ε*
Kluster's mean product on PAM                2               0.02324          0
Kluster's most frequent product on PAM       2               0.02324          0
Kluster's mean product on AP                 8               0.03762          6
Kluster's most frequent product on AP        8               0.03762          6
Kluster's mean product on BIC                2               0.59856          0
Kluster's most frequent product on BIC       2               0.59856          0
Kluster's mean product on Calinski           15              0.90375          13
Kluster's most frequent product on Calinski  15              0.90375          13
PAM algorithm                                2               1.466            0
AP algorithm                                 17              1.793            15
Calinski algorithm                           15              2.024            13
BIC algorithm                                2               7.566            0

* The actual number of clusters in the Breast Cancer Wisconsin dataset is 2 and n = 569. Kluster's products are based on 100 iterations and samples of 100 data points. Data is sorted by processing time. Method(s) with the best results are bolded.

The second public dataset was the famed Iris Species dataset [31], which contains 150 data samples (50 for each of the three species) and their properties. Figure 10 presents a scatter plot of the data based on petal length (cm) and petal width (cm). The three classes of Iris species, Iris Setosa, Iris Versicolour, and Iris Virginica, are distinguishable in the plot.

Figure 10: Scatter plot of petal length (cm) versus petal width (cm) by Iris

species in the Iris Species dataset

The kluster procedure with the same setting as before (100 random samples taken from the data in 100 iterations) was applied to the Iris dataset. Table 6 shows the results of this analysis. For this dataset, the BIC method and its kluster's products all produced a perfect approximation. The PAM method and its kluster's products all had similar results (two clusters), which were the next best cluster approximations for this dataset. Overall, all the methods except Calinski had an as good or better approximation result when the kluster procedure was applied. As expected, kluster's products had significantly shorter processing times than their original methods, except for the Calinski method, which was expected given the limited total sample size (150).

Table 6: Results of applying the four cluster number approximation methods and the kluster procedure on the Iris Species dataset.

Method                                       Approximated k  Processing Time  ε*
Kluster's mean product on PAM                2               0.0231           -1
Kluster's most frequent product on PAM       2               0.0231           -1
Kluster's mean product on AP                 4               0.0298           1
Kluster's most frequent product on AP        4               0.0298           1
AP algorithm                                 5               0.04             2
PAM algorithm                                2               0.06             -1
Kluster's mean product on BIC                3               0.4544           0
Kluster's most frequent product on BIC       3               0.4544           0
Calinski algorithm                           10              0.5              7
Kluster's mean product on Calinski           15              0.7439           12
Kluster's most frequent product on Calinski  15              0.7439           12
BIC algorithm                                3               0.81             0

* The actual number of clusters in the Iris Species dataset is 3 and n = 150. Kluster's products are based on 100 iterations and samples of 100 data points. Data is sorted by processing time. Method(s) with the best results are bolded.

The third public dataset we used was the Voice Gender dataset [32]. This dataset consists of 3,168 data samples, created to identify the gender of a voice as either male or female according to acoustic properties of the voice and speech. Figure 11 shows the scatter plot of meanfun (average of the fundamental frequency measured across the acoustic signal) versus modindx (modulation index: the accumulated absolute difference between adjacent measurements of the fundamental frequency, divided by the frequency range). The two clusters of males and females are easily recognizable in this figure.

Figure 11: Scatter plot of meanfun versus modindx by gender in the Voice Gender

dataset

Similar to the previous public datasets, we applied the kluster procedure using 100 random samples taken from the data in 100 iterations; the results of the evaluation are presented in Table 7. We found that the PAM algorithm, as well as its two kluster's products, had a perfect prediction of the cluster number. The BIC method had a very poor approximation, with nine clusters instead of two; however, kluster's most frequent product on BIC had a perfect approximation. Calinski was the only method with a better approximation of k than its kluster's products. In terms of processing time, the AP, BIC, Calinski, and PAM algorithms took respectively 2,880, 102, 30, and 6,538 times longer than their kluster's products.

Table 7: Results of applying the four cluster number approximation methods and the kluster procedure on the Voice Gender dataset.

Method                                       Approximated k  Processing Time  ε*
Kluster's mean product on PAM                2               0.0343           0
Kluster's most frequent product on PAM       2               0.0343           0
Kluster's mean product on AP                 11              0.0346           9
Kluster's most frequent product on AP        11              0.0346           9
Kluster's mean product on BIC                3               0.7218           1
Kluster's most frequent product on BIC       2               0.7218           0
Kluster's mean product on Calinski           14              1.1214           12
Kluster's most frequent product on Calinski  15              1.1214           13
Calinski algorithm                           4               33.77            2
BIC algorithm                                9               73.34            7
AP algorithm                                 66              99.64            64
PAM algorithm                                2               224.25           0

* The actual number of clusters in the Voice Gender dataset is 2 and n = 3,168. Kluster's products are based on 100 iterations and samples of 100 data points. Data is sorted by processing time. Method(s) with the best results are bolded.

The last dataset is the Pima Indians Diabetes Database [33]. Obtained from the National Institute of Diabetes and Digestive and Kidney Diseases, the dataset has 768 samples, all from female patients who are at least 21 years old and of Pima Indian heritage. The scatter plot of Glucose (plasma glucose concentration) versus BMI (body mass index) is presented in Figure 12. Similar to the Breast Cancer Wisconsin dataset, this dataset does not show a clear separation between the two outcome clusters in a 2-D plot, but the two classes are somewhat recognizable.


Figure 12: Scatter plot of glucose versus BMI by outcome in the Pima Indians

Diabetes Database

On the Pima Indians Diabetes Database, we found that the PAM method, its kluster's products, and BIC's kluster's products had a perfect approximation (Table 8). In addition, the kluster procedure improved the accuracy of the cluster number approximation for the BIC and AP methods. Calinski was the only method with a better approximation of k than its kluster's products. The AP, BIC, Calinski, and PAM algorithms took respectively 92, 10, 4, and 120 times longer than their kluster's products.

Table 8: Results of applying the four cluster number approximation methods and the kluster procedure on the Pima Indians Diabetes Database.

Method                                       Approximated k  Processing Time  ε*
Kluster's mean product on PAM                2               0.0355           0
Kluster's most frequent product on PAM       2               0.0355           0
Kluster's mean product on AP                 8               0.0397           6
Kluster's most frequent product on AP        8               0.0397           6
Kluster's mean product on BIC                2               0.9902           0
Kluster's most frequent product on BIC       2               0.9902           0
Kluster's mean product on Calinski           13              1.0267           11
Kluster's most frequent product on Calinski  15              1.0267           13
AP algorithm                                 24              3.64             22
Calinski algorithm                           6               3.87             4
PAM algorithm                                2               4.27             0
BIC algorithm                                3               10.25            1

* The actual number of clusters in the Pima Indians Diabetes Database is 2 and n = 768. Kluster's products are based on 100 iterations and samples of 100 data points. Data is sorted by processing time. Method(s) with the best results are bolded.


5. Discussion

Unsupervised learning has been applied to a variety of dimensionality reduction or

clustering tasks in Clinical Informatics. The high throughput of unlabeled data from

multi-site clinical repositories offers new opportunities to apply unsupervised

clustering for characterizing patients into groups with similar phenotypic

characteristics. Applying some of the most popular unsupervised clustering algorithms

(e.g., k-means and its many derivatives) to clinical data is dependent on

initializations, most importantly setting up the number of clusters, k. There are

multiple statistical solutions for approximating the number of clusters in a dataset.

We argued that these methods are computationally inefficient when dealing with large

amounts of clinical data, due to the high likelihood of over-fitting (which results

in over- or under-estimation of the number of clusters) and extensive computing

requirements. We hypothesized and showed that applying a sampling strategy can scale

up the cluster number approximation while improving accuracy. Based on our hypothesis,

we developed a procedure, kluster, which iteratively applies statistical cluster

number approximation methods to samples of data.
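The core of the procedure can be sketched in a few lines. This is a Python sketch of the idea, not the authors' R implementation; scikit-learn's GaussianMixture with BIC stands in for the mclust-based BIC method.

```python
# A minimal sketch of the kluster idea: approximate k on many small random
# samples drawn with replacement, then report the most frequent ("frequent")
# and mean ("mean") approximations.
from collections import Counter
import numpy as np
from sklearn.mixture import GaussianMixture

def estimate_k_bic(sample: np.ndarray, k_max: int = 8) -> int:
    """Pick the k whose Gaussian mixture minimizes BIC on this sample
    (a stand-in for the paper's mclust-based BIC method)."""
    bics = [GaussianMixture(n_components=k, random_state=0)
            .fit(sample).bic(sample) for k in range(1, k_max + 1)]
    return int(np.argmin(bics)) + 1

def kluster(data: np.ndarray, sample_size: int = 200,
            iterations: int = 100, seed: int = 0) -> dict:
    rng = np.random.default_rng(seed)
    ks = [estimate_k_bic(data[rng.integers(0, len(data), sample_size)])
          for _ in range(iterations)]
    return {"frequent": Counter(ks).most_common(1)[0][0],
            "mean": float(np.mean(ks))}

# Two well-separated 2-D clusters: the frequent product should recover k = 2
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0, 1, (5000, 2)), rng.normal(8, 1, (5000, 2))])
print(kluster(data, sample_size=200, iterations=20))
```

Because each iteration touches only a small sample, the cost grows with the sample size and iteration count rather than with the size of the full dataset.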

Bootstrap methods have been applied to various clustering problems, including approximation of cluster numbers. For example, bootstrapping has been used for estimating clustering instability and then selecting the optimal number of clusters that maximizes clustering stability [34–37]. In the case of Big Data, however, this would still require clustering to happen on large amounts of data. In addition, bootstrapping has been applied to defining the optimum number of clusters using statistical criteria such as Hubert's gamma statistic [38]. However, in dealing with unlabeled clinical observation data, extracting a representative training dataset (e.g., with less than 30 percent of the entire dataset) may still result in a large dataset. Iterative sampling can provide a scalable solution to this problem.

We tested the kluster procedure on four cluster number approximation methods with simulated data, as well as on clinical observation data. Our results showed that kluster's products were as good as, or better than, applying their corresponding cluster number approximation methods to the entire dataset. Taking processing time and accuracy into account, we found that kluster's most frequent product on the BIC method (the most frequent number of clusters approximated through kluster's iterations applying the BIC method to samples of data) performed better than any of the other methods, on almost any cluster structure. Testing the kluster procedure on clinical observation data also verified the reliability of kluster's most frequent product on BIC. We also evaluated the sensitivity of the kluster procedure to sample size. Based on the results of our analyses, we recommend kluster's most frequent product on BIC, with between 200 and 500 samples and 100 iterations, as an efficient procedure for unsupervised clustering of large-scale clinical datasets.

Currently, we have embedded the kluster procedure functions into an unsupervised learning pipeline for large clinical observation datasets from the Research Patient Data Registry (RPDR) that are being utilized by a multi-site clinical data research network, Accessible Research Commons for Health (ARCH). Using the kluster procedure has significantly reduced the computational requirements for developing and applying a variety of unsupervised clustering algorithms that would not otherwise have been possible, given the computational resources available in clinical settings.

6. Conclusion

Due to computational limitations, scalable analytical procedures are needed to extract knowledge from large clinical datasets. Many of the popular unsupervised clustering algorithms depend on pre-identification of the number of clusters, k. Over the past few decades, the statistics literature has presented different solutions that apply different quantitative indices to this problem. In the context of emerging large-scale clinical data networks, however, available statistical methods are computationally inefficient. In this paper we presented a simple, efficient procedure, kluster, for identifying the number of clusters in unsupervised learning. Using two sets of simulation datasets, as well as clinical and public datasets, we showed that kluster's most frequent product using the BIC method on random samples of 200-500 data points, with 100 iterations, provides a reliable and scalable solution for approximating the number of clusters in large clinical datasets. Together, the sampling strategy (i.e., the number of samples to take) and the simulation iterations we applied in the experimental analyses provided us with sufficient information to test the principal hypothesis of this study. Although we found that the choice of a sample size between 100 and 1,000 data points may not play a significant role, further work is required to establish (or test the existence of) best practices for the number of simulation iterations and the sample size.

Although kluster results are promising, the generalizability of its results may require further evaluation due to two limitations. First, we have only applied four of the available cluster number identification algorithms in kluster (BIC, PAM, AP, CAL). Implementations of the four algorithms are available in different R packages, as cited in this paper. We plan to incorporate more algorithms into the kluster R package in the near future. Nevertheless, with the current four algorithms we were able to find a scalable solution. Second, the simulation data used for our assessment of kluster was 2-dimensional and the clinical observation data was 1-dimensional. Further evidence might be needed to verify the effectiveness of the kluster procedure on datasets with higher dimensions.


In addition to the main kluster functions, we have also developed functions to compare accuracy and processing time for the kluster procedure and the four cluster number approximation methods. We provide all of these functions as an R package, named kluster, through GitHub: https://github.com/hestiri/kluster. We have also made all the code and results of the simulations conducted in this study publicly available on GitHub: https://github.com/hestiri/klusterX. The simulation data generated in this study is also available on the Harvard Dataverse Network and Mendeley Data (links will be provided after the peer review).

Acknowledgements

This work was supported in part by the Patient-Centered Outcomes Research Institute

(PCORI) Award (CDRN-1306-04608) for development of the National Patient-Centered

Clinical Research Network, known as PCORnet, NIH R01-HG009174, and the NLM training

grant T15LM007092. The authors are very grateful to the anonymous reviewers for

their valuable suggestions and comments to improve the quality of this paper.

References

[1] Z. Ghahramani, Unsupervised Learning, in: O. Bousquet, U. von Luxburg, G. Rätsch (Eds.), Advanced Lectures on Machine Learning: ML Summer Schools 2003, Canberra, Australia, February 2-14, 2003, Tübingen, Germany, August 4-16, 2003, Revised Lectures, Springer Berlin Heidelberg, Berlin, Heidelberg, 2004: pp. 72–112. doi:10.1007/978-3-540-28650-9_5.

[2] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning:

Data Mining, Inference, and Prediction, 2009. doi:10.1007/b94608.

[3] A.K. Jain, Data clustering: 50 years beyond K-means, Pattern Recognit. Lett. 31

(2010) 651–666. doi:10.1016/j.patrec.2009.09.011.

[4] C.A. Sugar, G.M. James, Finding the Number of Clusters in a Dataset, J. Am.

Stat. Assoc. 98 (2003) 750–763. doi:10.1198/016214503000000666.

[5] G. Hamerly, C. Elkan, Learning the k in k means, Adv. Neural Inf. Process. 17

(2004) 1–8. doi:10.1.1.9.3574.

[6] C. Fraley, A.E. Raftery, Model-Based Clustering, Discriminant Analysis, and

Density Estimation, J. Am. Stat. Assoc. 97 (2002) 611–631.

doi:10.1198/016214502760047131.

[7] T. Caliński, J.A. Harabasz, A dendrite method for cluster analysis, Commun.

Stat. 3 (1974) 1–27. doi:10.1080/03610927408827101.


[8] L. Kaufman, P.J. Rousseeuw, Clustering by means of medoids, in: Statistical Data Analysis Based on the L1-Norm and Related Methods, First International Conference, 1987: pp. 405–416.

[9] L. Kaufman, P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis (Wiley Series in Probability and Statistics), 1990.

[10] R. Tibshirani, G. Walther, T. Hastie, Estimating the number of clusters in a

data set via the gap statistic, J. R. Stat. Soc. Ser. B (Statistical Methodol.

63 (2001) 411–423. doi:10.1111/1467-9868.00293.

[11] C. Fraley, A.E. Raftery, How Many Clusters? Which Clustering Method? Answers Via

Model-Based Cluster Analysis, Comput. J. 41 (1998) 578–588.

doi:10.1093/comjnl/41.8.578.

[12] B.J. Frey, D. Dueck, Clustering by passing messages between data points.,

Science. 315 (2007) 972–976. doi:10.1126/science.1136800.

[13] T. Pinto, G. Santos, L. Marques, T.M. Sousa, I. Pra??a, Z. Vale, S.L. Abreu,

Solar intensity characterization using data-mining to support solar forecasting,

in: Adv. Intell. Syst. Comput., 2015: pp. 193–201. doi:10.1007/978-3-319-19638-

1_22.

[14] P.J. Rousseeuw, Silhouettes: A graphical aid to the interpretation and validation

of cluster analysis, J. Comput. Appl. Math. 20 (1987) 53–65. doi:10.1016/0377-

0427(87)90125-7.

[15] L. Scrucca, M. Fop, T.B. Murphy, A.E. Raftery, mclust 5: Clustering, Classification and Density Estimation Using Gaussian Finite Mixture Models, R J. 8 (2016) 289–317.

[16] J. Oksanen, F.G. Blanchet, R. Kindt, P. Legendre, P.R. Minchin, R.B. O’Hara, G.L. Simpson, P. Solymos, M.H.H. Stevens, H. Wagner, vegan: Community Ecology Package, R Package Version 2.4-1 (2017). https://cran.r-project.org/package=vegan.

[17] C. Hennig, fpc: Flexible Procedures for Clustering (2018). https://cran.r-project.org/package=fpc.

[18] U. Bodenhofer, A. Kothmeier, S. Hochreiter, APCluster: An R package for affinity propagation clustering, Bioinformatics 27 (2011) 2463–2464. doi:10.1093/bioinformatics/btr406.


[19] W. Qiu, H. Joe, clusterGeneration: Random Cluster Generation (with Specified Degree of Separation) (2015). https://cran.r-project.org/package=clusterGeneration.

[20] R. Nalichowski, D. Keogh, H.C. Chueh, S.N. Murphy, Calculating the benefits of a Research Patient Data Repository, AMIA Annu. Symp. Proc. (2006) 1044.

[21] S. García, D. Molina, M. Lozano, F. Herrera, A study on the use of non-parametric tests for analyzing the evolutionary algorithms’ behaviour: A case study on the CEC’2005 Special Session on Real Parameter Optimization, J. Heuristics 15 (2009) 617–644. doi:10.1007/s10732-008-9080-4.

[22] S. García, A. Fernández, J. Luengo, F. Herrera, Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power, Inf. Sci. (Ny) 180 (2010) 2044–2064. doi:10.1016/j.ins.2009.12.010.

[23] G. Santafe, I. Inza, J.A. Lozano, Dealing with the evaluation of supervised classification algorithms, Artif. Intell. Rev. 44 (2015) 467–508. doi:10.1007/s10462-015-9433-y.

[24] B. Calvo, G. Santafé, scmamp: Statistical Comparison of Multiple Algorithms in Multiple Problems, R J. XX (2015) 8.

[25] F. Wilcoxon, Individual Comparisons by Ranking Methods, Biometrics Bull. 1 (1945) 80. doi:10.2307/3001968.

[26] S. Holm, A simple sequential rejective multiple test procedure, Scand. J. Stat. 6 (1979) 65–70. doi:10.2307/4615733.

[27] M. Friedman, The Use of Ranks to Avoid the Assumption of Normality Implicit in the Analysis of Variance, J. Am. Stat. Assoc. 32 (1937) 675–701. doi:10.1080/01621459.1937.10503522.

[28] B. Bergmann, G. Hommel, Improvements of General Multiple Test Procedures for Redundant Systems of Hypotheses, in: P. Bauer, G. Hommel, E. Sonnemann (Eds.), Mult. Hypothesenprüfung / Mult. Hypotheses Test., Springer Berlin Heidelberg, Berlin, Heidelberg, 1988: pp. 100–115.

[29] J. Demšar, Statistical Comparisons of Classifiers over Multiple Data Sets, J. Mach. Learn. Res. 7 (2006) 1–30.

[30] W.H. Wolberg, W.N. Street, O.L. Mangasarian, Breast Cancer Wisconsin (Diagnostic) Data Set, UCI Mach. Learn. Repos. (1992).

[31] R.A. Fisher, The use of multiple measurements in taxonomic problems, Ann. Eugen. 7 (1936) 179–188. doi:10.1111/j.1469-1809.1936.tb02137.x.


[32] K. Becker, Identifying the Gender of a Voice using Machine Learning, Primary Objects (2016). http://www.primaryobjects.com/2016/06/22/identifying-the-gender-of-a-voice-using-machine-learning/ (accessed August 20, 2017).

[33] J.W. Smith, J.E. Everhart, W.C. Dickson, W.C. Knowler, R.S. Johannes, Using the ADAP Learning Algorithm to Forecast the Onset of Diabetes Mellitus, Proc. Annu. Symp. Comput. Appl. Med. Care (1988) 261–265.

[34] Y.X. Fang, J.H. Wang, Selection of the number of clusters via the bootstrap method, Comput. Stat. Data Anal. 56 (2012) 468–477. doi:10.1016/j.csda.2011.09.003.

[35] A.K. Jain, J.V. Moreau, Bootstrap technique in cluster analysis, Pattern Recognit. 20 (1987) 547–568. doi:10.1016/0031-3203(87)90081-1.

[36] C. Garcia, BoCluSt: Bootstrap clustering stability algorithm for community detection, PLoS One 11 (2016). doi:10.1371/journal.pone.0156576.

[37] M.K. Kerr, G.A. Churchill, Bootstrapping cluster analysis: Assessing the reliability of conclusions from microarray experiments, Proc. Natl. Acad. Sci. 98 (2001) 8961–8965. http://www.pnas.org/cgi/content/abstract/98/16/8961.

[38] M.A. Newell, D. Cook, H. Hofmann, J.L. Jannink, An algorithm for deciding the number of clusters and validation using simulated data with application to exploring crop population structure, Ann. Appl. Stat. 7 (2013) 1898–1916. doi:10.1214/13-AOAS671.