Deep Learning for Enhancing Precision Medicine
Min Oh
Dissertation submitted to the faculty of the Virginia Polytechnic Institute and State University in
partial fulfillment of the requirements for the degree of
Doctor of Philosophy
In
Computer Science and Applications
Liqing Zhang (Chair)
Bert Huang
B. Aditya Prakash
Zhi Sheng
Youngmi Yoon
May 10th, 2021
Blacksburg, Virginia
Keywords: Deep Learning, Precision Medicine, Omics data
© 2021, Min Oh CC BY
ABSTRACT
Most medical treatments have been developed aiming at the best-on-average efficacy for large
populations, resulting in treatments successful for some patients but not for others. This necessitates
precision medicine, which tailors medical treatment to individual patients. Omics data
holds comprehensive genetic information on individual variability at the molecular level and hence
the potential to be translated into personalized therapy. However, the attempts to transform omics
data-driven insights into clinically actionable models for individual patients have been limited.
Meanwhile, advances in deep learning, one of the most promising branches of artificial intelligence,
have produced unprecedented performance in various fields. Although several deep learning-based
methods have been proposed to predict individual phenotypes, they have not established the state
of the practice, due to instability of selected or learned features derived from extremely high
dimensional data with low sample sizes, which often results in overfitted models with high
variance. To overcome the limitation of omics data, recent advances in deep learning models,
including representation learning models, generative models, and interpretable models, can be
considered. The goal of the proposed work is to develop deep learning models that can overcome
the limitation of omics data to enhance the prediction of personalized medical decisions. To
achieve this, three key challenges should be addressed: 1) effectively reducing dimensions of
omics data, 2) systematically augmenting omics data, and 3) improving the interpretability of
omics data.
GENERAL AUDIENCE ABSTRACT
Most medical treatments have been developed aiming at the best-on-average efficacy for large
populations, resulting in treatments successful for some patients but not for others. This necessitates
precision medicine, which tailors medical treatment to individual patients. Biological
data such as DNA sequences and snapshots of genetic activities hold comprehensive information
on individual variability and hence the potential to accelerate personalized therapy. However, the
attempts to transform data-driven insights into clinical models for individual patients have been
limited. Meanwhile, advances in deep learning, one of the most promising branches of artificial
intelligence, have produced unprecedented performance in various fields. Although several deep
learning-based methods have been proposed to predict individual treatment or outcome, they have
not established the state of the practice, due to the complexity of biological data and limited
availability, which often result in overfitted models that may work on training data but not on test
data or unseen data. To overcome the limitation of biological data, recent advances in deep learning
models, including representation learning models, generative models, and interpretable models,
can be considered. The goal of the proposed work is to develop deep learning models that can
overcome the limitation of omics data to enhance the prediction of personalized medical decisions.
To achieve this, three key challenges should be addressed: 1) effectively reducing the complexity
of biological data, 2) generating realistic biological data, and 3) improving the interpretability of
biological data.
Acknowledgments
I found happiness and pleasure in the accomplishments of my Ph.D. program, but from time to time
it was like walking down an endless tunnel, with a chain of frustrations arising from
uncertainties. I was able to get it done because so many people supported and helped me. Above all, I would
like to express my sincere appreciation to my wife, Boram Choi, for her devoted love and support.
Without her support, it would have been impossible to finish my degree. I also express my sincere gratitude
to my mother, Jin-Hyang Kang, for letting me dream big and praying for blessings. It was my great
privilege and pleasure to have Professor Liqing Zhang as my advisor. I appreciate her support and
advice, and especially her being on my side in many uncertain situations. I would like to thank my
committee members, Dr. Youngmi Yoon, Dr. Bert Huang, Dr. B. Aditya Prakash, and Dr. Zhi
Sheng. In particular, Dr. Yoon encouraged me to dream of pursuing a Ph.D. overseas and was
continuously dedicated to inspiring me to succeed. I was lucky enough to have four internships at
Microsoft, where I met smart and warm mentors and managers who helped me grow. Specifically, I
appreciate Dr. Erdal Coşgun, Dr. Alexandra Savelieva, Santhanagopalan Raghavan, Manuel
Schröder, and Rouslan Beletski.
Table of contents
Introduction ......................................................................................................................... 1
Objectives ........................................................................................................................... 4
Chapter 1: Deep representation learning for disease prediction based on microbiome data ............. 6
1.1 Introduction ............................................................................................................... 6
1.2 Methods ..................................................................................................................... 7
1.3 Results ..................................................................................................................... 14
1.4 Discussion ................................................................................................................ 17
Chapter 2: Generalizing predictions to unseen sequencing profiles via visual data
augmentation .................................................................................................................... 19
2.1 Introduction ............................................................................................................. 19
2.2 Results ..................................................................................................................... 20
2.3 Discussion ................................................................................................................ 28
2.4 Methods ................................................................................................................... 28
Chapter 3: Deep generalized interpretable autoencoder elucidates gut microbiota for better
cancer immunotherapy ...................................................................................................... 34
3.1 Introduction ............................................................................................................. 34
3.2 Methods ................................................................................................................... 36
3.3 Results ..................................................................................................................... 40
3.4 Discussion ................................................................................................................ 43
Conclusion ........................................................................................................................ 45
Appendix A ....................................................................................................................... 47
Appendix B ....................................................................................................................... 59
Appendix C ....................................................................................................................... 86
References ......................................................................................................................... 89
Introduction
Most medical treatments have been developed aiming at the best-on-average efficacy for large populations,
resulting in treatments successful for some patients but not for others [1]. This necessitates more
precise medical treatment and prevention strategies that take individual variability into account [2].
Precision medicine refers to the customization of medical treatment tailored to the individual characteristics
of each patient [3]. With a better understanding of the patient’s genetic information, medical decisions can
be personalized and more effective [4, 5]. As a basis enabling precision medicine, omics data, including
genomic, transcriptomic, and metagenomic data, is increasingly being studied [6-8]. Omics data holds
comprehensive genetic information presenting individual variability at the molecular level and hence the
potential to be translated into personalized therapy. However, the attempts to transform omics data-driven
insights into clinically actionable models for an individual patient have been limited.
Meanwhile, advances in artificial intelligence have enabled smarter data-driven approaches for biomedical
research. In particular, deep learning, one of the most promising branches of artificial intelligence, has
produced unprecedented performance in various fields [9]. The major advancements have been in image
and speech recognition as well as natural language processing and language translation. The successes of
deep learning originate from how it learns hierarchical representations of data by increasing the level of
abstraction [10]. Several deep learning-based solutions have been proposed to predict individual phenotypes,
by engineering omics data features using state-of-the-art deep architectures [11-13]. These deep models
outperform traditional machine learning models in terms of accuracy; however, they have not established
the state of the practice, due to instability of selected or learned features derived from extremely high
dimensional data with low sample sizes. Generally, omics data is high-dimensional compared to the number
of samples in most studies. For example, a typical gene expression study measures the activity of tens of
thousands of genes per person, while only a few hundred patients and healthy controls are examined. The
sample-to-dimension ratio can be even much lower in metagenomic data, which contains hundreds of
thousands of strain-level gene markers for only a few hundred or fewer samples. In contrast to typical
deep models trained on a massive amount of data in relatively low-dimensional space in fields such as
image recognition [14], models trained on the limited omics data in high-dimensional space are often
overfitted with high variance. The high-dimensional omics data with a low sample size entails the sparsity
of the data in feature space, limiting generalization of the learned model. As a result, most deep models
based on omics data come with little or no guarantees for reliable decisions on unseen data.
One naïve solution to the problem of high-dimensional data with low sample sizes might be to
collect many more samples than the number of dimensions. However, in general practice, it is nearly
impossible to secure that many samples regardless of the type of omics data. For example, collecting a
gene expression profile with the FDA-approved 2000-GEP test to diagnose a tumor site
costs about $3,300 per person [15]. This test measures the expression levels of over two thousand
genes per person, so approximately 7 million dollars would be needed to acquire 2,000 samples. Ideally, at
least 178,000 samples would be required to achieve the same sample-to-dimension ratio as the MNIST handwritten
digit data set that has enabled successful deep learning models, at a cost of over 587 million US dollars. Even
if the cost issue were resolved, it would be highly unrealistic to recruit that many patients who share the same
medical condition and consent to the use of their samples. For instance, in 2019, about 30,000 cases in the US
were diagnosed with small cell lung cancer, which has the lowest 5-year survival rate (6%) among the types of
lung cancer, the leading cause of cancer death [16]. Even if collecting data from all
the survivors were possible, the number of available cases would be far less than that required.
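The back-of-the-envelope figures above can be checked with a few lines of arithmetic, assuming MNIST's roughly 70,000 images of 28 × 28 = 784 pixels (an assumption on my part; the text does not state MNIST's dimensions):

```python
# Sample-to-dimension ratio of MNIST: ~70,000 images, 28 x 28 = 784 pixels.
mnist_ratio = 70_000 / 784            # ~89.3 samples per dimension

# Matching that ratio for a ~2,000-gene expression test:
n_genes = 2_000
required_samples = mnist_ratio * n_genes   # ~178,600 samples

# At ~$3,300 per profile:
total_cost = required_samples * 3_300
print(round(required_samples), round(total_cost / 1e6))
```

Rounding the sample count down to 178,000 and multiplying by $3,300 gives the figure of just over 587 million dollars quoted above.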
As a feasible solution, traditionally, dimensionality reduction algorithms have been utilized to overcome
the high dimensionality. Dimensionality reduction algorithms aim at mapping original samples in high-
dimensional space into a low-dimensional space. Linear embedding techniques such as principal
component analysis have been widely used, although they usually incur significant information loss, as the
mapped variables explain only a limited amount of the variance in the omics data. Another potential
solution is generating synthetic omics data. Data augmentation techniques have been used to amplify
training data and can regularize prediction models to some extent. However, the regularized models
tend to underperform in external validation, which implies that current data augmentation techniques may not
generalize to an unseen domain whose data distribution is significantly shifted.
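To see the limited-variance issue concretely, here is a minimal sketch, with random data standing in for an omics matrix and PCA computed via SVD (the toy sizes and the use of noise are assumptions for illustration only):

```python
import numpy as np

# Stand-in for an omics matrix: 100 samples x 10,000 features of noise
# (real profiles are structured, but the sample-to-dimension imbalance is similar).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10_000))

# PCA via SVD of the centered matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Embed the samples on the top-k principal components.
k = 10
Z = Xc @ Vt[:k].T

# Fraction of total variance explained by the top-k components:
# with n_samples << n_features, it stays small.
explained = float((S[:k] ** 2).sum() / (S ** 2).sum())
print(Z.shape, round(explained, 3))
```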
Hence, to overcome the limitation of omics data, a more reasonable way should be examined. Currently,
although recent advances in deep learning models have been made in particular fields such as computer
vision and natural language processing, little effort has been devoted to applying them to omics data.
Consequently, most prediction models perform well only on carefully controlled omics data
sets. Thus, the goal of the proposed work is to develop deep learning models that can overcome the limitation
of omics data to enhance the prediction of personalized medical decisions. To achieve this, three key
challenges should be addressed: 1) effectively reducing dimensions of omics data, 2) systematically
augmenting omics data, and 3) improving the interpretability of omics data. The key factor enabling deep
learning to address the challenges is the combination of lessons derived from deep representation learning,
deep generative models, and interpretable deep learning:
• Deep neural networks have been successfully utilized to extract representations better suited
for secondary analysis in various domains, including image recognition and audio translation,
compared to traditional representation learning algorithms. This deep representation learning may
allow us to capture the underlying hidden structure of complex omics data in a low-dimensional
space. By improving the quality of embedding in biologically relevant latent space, it might be
possible to get a better interpretation of data in much lower-dimensional space.
• Deep generative models have been applied to augment image data and have improved image
classification performance. The success in data augmentation with deep generative models could
be transferred to the omics data by learning the probability distribution of omics data and
amplifying training data with synthetic data.
• Deep learning models are usually black-box and their outcomes are difficult to interpret. The lack
of interpretability may prevent the prediction models from being adopted in clinical practice as
clinicians and decision-makers prioritize the explainability of the predictions. Interpretable deep
learning has the potential to provide useful insights for understanding prediction derived from
omics data.
Objectives
The goal of the proposed work is to develop deep learning models that can overcome the limitation of omics
data derived from its high dimensionality to enhance the prediction of personalized medical decisions.
Objective 1: Deep Representation Learning for Disease Prediction Based on Microbiome Data
Human microbiota plays a key role in human health and growing evidence supports the potential use of
microbiome as a predictor of various diseases. However, the high dimensionality of microbiome data, often
in the order of hundreds of thousands of features, combined with low sample sizes, poses a great challenge for
machine learning-based prediction algorithms. This imbalance makes the data highly sparse, preventing models
from learning good predictions. Also, there has been little work on deep learning applications to microbiome data
with a rigorous evaluation scheme. To address these challenges, we propose DeepMicro, a deep
representation learning framework allowing for the effective representation of microbiome profiles.
DeepMicro transforms high-dimensional microbiome data into a robust low-dimensional representation
using various autoencoders and applies machine learning classification algorithms on the learned
representation.
Objective 2: Generalizing Predictions to Unseen Sequencing Profiles via Visual Data Augmentation
Predictive models trained on sequencing profiles often fail to achieve expected performance when
externally validated on unseen profiles. While many factors such as batch effects, small data sets, and
technical errors contribute to the gap between source and unseen data distributions, it is a challenging
problem to generalize the predictive models across studies without any prior knowledge of the unseen data
distribution. This study proposes DeepBioGen, a sequencing profile augmentation procedure that
characterizes visual patterns of sequencing profiles, generates realistic profiles based on a deep generative
model capturing the patterns, and generalizes the subsequent classifiers.
Objective 3: Deep Generalized Interpretable Autoencoder Elucidating Gut Microbiota for Better
Cancer Immunotherapy
Recent studies revealed that gut microbiota modulates the response to cancer immunotherapy and fecal
microbiota transplantation has clinical benefit in melanoma patients during the treatment. Understanding
the microbiota that affect individual responses is crucial to advancing precision oncology. However, it is
challenging to identify the key microbial taxa with limited data as statistical and machine learning models
often lose their generalizability. In this study, DeepGeni, a deep generalized interpretable autoencoder, is
proposed to improve the generalizability and interpretability of microbiome profiles by augmenting data
and by introducing interpretable links in the autoencoder.
Chapter 1: Deep Representation Learning for Disease Prediction Based on
Microbiome Data
1.1 Introduction
As our knowledge of microbiota grows, it becomes increasingly clear that the human microbiota plays a
key role in human health and diseases [17]. The microbial community, composed of trillions of microbes,
is a complex and diverse ecosystem living on and inside a human. These commensal microorganisms
benefit humans by allowing them to harvest inaccessible nutrients and maintain the integrity of mucosal
barriers and homeostasis. Especially, the human microbiota contributes to the host immune system
development, affecting multiple cellular processes such as metabolism and immune-related functions [17,
18]. They have been shown to be responsible for the carcinogenesis of certain cancers and to substantially affect
therapeutic response [19]. All this emerging evidence substantiates the potential use of microbiota as a
predictor of various diseases [20].
The development of high-throughput sequencing technologies has enabled researchers to capture a
comprehensive snapshot of the microbial community of interest. The most common components of the
human microbiome can be profiled with 16S rRNA gene sequencing technology in a cost-effective way
[21]. Comparatively, shotgun metagenomic sequencing technology can provide a deeper resolution profile
of the microbial community at the strain level [22, 23]. As the cost of shotgun metagenomic sequencing
keeps decreasing and the resolution increasing, it is likely that a growing role of the microbiome in human
health will be uncovered from the mounting metagenomic datasets.
Although novel technologies have dramatically increased our ability to characterize the human microbiome,
and there is evidence suggesting its potential use for predicting disease state,
effectively utilizing human microbiome data faces several key challenges. Firstly, effective
dimensionality reduction that preserves the intrinsic structure of the microbiome data is required to handle
the high dimensional data with low sample sizes, especially the microbiome data with strain-level
information that often contains hundreds of thousands of gene markers for only a few hundred or fewer
samples. With a low number of samples, a large number of features can cause the curse of dimensionality,
usually inducing sparsity of the data in the feature space. Along with traditional dimensionality reduction
algorithms, an autoencoder, which learns a low-dimensional representation by reconstructing the input [24], can
be applied to microbiome data. Secondly, given the rapidly accumulating metagenomic data, there has been
inadequate effort in adapting machine learning algorithms to predict disease state from microbiome
data. In particular, deep learning is a class of machine learning algorithms that builds on large multi-layer
neural networks, and that can potentially make effective use of metagenomic data. With the rapidly growing
attention from both academia and industry, deep learning has produced unprecedented performance in
various fields, including not only image and speech recognition, natural language processing, and language
translation but also biological and healthcare research [11]. A few studies have applied deep learning
approaches to abundance profiles of the human gut microbiome for disease prediction [25, 26]. However,
there has been no research utilizing strain-level profiles for this purpose. Strain-level profiles,
often containing information on hundreds of thousands of gene markers, should be more informative
for accurately classifying samples into patient and healthy control groups across different types of diseases
than abundance profiles, which usually contain abundance information for only a few hundred bacteria [27].
Lastly, to evaluate and compare the performance of machine learning models, it is necessary to introduce a
rigorous validation framework to estimate their performance on unseen data. Pasolli et al. [27], who
built classification models based on microbiome data, utilized a 10-fold cross-validation scheme that tunes
the hyper-parameters on the test set without using a validation set. This approach may overestimate
model performance, as it exposes the test set to the model during training [28, 29].
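The distinction matters in code. Below is a minimal sketch of the stricter scheme, with toy arrays standing in for microbiome profiles and arbitrary split sizes (none of these numbers come from the study):

```python
import numpy as np

# Toy data standing in for microbiome profiles and disease labels.
rng = np.random.default_rng(42)
X = rng.normal(size=(120, 50))
y = rng.integers(0, 2, size=120)

# Hold out the test set FIRST; hyper-parameters are tuned only on the
# training and validation portions, never on the test set.
idx = rng.permutation(len(X))
test_idx = idx[:24]           # 20% held-out test set
val_idx = idx[24:48]          # validation set for hyper-parameter tuning
train_idx = idx[48:]          # training set

X_train, X_val, X_test = X[train_idx], X[val_idx], X[test_idx]
y_train, y_val, y_test = y[train_idx], y[val_idx], y[test_idx]

# The three subsets are disjoint; test performance is reported once,
# after all tuning is finished.
assert set(test_idx).isdisjoint(val_idx) and set(test_idx).isdisjoint(train_idx)
print(len(X_train), len(X_val), len(X_test))
```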
To address these issues, we propose DeepMicro, a deep representation learning framework that deploys
various autoencoders to learn robust low-dimensional representations from high-dimensional microbiome
profiles and trains classification models based on the learned representation. We applied a thorough
validation scheme that excludes the test set from hyper-parameter optimization to ensure fairness of model
comparison. Our model surpasses the current best methods in disease state prediction for
inflammatory bowel disease, type 2 diabetes in both the Chinese and the European women cohorts, liver
cirrhosis, and obesity. DeepMicro is open-source, publicly available software that will benefit future research,
allowing researchers to obtain robust low-dimensional representations of microbiome profiles with user-
defined deep architectures and hyper-parameters.
1.2 Methods
Dataset and Extracting Microbiome Profiles
We considered publicly available human gut metagenomic samples of six different disease cohorts:
inflammatory bowel disease (IBD), type 2 diabetes in European women (EW-T2D), type 2 diabetes in
Chinese (C-T2D) cohort, obesity (Obesity), liver cirrhosis (Cirrhosis), and colorectal cancer (Colorectal).
All these samples were derived from whole-genome shotgun metagenomic studies that used Illumina
paired-end sequencing technology. Each cohort consists of healthy control and patient samples as shown in
Table 1. IBD cohort has 25 individuals with inflammatory bowel disease and 85 healthy controls [30]. EW-
T2D cohort has 53 European women with type 2 diabetes and 43 healthy European women [31]. C-T2D
cohort has 170 Chinese individuals with type 2 diabetes and 174 healthy Chinese controls [32]. Obesity
cohort has 164 obese patients and 89 non-obese controls [33]. Cirrhosis cohort has 118 patients with liver
cirrhosis and 114 healthy controls [34]. Colorectal cohort has 48 colorectal cancer patients and 73 healthy
controls [35]. In total, 1,156 human gut metagenomic samples, obtained from MetAML repository [27],
were used in our experiments.
Table 1. Human gut microbiome datasets used for disease state prediction

Disease                      Dataset name   # total samples   # healthy controls   # patient samples   Data source
Inflammatory Bowel Disease   IBD            110               85                   25                  [30]
Type 2 Diabetes              EW-T2D         96                43                   53                  [31]
Type 2 Diabetes              C-T2D          344               174                  170                 [32]
Obesity                      Obesity        253               89                   164                 [33]
Liver Cirrhosis              Cirrhosis      232               114                  118                 [34]
Colorectal Cancer            Colorectal     121               73                   48                  [35]
Two types of microbiome profiles were extracted from the metagenomic samples: 1) strain-level marker
profile and 2) species-level relative abundance profile. MetaPhlAn2 was utilized to extract these profiles
with default parameters [23]. We utilized MetAML to preprocess the abundance profile by selecting
species-level features and excluding sub-species-level features [27]. The strain-level marker profile consists
of binary values indicating the presence (1) or absence (0) of a certain strain. The species-level relative
abundance profile consists of real values in [0,1] indicating the percentages of the species in the total
observed species. The abundance profile has a few hundred dimensions, whereas the marker profile has a
much larger number of dimensions, up to over a hundred thousand in the current data (Table 2).
Table 2. The number of dimensions of the preprocessed microbiome profiles

Profile type        IBD      EW-T2D   C-T2D     Obesity   Cirrhosis   Colorectal
marker profile      91,756   83,456   119,792   99,568    120,553     108,034
abundance profile   443      381      572       465       542         503
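As a toy illustration of the two profile types (all values below are made up for clarity, not drawn from the datasets):

```python
import numpy as np

# Strain-level marker profile: binary presence (1) / absence (0) of gene markers.
marker_profile = np.array([1, 0, 0, 1, 1, 0, 1, 0])

# Species-level relative abundance profile: fractions in [0, 1] that sum
# to 1 over the observed species.
abundance_profile = np.array([0.45, 0.30, 0.15, 0.10])

print(int(marker_profile.sum()), float(abundance_profile.sum()))
```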
Deep Representation Learning
An autoencoder is a neural network reconstructing its input 𝑥. Internally, its general form consists of an
encoder function 𝑓𝜙(∙) and a decoder function 𝑓′𝜃(∙) where 𝜙 and 𝜃 are parameters of encoder and decoder
functions, respectively. An autoencoder is trained to minimize the difference between an input 𝑥 and a
reconstructed input 𝑥′, i.e., the reconstruction loss (e.g., squared error), which can be written as follows:
𝐿(𝑥, 𝑥′) = ‖𝑥 − 𝑥′‖² = ‖𝑥 − 𝑓′𝜃(𝑓𝜙(𝑥))‖².
After training an autoencoder, we are interested in obtaining a latent representation 𝑧 = 𝑓𝜙(𝑥) of the input
using the trained encoder. The latent representation, usually in a much lower-dimensional space than the
original input, contains sufficient information for reconstructing the original input as close as possible. We
utilized this representation to train classifiers for disease prediction.
For the DeepMicro framework, we incorporated various deep representation learning techniques, including
shallow autoencoder (SAE), deep autoencoder (DAE), variational autoencoder (VAE), and convolutional
autoencoder (CAE), to learn a low-dimensional embedding of microbiome profiles. Note that diverse
combinations of hyper-parameters defining the structure of the autoencoders (e.g., the number of units and
layers) have been explored in a grid fashion as described below; however, users are not limited to the tested
hyper-parameters and can use their own hyper-parameter grid fitted to their data.
Firstly, we utilized SAE, the simplest autoencoder structure composed of the encoder part where the input
layer is fully connected with the latent layer, and the decoder part where the output layer produces
reconstructed input 𝑥′ by taking weighted sums of outputs of the latent layer. We introduced a linear
activation function for the latent and output layer. Other options for the loss and activation functions are
available to users (such as binary cross-entropy and the sigmoid function). Weights and
biases were initialized with the Glorot uniform initializer [36]. We examined five different sizes of dimensions
for the latent representation (32, 64, 128, 256, and 512).
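As a minimal sketch of the SAE just described (linear activations, squared-error loss, Glorot-style initialization, plain gradient descent), here is a numpy toy version on random data; it is not the DeepMicro implementation, and the dimensions are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 64, 100, 32                 # samples, input dim, latent dim (toy sizes)
X = rng.normal(size=(n, d))

def glorot_uniform(fan_in, fan_out):
    # Glorot uniform: U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out)).
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

W_enc = glorot_uniform(d, k)          # encoder: input -> latent
W_dec = glorot_uniform(k, d)          # decoder: latent -> reconstruction

lr, losses = 0.5, []
for _ in range(300):
    Z = X @ W_enc                     # latent representation z = f_phi(x)
    X_hat = Z @ W_dec                 # reconstruction x'
    diff = X_hat - X
    losses.append(float((diff ** 2).mean()))
    g = 2.0 * diff / X.size           # dL/dX_hat for the mean squared error
    grad_dec = Z.T @ g
    grad_enc = X.T @ (g @ W_dec.T)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

# After training, X @ W_enc is the low-dimensional representation that
# downstream classifiers would consume.
print(losses[0] > losses[-1])
```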
In addition to the SAE model, we implemented the DAE model by introducing hidden layers between the
input and latent layers as well as between the latent and output layers. All of the additional hidden layers
were equipped with the Rectified Linear Unit (ReLU) activation function and the Glorot uniform initializer. The
same number of hidden layers (one layer or two layers) were inserted into both encoder and decoder parts.
Also, we gradually increased the number of hidden units: the number of hidden units in an added layer
was set to double that of the succeeding layer in the encoder part and double that of the preceding layer in
the decoder part. With this setting, model complexity is controlled by both the number of hidden units and
the number of hidden layers, maintaining structural symmetry of the model. For example, if the latent layer
has 512 hidden units and if two layers are inserted to the encoder and decoder parts, then the resulting
autoencoder has 5 hidden layers with 2048, 1024, 512, 1024, and 2048 hidden units, respectively. Similar
to SAE, we varied the number of hidden units in the latent layer over 32, 64, 128, 256, and 512; thus, in
total, we tested 10 different DAE architectures (Appendix A Table A2).
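The layer-sizing rule above can be stated in a few lines; `dae_layer_sizes` is a hypothetical helper for illustration, not part of DeepMicro:

```python
def dae_layer_sizes(latent_units, layers_per_side):
    """Hidden-unit counts for the symmetric DAE described above:
    each layer toward the input doubles the units of its neighbor."""
    encoder = [latent_units * 2 ** i for i in range(layers_per_side, 0, -1)]
    return encoder + [latent_units] + encoder[::-1]

# Example from the text: latent size 512 with two layers per side.
print(dae_layer_sizes(512, 2))   # [2048, 1024, 512, 1024, 2048]
```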
A variational autoencoder (VAE) learns probabilistic representations 𝑧 given input 𝑥 and then uses these
representations to reconstruct input 𝑥′ [37]. Using variational inference, the true posterior distribution of
latent embeddings (i.e., 𝑝(𝑧|𝑥)) can be approximated by the introduced posterior 𝑞𝜙(𝑧|𝑥) where 𝜙 are
parameters of an encoder network. Unlike the previous autoencoders learning an unconstrained
representation, VAE learns a generalized latent representation under the assumption that the posterior
approximation follows Gaussian distribution. The encoder network encodes the means and variances of the
multivariate Gaussian distribution. The latent representation 𝑧 can be sampled from the learned posterior
distribution 𝑞𝜙(𝑧|𝑥) ~ Ν(𝜇, Σ). Then the sampled latent representation is passed into the decoder network
to generate the reconstructed input 𝑥′ ~ 𝑔𝜃(𝑥|𝑧) where 𝜃 are the parameters of the decoder.
To approximate the true posterior, we need to minimize the Kullback-Leibler (KL) divergence between the
introduced posterior and the true posterior,
𝐾𝐿 (𝑞𝜙(𝑧|𝑥)||𝑝(𝑧|𝑥)) = −𝐸𝐿𝐵𝑂(𝜙, 𝜃; 𝑥) + log(𝑝(𝑥)),
rewritten as
log(𝑝(𝑥)) = 𝐸𝐿𝐵𝑂(𝜙, 𝜃; 𝑥) + 𝐾𝐿 (𝑞𝜙(𝑧|𝑥)||𝑝(𝑧|𝑥)),
where 𝐸𝐿𝐵𝑂(𝜙, 𝜃; 𝑥) is an evidence lower bound on the log probability of the data because the KL term
must be greater than or equal to zero. It is intractable to compute the KL term directly but minimizing the
KL divergence is equivalent to maximizing the lower bound, decomposed as follows:
𝐸𝐿𝐵𝑂(𝜙, 𝜃; 𝑥) = 𝔼𝑞𝜙(𝑧|𝑥)[log(𝑔𝜃(𝑥|𝑧))] − 𝐾𝐿 (𝑞𝜙(𝑧|𝑥)||𝑝(𝑧)).
The final objective function can be induced by converting the maximization problem to the minimization
problem.
𝐿(𝜙, 𝜃; 𝑥) = −𝔼𝑞𝜙(𝑧|𝑥)[log(𝑔𝜃(𝑥|𝑧))] + 𝐾𝐿 (𝑞𝜙(𝑧|𝑥)||𝑝(𝑧))
The first term can be viewed as a reconstruction term, as it forces the inferred latent representation to recover its corresponding input, and the second KL term can be considered a regularization term that constrains the posterior of the learned representation to be a Gaussian distribution. We used ReLU activation and the Glorot uniform initializer for the intermediate hidden layers in the encoder and decoder. One intermediate hidden layer was used, and the number of its hidden units was varied over 32, 64, 128, 256, and 512. The latent layer was set to 4, 8, or 16 units. Thus, altogether we tested 15 different model structures.
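As a concrete illustration, the objective above can be written in a few lines of numpy. This is a minimal sketch, not DeepMicro's implementation: it uses the standard closed-form KL divergence between a diagonal-Gaussian posterior and a standard-normal prior, and a squared-error reconstruction term; the function name and shapes are illustrative.

```python
import numpy as np

def vae_loss(x, x_recon, mu, logvar):
    """Per-sample VAE objective: reconstruction term plus KL regularizer.

    For a diagonal-Gaussian posterior q(z|x) = N(mu, diag(exp(logvar)))
    and a standard-normal prior p(z) = N(0, I), the KL term has the
    closed form 0.5 * sum(exp(logvar) + mu^2 - 1 - logvar).
    """
    recon = np.sum((x - x_recon) ** 2)  # squared-error reconstruction loss
    kl = 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)
    return recon + kl

# When the posterior equals the prior (mu = 0, logvar = 0) and the
# reconstruction is perfect, the loss vanishes.
x = np.array([1.0, 2.0])
print(vae_loss(x, x, np.zeros(4), np.zeros(4)))  # 0.0
```

Note how the KL term grows as the posterior drifts away from the prior, which is exactly the regularization effect discussed above.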
Instead of fully connected layers, a convolutional autoencoder (CAE) is equipped with convolutional layers
in which each unit is connected to only local regions of the previous layer [38]. A convolutional layer
consists of multiple filters (kernels) and each filter has a set of weights used to perform convolution
operation that computes dot products between a filter and a local region [39]. We used ReLU activation and the Glorot uniform initializer for convolutional layers. We did not use any pooling layer, as it may generalize too much for reconstructing an input. The 𝑛-dimensional input vector was reshaped into a square image of size 𝑑 × 𝑑 × 1, where 𝑑 = ⌊√𝑛⌋ + 1. As 𝑑² ≥ 𝑛, we padded the remainder of the reshaped input with zeros. To be flexible with respect to input size, the filter size of the first convolutional layer was set to 10% of the input width and height, respectively (i.e., ⌊0.1𝑑⌋ × ⌊0.1𝑑⌋). For the first convolutional layer, we used 25% of the filter size as the stride, which configures how far the filter slides. For the following convolutional layers in the encoder, we used 10% of the output size of the preceding layer as the filter size and 50% of this filter size as the stride. All units in the last convolutional layer of the encoder were flattened in a subsequent flatten layer, which is designated as the latent layer. We utilized convolutional transpose layers (deconvolutional layers) to make the decoder symmetric to the encoder. In
our experiment, the number of filters in a convolutional layer was set to half of that of the preceding layer
for the encoder part. For example, if the first convolutional layer has 64 filters and there are three
convolutional layers in the encoder, then the following two convolutional layers have 32 and 16 filters,
respectively. We varied the number of convolutional layers from 2 to 3 and tried five different numbers of
filters in the first convolutional layer (4, 8, 16, 32, and 64). In total, we tested 10 different CAE model
structures.
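The input-reshaping step described above is simple to sketch; `reshape_to_image` below is an illustrative helper, not DeepMicro's actual code.

```python
import numpy as np

def reshape_to_image(x):
    """Pad an n-dimensional profile with zeros and reshape it to a
    d x d x 1 'image' with d = floor(sqrt(n)) + 1, as described above."""
    n = x.shape[0]
    d = int(np.floor(np.sqrt(n))) + 1
    padded = np.zeros(d * d)
    padded[:n] = x  # zero-pad the remaining d*d - n entries
    return padded.reshape(d, d, 1)

img = reshape_to_image(np.arange(10, dtype=float))  # n = 10 -> d = 4
print(img.shape)  # (4, 4, 1)
```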
To train deep representation models, we split each dataset into a training set, a validation set, and a test set
(64% training set, 16% validation set, and 20% test set; Appendix A Figure A1). Note that the test set was
withheld from training the model. We used an early-stopping strategy: we trained the models on the training set, computed the reconstruction loss on the validation set after each epoch, stopped training if there was no improvement in validation loss for 20 consecutive epochs, and then selected the model with the lowest validation loss as the best model. We used mean squared error for the reconstruction loss and applied the adaptive
moment estimation (Adam) optimizer for gradient descent with default parameters (learning rate: 0.001,
epsilon: 1e-07) as provided in the original paper [40]. We utilized the encoder part of the best model to
produce a low-dimensional representation of the microbiome data for downstream disease prediction.
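The early-stopping rule can be sketched independently of any deep learning framework. `train_with_early_stopping` below is a hypothetical helper that operates on a precomputed list of per-epoch validation losses; in the actual pipeline the losses would come from evaluating the autoencoder after each training epoch.

```python
def train_with_early_stopping(val_losses, patience=20):
    """Early-stopping sketch: scan per-epoch validation losses, stop after
    `patience` epochs without improvement, return (best_epoch, best_loss)."""
    best_loss, best_epoch, wait = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                break  # no improvement for `patience` epochs
    return best_epoch, best_loss

# With patience 2, training stops before reaching the late dip at 0.4.
print(train_with_early_stopping([1.0, 0.5, 0.6, 0.7, 0.4], patience=2))  # (1, 0.5)
```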
Prediction of disease states based on the learned representation
We built classification models based on the encoded low-dimensional representations of microbiome
profiles (Figure 1). Three machine learning algorithms, support vector machine (SVM), random forest (RF),
and Multi-Layer Perceptron (MLP), were used. We explored hyper-parameter space with grid search. SVM
maximizes the margin between the supporting hyperplanes to optimize a decision boundary separating data
points of different classes [41]. In this study, we utilized both radial basis function (RBF) kernel and a linear
kernel function to compute decision margins in the transformed space to which the original data was
mapped. We varied penalty parameter C (2-5, 2-3, …, 25) for both kernels as well as kernel coefficient gamma
(2-15, 2-13, …, 23) for RBF kernel. In total, 60 different combinations of hyper-parameters were examined to
optimize SVM (Appendix A Table A2).
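For reference, the grids described above can be enumerated with a short sketch (the grid values follow the description; DeepMicro's actual grid-search code may differ):

```python
import itertools

# C in {2^-5, 2^-3, ..., 2^5}; gamma in {2^-15, 2^-13, ..., 2^3}
Cs = [2.0 ** e for e in range(-5, 6, 2)]       # 6 values of C
gammas = [2.0 ** e for e in range(-15, 4, 2)]  # 10 values of gamma

# The linear kernel varies only C; the RBF kernel varies both C and gamma,
# so the RBF grid alone accounts for 60 combinations.
linear_grid = [{"kernel": "linear", "C": C} for C in Cs]
rbf_grid = [{"kernel": "rbf", "C": C, "gamma": g}
            for C, g in itertools.product(Cs, gammas)]

print(len(linear_grid), len(rbf_grid))  # 6 60
```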
Figure 1. DeepMicro framework. An autoencoder is trained to map the input X to the low-dimensional
latent space with the encoder and to reconstruct X with the decoder. The encoder part is reused to produce
a latent representation of any new input X that is in turn fed into a classification algorithm to determine
whether the input is the positive or negative class.
RF builds multiple decision trees based on various sub-samples of the training data and merges them to
improve the prediction accuracy. The size of sub-samples is the same as that of training data but the samples
are drawn randomly with replacement from the training data. For the hyper-parameter grid of RF classifier,
the number of trees (estimators) was set to 100, 300, 500, 700, and 900, and the minimum number of
samples in a leaf node was altered from 1 to 5. Also, we tested two criteria, Gini impurity and information
gain, for selecting features to split a node in a decision tree. For the maximum number of features considered when searching for the best split, we used the square root of 𝑛 and the logarithm to base 2 of 𝑛 (where 𝑛 is the number of features). In total, we tested 100 combinations of hyper-parameters of RF.
MLP is an artificial neural network classifier that consists of an input layer, hidden layers, and an output
layer. All of the layers are fully connected to their successive layer. We used ReLU activations for all hidden layers and a sigmoid activation for the output layer, which has a single unit. The number of units in each hidden layer was set to half that of the preceding layer, except for the first hidden layer. We varied the number of
hidden layers (1, 2, and 3), the number of epochs (30, 50, 100, 200, and 300), the number of units in the
first hidden layer (10, 30, 50, 100), and dropout rate (0.1 and 0.3). In total, 120 hyper-parameter
combinations were tested in our experiment.
We implemented DeepMicro in Python 3.5.2 using machine learning and data analytics libraries, including
Numpy 1.16.2, Pandas 0.24.2, Scipy 1.2.1, Scikit-learn 0.20.3, Keras 2.2.4, and Tensorflow 1.13.1. Source
code is publicly available at the git repository (https://github.com/minoh0201/DeepMicro).
Performance Evaluation
To avoid an overestimation of prediction performance, we designed a thorough performance evaluation
scheme (Appendix A Figure A1). For a given dataset (e.g. Cirrhosis), we split it into training and test sets in the ratio of 8:2 with a given random partition seed, keeping the class ratio in both the training and test sets the same as that of the full dataset. Using only the training set, a representation learning model
was trained. Then, the learned representation model was applied to the training set and test set to obtain
dimensionality-reduced training and test sets. After the dimensionality had been reduced, we conducted 5-fold cross-validation on the training set by varying the hyper-parameters of the classifiers. The best hyper-
parameter combination for each classifier was selected by averaging an accuracy metric of the five different
results. The area under the receiver operating characteristics curve (AUC) was used for performance
evaluation. We trained a final classification model using the whole training set with the best combination
of hyper-parameters and tested it on the test set. This procedure was repeated five times by changing the
random partition seed at the beginning of the procedure. The resulting AUC scores were averaged and the
average was used to compare model performance.
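The class-ratio-preserving split at the heart of this scheme can be sketched as follows. This is a simplified stand-in (the actual implementation may rely on library utilities such as scikit-learn's stratified splitters): per class, indices are shuffled with a given seed and a fixed fraction is held out for the test set.

```python
import random

def stratified_split(labels, test_ratio=0.2, seed=0):
    """Class-ratio-preserving train/test split, returning index lists."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train, test = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        k = int(round(len(idx) * test_ratio))  # per-class hold-out size
        test.extend(idx[:k])
        train.extend(idx[k:])
    return sorted(train), sorted(test)

labels = [1] * 30 + [0] * 70
train, test = stratified_split(labels, 0.2, seed=42)
print(len(train), len(test))  # 80 20
```

Repeating the procedure with five different seeds and averaging the test AUCs, as described above, then amounts to looping this split over seed values.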
1.3 Results
We developed DeepMicro, a deep representation learning framework for predicting individual phenotype
based on microbiome profiles. Various autoencoders (SAE, DAE, VAE, and CAE) have been utilized to
learn a low-dimensional representation of the microbiome profiles. Then three classification models
including SVM, RF, and MLP were trained on the learned representation to discriminate between disease
and control sample groups. We tested our framework on six disease datasets (Table 1), including
inflammatory bowel disease (IBD), type 2 diabetes in European women (EW-T2D), type 2 diabetes in
Chinese (C-T2D), obesity (Obesity), liver cirrhosis (Cirrhosis), and colorectal cancer (Colorectal). For all
the datasets, two types of microbiome profiles, strain-level marker profile and species-level relative
abundance profile, have been extracted and tested (Table 2). Also, we devised a thorough performance
evaluation scheme that isolates the test set from the training and validation sets in the hyper-parameter
optimization phase to compare various models (See Methods and Appendix A Figure A1).
We compared our method to the current best approach (MetAML) that directly trained classifiers, such as
SVM and RF, on the original microbiome profile [27]. We utilized the same hyper-parameters grid used in
MetAML for each classification algorithm. In addition, we tested Principal Component Analysis (PCA)
and Gaussian Random Projection (RP), using them as the replacement of the representation learning to
observe how traditional dimensionality reduction algorithms behave. For PCA, we selected the principal
components explaining 99% of the variance in the data [42]. For RP, we set the number of components to
be automatically adjusted according to Johnson-Lindenstrauss lemma (eps parameter was set to 0.5) [43-
45].
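Selecting the principal components that explain 99% of the variance can be sketched via the SVD of the centered data matrix; this is an illustrative computation (the study presumably used scikit-learn's PCA, which supports the same variance-fraction criterion).

```python
import numpy as np

def n_components_for_variance(X, threshold=0.99):
    """Number of principal components needed to explain `threshold`
    of the variance, from the singular values of the centered data."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    ratios = (s ** 2) / np.sum(s ** 2)  # per-component explained variance
    return int(np.searchsorted(np.cumsum(ratios), threshold) + 1)

# Rank-1 data: a single component captures all the variance.
X = np.outer(np.arange(5.0), np.array([1.0, 2.0, 3.0]))
print(n_components_for_variance(X))  # 1
```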
We picked the best model for each approach in terms of prediction performance and compared the
approaches across the datasets. Figure 2 shows the results of DeepMicro and the other approaches for the
strain-level marker profile. DeepMicro outperforms the other approaches for five datasets, including IBD
(AUC = 0.955), EW-T2D (AUC = 0.899), C-T2D (AUC = 0.763), Obesity (AUC = 0.659), and Cirrhosis
(AUC = 0.940). For Colorectal dataset, DeepMicro has slightly lower performance than the best approach
(DeepMicro’s AUC = 0.803 vs. MetAML’s AUC = 0.811). The marker profile-based models generally
perform better than the abundance profile-based models (Appendix A Figure A8 and A2). The only
exception is Obesity dataset for which the abundance-based DeepMicro model shows better performance
(AUC = 0.674). Note that as AUC could be misleading in an imbalanced classification scenario [46], we
also evaluated the area under the precision-recall curve (AUPRC) for the imbalanced data set IBD and
observed the same trend between AUC and AUPRC (Appendix A Table A3).
Figure 2. Disease prediction performance for marker profile-based models. Prediction performance of
various methods built on marker profile has been assessed with AUC. MetAML utilizes support vector
machine (SVM) and random forest (RF), and the superior model is presented (green). Principal component
analysis (PCA; blue) and gaussian random projection (RP; yellow) have been applied to reduce dimensions
of datasets before classification. DeepMicro (red) applies shallow autoencoder (SAE), deep autoencoder
(DAE), variational autoencoder (VAE), and convolutional autoencoder (CAE) for dimensionality reduction.
Then SVM, RF, and multi-layer perceptron (MLP) classification algorithms have been used.
For marker profile, none of the autoencoders dominate across the datasets in terms of getting the best
representation for classification. Also, the best classification algorithm varied according to the learned
representation and to the dataset (Figure 3). For abundance profile, CAE dominates over the other
autoencoders with RF classifier across all the datasets (Appendix A Figure A3).
Figure 3. Disease prediction performance for different autoencoders based on marker profile (assessed with
AUC). Classifiers used: support vector machine (SVM), random forest (RF), and multi-layer perceptron
(MLP); Autoencoders used: shallow autoencoder (SAE), deep autoencoder (DAE), variational autoencoder
(VAE), and convolutional autoencoder (CAE)
We also directly trained MLP on the dataset without representation learning and compared the prediction
performance with that of the traditional approach (the best between SVM and RF). It is shown that MLP
performs better than MetAML in three datasets, EW-T2D, C-T2D, and Obesity, when marker profile is
used (Appendix A Figure A4). However, when abundance profile is used, the performance of MLP was
worse than that of the traditional approach across all the datasets (Appendix A Figure A5).
Furthermore, we compared running time of DeepMicro on marker profiles with a basic approach not using
representation learning. For comparison, we tracked both training time and representation learning time.
For each dataset, we tested the best performing representation learning model producing the highest AUC
score (i.e. SAE for IBD and EW-T2D, DAE for Obesity and Colorectal, and CAE for C-T2D and Cirrhosis;
Appendix A Table A1). We fixed the seed for random partitioning of the data, and applied the formerly
used performance evaluation procedure where 5-fold cross-validation is conducted on the training set to
obtain the best hyper-parameter with which the best model is trained on the whole training set and is
evaluated on the test set (See Methods). The computing machine we used for timestamping is running on
Ubuntu 18.04 and equipped with an Intel Core i9-9820X CPU (10 cores), 64 GB Memory, and a GPU of
NVIDIA GTX 1080 Ti. We note that our implementation utilizes GPU when it learns representations and
switches to CPU mode to exhaustively use multiple cores in a parallel way to find best hyper-parameters
of the classifiers. Table 3 shows the benchmarking result on marker profile. It is worth noting that
DeepMicro is 8 to 30 times faster than the basic approach (17 times faster on average). Even if MLP is excluded from the benchmarking because it requires heavy computation, DeepMicro is up to 5 times faster than the basic approach (2 times faster on average).
Table 3. Time benchmark for DeepMicro and basic approaches without representation learning (in sec)

Method                      IBD     EW-T2D   C-T2D    Obesity   Cirrhosis   Colorectal
Basic approach
  SVM                       126     85       1705     711       777         187
  RF                        42      41       99       79        72          50
  MLP                       3,776   2,449    12,057   8,186     8,593       4,508
  Total elapsed             3,943   2,575    13,861   8,976     9,442       4,745
DeepMicro
  RL*                       74      194      554      113       521         215
  SVM                       2       2        8        8         17          2
  RF                        28      28       47       33        40          30
  MLP                       103     93       188      137       287         105
  Total elapsed             207     317      798      291       864         352

*RL: Representation Learning; SVM: Support Vector Machine; RF: Random Forest; MLP: Multi-layer Perceptron
1.4 Discussion
We developed a deep learning framework transforming a high-dimensional microbiome profile into a low-
dimensional representation and building classification models based on the learned representation. At the
beginning of this study, the main goal was to reduce dimensions as strain-level marker profile has too many
dimensions to handle, expecting that noisy and unnecessary information fades out and the refined
representation becomes tractable for downstream prediction. First, we tested PCA on the marker profile; it showed a slight improvement in prediction performance for C-T2D and Obesity but not for the others.
The preliminary result indicates that either some of the meaningful information was dropped or noisy
information still remains. To learn meaningful feature representations, we trained various autoencoders on
microbiome profiles. Our intuition behind the autoencoders was that the learned representation should keep
essential information in a condensed way because autoencoders are forced to prioritize which properties of
the input should be encoded during the learning process. We found that although the most appropriate
autoencoder usually allows for better representation that in turn results in better prediction performance,
what kind of autoencoder is appropriate highly depends on problem complexity and intrinsic properties of
the data.
In the previous study, it has been shown that adding healthy controls of the other datasets could improve
prediction performance assessed by AUC [27]. To check if this finding can be reproduced, for each dataset,
we added control samples of the other datasets only into the training set and kept the test set the same as
before. Appendix A Figure A6 shows the difference between the best performing models built with and
without additional controls. In general, prediction performance dropped (on average by 0.037) once negative (control) samples were introduced to the training set, across the datasets and in almost all approaches except a few cases (Appendix A Figure A6). In contrast to the previous study, the result indicates that
the insertion of only negative samples into the training set may not help to improve the classification models,
and a possible explanation might be that changes in the models rarely contribute to improving the
classification of positive samples [47]. Interestingly, if we added negative samples into the whole dataset before splitting it into training and test sets, we usually observed improvements in prediction performance.
However, we found that these improvements are trivial because introducing negative samples into the test
set easily reduces false positive rate (as the denominator of false positive rate formula is increased),
resulting in higher AUC scores.
Even though adding negative samples might not be helpful for a better model, it does not mean that
additional samples are meaningless. We argue that more samples can improve prediction performance,
especially when a well-balanced set of samples is augmented. To test this argument, we gradually increased
the proportion of the training set and observed how prediction performance changed over the training sets
of different sizes. Generally, improved prediction performance has been observed as more data of both
positive and negative samples are included (Appendix A Figure A7). With the continued availability of
large samples of microbiome data, the deep representation learning framework is expected to become
increasingly effective for both condensed representation of the original data and also downstream prediction
based on the deep representation.
Chapter 2: Generalizing predictions to unseen sequencing profiles via visual
data augmentation
2.1 Introduction
Predictive models relying on genomic signatures and biomarkers often suffer significantly inferior performance in independent validation on external data sets in biomedical research areas such as disease diagnostics, prognostics, drug discovery, and precision medicine, contributing to the reproducibility crisis [48-51]. Irreproducible models can lead to not only invalid conclusions misleading
subsequent studies but also a substantial waste of time and effort for researchers trying to commercialize
the models to benefit patients [52]. A major factor behind these failures is the lack of generalizability across studies, in each of which the number of heterogeneous data points is insufficient to obtain the statistical power to overcome the generalization barrier. In addition to the small sample size, there is usually a significant
gap between source data that are used to train classifiers and target data that are used to evaluate the
classifiers. One possible cause of the gap is the batch effect such as different sample cohorts, different lab
environments, and differences in experimental protocols across studies [51, 53], which violates the
assumption that source and target data are drawn from the same distribution.
In many real-world applications, trained systems fail to produce accurate predictions for unseen data with
the shifted distribution. For example, illumination or viewpoint changes in data acquisition for an object
detection system and noisier environments for a speech-to-text translation system could easily disrupt the
desired outcome. To address this issue, domain adaptation algorithms have been proposed to better align
source and target data in a domain-invariant feature space when knowledge of target domains is available
during the training phase [54-56]. However, in practice, it is common that no clue on the target domain is
provided. As a more ambitious goal, domain generalization studies focus on training a model generalizing
to the unseen domain without any foreknowledge of the unseen domain. Recent studies proposed different
ways of domain generalization such as extracting domain-invariant features [57-59], leveraging self-
supervised tasks to guide and learn robust representation [60], simulating domain shift in meta-learning
and adding perturbed samples [62, 63]. Although these methods achieved promising performance on benchmark data sets, their requirements, such as having datasets from multiple source domains or datasets large enough for splitting and simulating domain shift, are often not satisfied in biomedical research, where only a limited number of heterogeneous data points from a single source domain is available.
Data augmentation techniques in the computer vision field show promising potential in improving
classifiers by reducing overfitting to source data [64-66]. Especially, recent advances in deep generative
models such as generative adversarial networks (GAN) [67] allow generating visual contents that are
indistinguishable from real ones and also augmenting image data to guide in finding better decision
boundaries [64, 66]. More recently, generative models have been utilized to augment medical images,
including Magnetic Resonance Images (MRI) [68], computed tomography (CT) [69], and X-ray images
[70]. However, there has been little effort in transferring the success in computer vision to biomedical
sequencing data [71]. Furthermore, it is unclear whether augmentation of sequencing data could overcome
the generalization barrier across different studies.
In this study, we propose DeepBioGen, a data augmentation procedure that establishes visual patterns from sequencing profiles and, based on a conditional Wasserstein GAN, generates new sequencing profiles capturing those patterns, to enhance the generalizability of prediction models to unseen data.
DeepBioGen outperforms other augmentation methods in generalizing classifiers to unseen data. Also, the
classifiers generalized by DeepBioGen surpass state-of-the-art classifiers that are designed to work on
unseen profiles when tested on two scenarios: devising a prediction model for immune checkpoint blockade
(anti-PD1) responsiveness in melanoma patients based on RNA sequencing (RNA-seq) data and building
a diagnostic model for type 2 diabetes based on whole-genome metagenomic sequencing data. DeepBioGen
source code is free and available at https://anonymous.4open.science/r/dda7fadf-514e-41b9-a578-
9de25edb4a70/.
2.2 Results
Formation and augmentation of visual patterns of sequencing profiles
Sequencing profiles, such as RNA-seq measurements of gene expression levels, consist of numerical values
that indicate the activity of thousands of genes in different samples or patients. While many statistical
methods such as multivariate linear regression assume that variables are independent of one another, in
reality, genes’ activities are highly correlated [72]. In DeepBioGen, to take into account and visually
formalize the interactivity of related genes, similar features in the profiles were clustered together,
presenting visible patterns after converting numerical values to colors (Figure 4a). Subsequently, a
conditional Wasserstein GAN equipped with convolutional layers to capture the local visual patterns was
implemented to augment the sequencing profiles conditioned on class labels. During the augmentation
phase, multiple GANs were initialized and trained with different random seeds to promote diversity in the
augmented data points (Figure 4b).
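The feature-wise clustering step can be sketched with a small k-means on the feature (column) vectors, after which same-cluster columns are placed side by side so they form visible blocks once values are mapped to colors. The choice of k-means and the parameters here are illustrative stand-ins for DeepBioGen's actual procedure.

```python
import numpy as np

def cluster_features(X, k=3, iters=20, seed=0):
    """Cluster the columns of X with k-means and reorder them so that
    same-cluster features are adjacent. Returns (reordered X, labels)."""
    rng = np.random.default_rng(seed)
    F = X.T  # one row per feature
    centers = F[rng.choice(len(F), size=k, replace=False)]
    for _ in range(iters):
        # assign each feature to its nearest center, then update centers
        labels = np.argmin(((F[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = F[labels == j].mean(axis=0)
    order = np.argsort(labels, kind="stable")  # same-cluster columns adjacent
    return X[:, order], labels[order]

# Three groups of duplicated features end up side by side.
X = np.hstack([np.ones((5, 2)), np.zeros((5, 2)), np.full((5, 2), 5.0)])
Xs, lab = cluster_features(X, k=3)
print(Xs.shape)  # (5, 6)
```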
To inspect the visual quality of augmented data, two different sequencing profiles were used to train the
generative models: one is RNA-seq expression profiles of melanoma patients, and the other is gut
microbiome profiles of type 2 diabetes patients. Visual assessment showed that the augmented profiles
preserved the boundaries of the clustered features and within-cluster color patterns in the same manner as
source data. It is also difficult to distinguish an augmented profile from source data without the original tag
(Appendix B Figure B1 and B2).
Figure 4. DeepBioGen, a sequencing profile augmentation procedure that generalizes classifiers to enhance
prediction performance on unseen data. a, Feature-wise clustering of sequencing profiles to form
perceptible visual patterns. b, Training multiple conditional Wasserstein GANs equipped with up-
convolutional and convolutional layers. c, Generating augmented data from the multiple generators of GAN
models and learning classifiers based on the augmented data along with source data to predict unseen data.
d-e, Results of anti-PD1 therapy response prediction on unseen data by the state-of-the-art and baseline
classifiers (gray) and by classifiers generalized with DeepBioGen (red), SMOTE (green), GMM (yellow),
and Random augmentation (blue); Classification algorithms: Support Vector Machine (SVM) and Neural
network (NN) which is a multi-layer perceptron; Evaluation metric: Area under the receiver operating
characteristics (AUROC). f-g, Results of type 2 diabetes prediction on unseen data.
Generalized classification on unseen sequencing profiles
The augmented data derived from the multiple generators of GANs were injected into training data along
with the source data. The training data was used to train three machine learning classifiers, support vector
machine (SVM), an artificial neural network (NN), and random forest (RF) (Figure 4c). The classifiers
were trained to predict non-responders of cancer immunotherapy (anti-PD1) based on RNA-seq gene
expression profiles or type 2 diabetes based on human gut microbiome profile.
To validate the generalizability of the classifiers, test (unseen) data were secured from studies that are
independent of the source studies. Classification performances on test data were evaluated using an area
under the receiver operating characteristics (AUROC) and an area under the precision-recall curve
(AUPRC). State-of-the-art predictors, TIDE [73] and IMPRES [74] for predicting patient response to anti-
PD1 therapy, and DeepMicro [75] for using deep representations of microbiome data to predict disease
states, were compared to DeepBioGen. Besides, widely-used data augmentation techniques, such as
Gaussian Mixture Model (GMM) [76] and Synthetic Minority Over-sampling Technique (SMOTE) [77],
were used to generate augmented data for comparison. The classifiers trained only on source data were used
as the baseline comparison.
Remarkably, DeepBioGen-based classifiers surpass not only state-of-the-art classifiers but also classifiers
that are trained on augmented data generated by different augmentation methods in both immunotherapy
response (Figure 4d-e and Appendix B Figure B3) and diabetes predictions (Figure 4f-g and Appendix B
Figure B3). Notably, even though the DeepBioGen-based classifiers have no clue about the test data, they outperform Gide et al.'s immune marker classifier (AUROC=0.77), which directly leverages the test data through
differential expression analysis [78]. Especially, DeepBioGen provides a stable performance boost to SVM
and NN classifiers for both problems as the augmentation rate increases. RF classifiers partially benefit
from DeepBioGen, showing generally worse performance than SVM and NN classifiers (Appendix B
Figure B4). Consistently, DeepBioGen reduces ℋ-divergence between the source data and the test data
more than other augmentation methods (Table 4).
Table 4. ℋ-divergence between source and test data

Data type                           DeepBioGen   SMOTE   GMM     Random
RNA-seq tumor expression profile    0.368        0.688   0.512   0.888
WGS human gut microbiome profile    0.268        0.288   0.352   0.858
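A common way to estimate such a divergence empirically is the proxy A-distance: train a domain classifier to separate source from test samples and convert its error into a divergence score, with low values meaning the two domains are hard to tell apart. The sketch below uses a nearest-centroid domain classifier purely for illustration; the estimator used in the study may differ.

```python
import numpy as np

def proxy_h_divergence(source, target):
    """Proxy divergence 2 * (1 - 2 * error) of a simple domain classifier.

    Nearest-centroid classification, trained and evaluated on the same
    points, is used here only to keep the sketch self-contained."""
    X = np.vstack([source, target])
    y = np.array([0] * len(source) + [1] * len(target))
    c0, c1 = source.mean(axis=0), target.mean(axis=0)
    pred = (np.linalg.norm(X - c1, axis=1) < np.linalg.norm(X - c0, axis=1)).astype(int)
    err = np.mean(pred != y)
    return 2.0 * (1.0 - 2.0 * err)

# Perfectly separable domains give the maximal score of 2.0.
print(proxy_h_divergence(np.zeros((10, 2)), np.full((10, 2), 10.0)))  # 2.0
```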
Impact of visual clusters and multiple generators
DeepBioGen uses the elbow method [79] to estimate the optimal number of visual clusters and GANs. To
assess the ability of the approach in inferring the ideal parameters based on source data only, DeepBioGen
models with a varying number of visual clusters or GANs were used to generate the augmented data for
training classifiers. The classification results of unseen data show that the elbow method elicits an optimal
or nearly optimal number of clusters and GANs in both immunotherapy response and diabetes prediction
problems (Appendix B Figure B5-B8).
Notably, the number of clusters has more impact on classification performance than the number of GANs,
suggesting that how sequencing data are clustered and thus presented visually plays a major role in
improving the generalizability of DeepBioGen (Appendix B Figure B5-B8). Results also show that diverse
generators of multiple Wasserstein GANs are more effective in diversifying the augmented sequencing data
than a single generator, thus leading to better generalizability (Appendix B Table B3).
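The elbow heuristic can be sketched as picking the point of greatest change in slope (largest second difference) on a within-cluster-SSE-versus-k curve; this is one common formulation of the method, not necessarily the exact rule used by DeepBioGen.

```python
def elbow_point(ks, sse):
    """Return the k at the elbow of an SSE-vs-k curve, taken as the
    interior point with the largest discrete second difference."""
    curvature = [sse[i - 1] - 2 * sse[i] + sse[i + 1]
                 for i in range(1, len(sse) - 1)]
    return ks[1 + max(range(len(curvature)), key=curvature.__getitem__)]

# SSE drops sharply up to k = 2, then flattens: the elbow is at 2.
ks = [1, 2, 3, 4, 5, 6]
sse = [100.0, 40.0, 15.0, 12.0, 10.0, 9.0]
print(elbow_point(ks, sse))  # 2
```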
Augmentations beyond the boundary of source data
To visualize how DeepBioGen augmented data to generalize classifiers, the source, augmented and test
data were embedded to 2-dimensional space with t-SNE algorithm [80]. In melanoma patient profiles, the
source and test data are placed distantly, while within-cluster data points with different anti-PD1 responses
are located closely in both data clusters (Figure 5a). The data embeddings were plotted separately for two
classes, and an empirical outer boundary of the source data based on the outermost data points heading
toward the test data was drawn with a red dotted line (Figure 5c and 5e). Interestingly, DeepBioGen
generated data points beyond the outer boundaries of the source data cluster (Figure 5d and 5f), whereas
other augmentation methods rarely produced data points that cross the boundaries (Appendix B Figure B7-
B9).
In microbiome profiles of healthy controls and diabetic patients, the test data cluster resides in the side
region of the source data cluster, thus depicting a moderately shifted distribution (Appendix B Figure B12).
DeepBioGen produced augmented microbiome profiles across boundaries of the source data cluster.
Particularly, the outermost augmented data points beyond the source boundaries are closely placed with
test data points that cross the border (Appendix B Figure B12), while other methods rarely generate data
points overpassing the boundaries (Appendix B Figure B13-B15).
Figure 5. t-SNE visualization of augmented tumor expression profiles derived from DeepBioGen along
with the source (grey), augmented (green), and test (unseen, red) data of melanoma patients treated with
anti-PD1 therapy. a, The source and test data. b, The source, test, and augmented data. c, Responders of the
source and test data; An empirical boundary of responders of source data (red dotted line). d, Responders
of the source, test, and augmented data. e, Non-responders of the source and test data; An empirical
boundary of non-responders of source data (red dotted line). f, Non-responders of the source, test, and
augmented data.
Progression-free survival analysis of predicted anti-PD1 treatment responders
For the predicted responder (PR) and non-responder (PNR) patients to anti-PD1 treatment determined by
DeepBioGen-supported SVM classifier, progression-free survival analysis was conducted to estimate the
clinical outcome. For comparison, state-of-the-art classifiers based on genomic signatures, IMPRES and
TIDE, were evaluated with the same analysis. With the DeepBioGen classifier or IMPRES, the PR group
has significantly longer progression-free survival than the PNR group (Figure 6a and 6b),
whereas the two TIDE-predicted groups do not show a significant difference.
Importantly, the median survival time of PRs classified by the DeepBioGen classifier was 755 days (95%
CI [335, N/A]), compared to 440 days (95% CI [125, N/A]) for the IMPRES-classified PRs. Also, the
DeepBioGen classifier tends to be more sensitive in predicting responders than IMPRES, likely posing a
lower risk of unnecessary treatment suggestions, which are often accompanied by side effects (Figure 6
and Table 5).
Figure 6. Kaplan-Meier plots of progression-free survival for predicted responder (PR) and non-responder
(PNR) patients determined by three classifiers. a, generalized SVM classifier with DeepBioGen
augmentations. b, IMPRES. c, TIDE.
Table 5. Summary statistics for progression-free survival analysis

Classifier | Prediction | N | Median survival time (days) | 95% CI | MR* | HR** | 95% CI (HR) | P-value
DeepBioGen-SVM | PR | 27 | 755 | [335, NA] | 9.21 | 3.72 | [1.88, 7.36] | < 0.001
DeepBioGen-SVM | PNR | 23 | 82 | [76, 125] | | | |
IMPRES | PR | 40 | 440 | [125, NA] | 5.71 | 3.47 | [1.66, 7.49] | 0.002
IMPRES | PNR | 10 | 77 | [58, NA] | | | |
TIDE | PR | 40 | 231 | [82, 870] | 0.76 | 0.99 | [0.45, 2.17] | > 0.9
TIDE | PNR | 10 | 303 | [96, NA] | | | |
*Median ratio; **Hazard ratio
2.3 Discussion
DeepBioGen is unique in that it takes input sequencing profiles in a machine-understandable visual form,
whereas visualization of sequencing data (e.g., a heatmap of differentially expressed genes) has typically
been used to present findings in a human-understandable manner. One potential advantage of feeding
DeepBioGen visually recognizable data is that visual patterns difficult to identify with the human eye may
be captured and characterized in the embedding space.
Even with a limited amount of source data, DeepBioGen can alleviate batch effects across independent
studies, without requiring details needed for batch correction such as sample cohorts, lab environments, and
experimental protocols, by reducing the gap between the source and unseen data. DeepBioGen is also highly
extensible to other biological data whose feature dependency is not negligible.
2.4 Methods
Sequencing profiles and pre-processing
Clinical genomic data containing RNA-seq tumor expression profiles of melanoma patients and their
responsiveness to anti-PD1 therapy were obtained from three independent studies [78, 81, 82] (Appendix B
Table B1). Fifty samples from the most recent study were used as test data and the rest were used as source
data. RNA-seq read counts were normalized to transcripts per million (TPM) and then log2-transformed.
To focus on genes related to primary mechanisms of tumor immune evasion, recently identified T cell
signature genes [73], such as regulators of T cell dysfunction and suppressors of T cell infiltration into the
tumor, were selected out of 18,570 common genes across the studies. In total, 702 genes were considered
as features of initial inputs.
Human gut metagenomic sequencing reads of type 2 diabetic patients and healthy controls were acquired
from two independent studies: one on the Chinese cohort [32] and the other on the European women cohort
[31] (Appendix B Table B1). Using MetaPhlAn2 [23], strain-level marker profiles were extracted from the
metagenomic samples. In total, the number of common strain-level markers that are considered as initial
features was 74,240. The European samples in the more recent study were used as test data and Chinese
samples as source data.
Formation of visual patterns from sequencing profiles
Each measurement in source data was standardized by subtracting the mean and dividing by the standard
deviation. The same standardization was applied to test data using the mean and standard deviation of
source data. To meet the dimensional requirement of the pre-defined input layer, the extremely randomized
trees [83] feature selection algorithm was applied to the source data to select 256 features. The k-means
clustering algorithm was used to cluster features. Based on the elbow point where the decrease in the
within-cluster sum of squared errors (WSS) begins to level off, the optimal number of clusters was
determined to be 4 for RNA-seq tumor expression profiles and 6 for human gut microbiome profiles
(Appendix B Figure B16). The selected features were then sorted and rearranged by cluster label so that
similar features are placed nearby. The features of the test data were rearranged in the same order.
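The pre-processing steps above can be sketched with scikit-learn as follows; the function name, the forest size, and the small epsilon guarding against zero variance are illustrative assumptions, not the original implementation:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.cluster import KMeans

def form_visual_patterns(X_src, y_src, X_test, n_features=256, n_clusters=4, seed=0):
    """Standardize, select features, and place similar features side by side."""
    # Standardize with source statistics only; apply the same transform to test data.
    mu, sd = X_src.mean(axis=0), X_src.std(axis=0) + 1e-8
    X_src = (X_src - mu) / sd
    X_test = (X_test - mu) / sd

    # Select the top features by extremely-randomized-trees importance.
    forest = ExtraTreesClassifier(n_estimators=100, random_state=seed).fit(X_src, y_src)
    top = np.argsort(forest.feature_importances_)[::-1][:n_features]

    # Cluster the selected features (k-means on feature vectors, i.e., transposed data)
    # and reorder them so that features in the same cluster are adjacent.
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(X_src[:, top].T)
    order = top[np.argsort(labels, kind="stable")]
    return X_src[:, order], X_test[:, order]
```

The number of clusters would be chosen beforehand via the WSS elbow, as described above.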
Augmentation of sequencing profiles based on their visual patterns
DeepBioGen captures local visual patterns of sequencing profiles by training conditional Wasserstein GAN,
whose generator and critic networks are composed of up-convolutional and convolutional layers,
respectively. The generator tries to generate realistic images enough to fool the critic, whereas the critic
tries to assign higher values for real images than for generated images. During training, the generator and
the critic progressively become better at their jobs by competing against each other. This adversarial
training can be conducted by optimizing a minimax objective. The Wasserstein distance (or Earth Mover's
distance), formulated via Kantorovich-Rubinstein duality, is used in the objective to better reach a Nash
equilibrium [84]. Also, a gradient penalty is applied to the objective function to enforce the Lipschitz
constraint, alleviating potential instability in the critic [85]. Generator function 𝐺 and critic function 𝐶 are
conditioned on the class label 𝑦 and the final objective function of conditional Wasserstein GAN is as
follows:
min_𝐺 max_𝐶  𝔼_{𝑧∼𝑝(𝑧)}[𝐶(𝐺(𝑧|𝑦))] − 𝔼_{𝑥∼𝑃_𝑟}[𝐶(𝑥|𝑦)] − 𝔼_{𝑥̂∼𝑃_𝑥̂}[(‖∇_𝑥̂ 𝐶(𝑥̂|𝑦)‖₂ − 1)²]
where 𝑧 denotes a random noise vector drawn from the noise distribution 𝑝(𝑧), 𝑥 a real profile drawn
from the real data distribution 𝑃_𝑟, and 𝑥̂ ∼ 𝑃_𝑥̂ a point sampled uniformly along straight lines connecting the
real data distribution 𝑃_𝑟 and the output distribution of the generator 𝑃_𝑔 = 𝐺(𝑧|𝑦). The gradient penalty term
directly constrains the norm of the critic's gradient with respect to its input, enforcing the Lipschitz constraint
along the straight lines.
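As a minimal numerical illustration of the gradient penalty term, the sketch below evaluates it for a toy linear critic C(x) = w·x, for which the gradient at every interpolated point x̂ is simply w; the function and variable names are hypothetical and this is not the DeepBioGen implementation:

```python
import numpy as np

def gradient_penalty_linear(w, x_real, x_fake, rng):
    """Gradient penalty E[(||grad C(x_hat)||_2 - 1)^2] for a linear critic C(x) = w.x.

    x_hat is sampled uniformly on straight lines between real and fake points;
    for a linear critic the gradient at any point is w, so the penalty reduces
    to (||w||_2 - 1)^2 regardless of the interpolation.
    """
    eps = rng.uniform(size=(x_real.shape[0], 1))
    x_hat = eps * x_real + (1.0 - eps) * x_fake  # points on connecting lines
    grads = np.tile(w, (x_hat.shape[0], 1))      # dC/dx = w everywhere
    norms = np.linalg.norm(grads, axis=1)
    return float(np.mean((norms - 1.0) ** 2))
```

A critic with a unit-norm weight vector incurs zero penalty, matching the 1-Lipschitz target; in a real GAN the gradient would be obtained by automatic differentiation rather than in closed form.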
The architecture of neural networks that approximate generator function 𝐺 and critic function 𝐶 is
illustrated in Appendix B Figure B17. The generator begins with two input layers, one for receiving a
random noise vector and the other for a class label, followed by dense and embedding layers. Embedded
random noise vector and label vector are reshaped and concatenated. Subsequently, two up-convolutional
blocks, composed of an up-convolutional layer, batch normalization layer, and Leaky ReLU activation layer,
perform inverse convolution operations. Lastly, the final up-convolutional layer produces the generated
sequencing profile. Note that each sequencing profile is treated as a 1×256-pixel image in a single
channel. Similarly, the critic has two input layers, one for sequencing profile and the other for a class label,
which is embedded, reshaped, and concatenated onto the sequencing profile vector. The two consecutive
convolutional blocks, each of which consists of a convolutional layer, Leaky ReLU activation, and dropout
layer, are followed by the output layer with a single unit. Across the generator and critic, the alpha value of
Leaky ReLU is set to 0.3, and the dropout rate is set to 0.3.
To achieve better generalization, multiple clones of the GAN are trained in the same way except for initial
weights in the neural networks. The number of desired GANs is estimated by approximating modes of
samples with the elbow method under the assumption that most modes are generated if the number of
generators is at least as many as the number of modes in source data (Appendix B Figure B18). Individual
generators produce the same number of augmented data points.
Generalized predictions on unseen sequencing profiles
To generalize classifiers predicting clinical outcomes or disease states to unseen data, three classifiers, SVM,
NN, and RF, were built on training data composed of source and augmented sequencing profiles. Hyper-
parameters of the classifiers were optimized based only on source data with a 5-fold cross-validation
scheme. Grid search was applied to explore hyper-parameter space (see details in Appendix B Table B2).
With the best hyper-parameters, prediction models were trained on the pooled source and augmented data.
The generalizability and performance of the prediction models were evaluated on the unseen test data using
AUROC and AUPRC. The performance evaluation was repeated while gradually changing the augmentation
rate, i.e., the ratio of the size of the augmented data to that of the source data.
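The evaluation protocol above can be sketched as follows, assuming scikit-learn and an SVM classifier; the function name, forest of hyper-parameters, and synthetic data in the usage note are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

def train_generalized_classifier(X_src, y_src, X_aug, y_aug, X_test, y_test, param_grid):
    """Tune on source data only, train on pooled source + augmented data,
    and evaluate generalizability on unseen test data."""
    # Hyper-parameter search with 5-fold cross-validation on the source data.
    search = GridSearchCV(SVC(probability=True), param_grid, cv=5, scoring="roc_auc")
    search.fit(X_src, y_src)
    # Refit the best model on the pooled source and augmented data.
    model = search.best_estimator_.fit(np.vstack([X_src, X_aug]),
                                       np.hstack([y_src, y_aug]))
    # Evaluate on the unseen test set.
    scores = model.predict_proba(X_test)[:, 1]
    return model, roc_auc_score(y_test, scores)
```

In the actual study this loop would be repeated over augmentation rates and over the SVM, NN, and RF algorithms.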
For comparison, state-of-the-art classifiers designed to work on unseen data, including TIDE [73], IMPRES
[74], and DeepMicro [75], were evaluated on test data. TIDE predicts anti-PD1 responsiveness of
melanoma patients based on genome-wide expression signatures of T cell dysfunction and exclusion. To
satisfy its requirement, the test data without filtering out any genes from the original data was submitted to
the TIDE response prediction web service. IMPRES is a predictor of anti-PD1 response in melanoma
patients, which is a rule-based classifier manually built based on gene expression relationships between
immune checkpoint gene pairs. Its source code was utilized to evaluate the performance of IMPRES on the
test data. DeepMicro is a deep representation learning framework for improving predictors based on
microbiome profiles. The source data was utilized to learn a low-dimensional representation of the
microbiome data, and classifiers were then trained on the representation and evaluated on the test data.
Furthermore, as an alternative to DeepBioGen, widely-used data augmentation approaches, including
GMM [76] and SMOTE [77], as well as statistics-based random augmentation were evaluated. An
independent GMM model was fitted for each class label, and the optimal number of components in the
GMM model was estimated with the Bayesian information criterion (BIC). SMOTE derives the generated
samples from linear combinations of nearest neighboring samples. Random augmentation draws data points
from the normal distribution whose mean and standard deviation are the same as those of the source data,
assigning an arbitrary class label. Also, as a baseline comparison, machine learning classifiers that are
trained only on source data (i.e., no augmented data) were evaluated on test data.
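A minimal sketch of the statistics-based random augmentation baseline described above (feature-wise normal sampling with arbitrary class labels); the function and argument names are hypothetical:

```python
import numpy as np

def random_augmentation(X_src, class_labels, n_new, rng):
    """Draw synthetic profiles feature-wise from N(mean, std) of the source data
    and assign each an arbitrary class label."""
    mu, sd = X_src.mean(axis=0), X_src.std(axis=0)
    X_new = rng.normal(loc=mu, scale=sd, size=(n_new, X_src.shape[1]))
    y_new = rng.choice(class_labels, size=n_new)  # arbitrary labels
    return X_new, y_new
```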
To understand the impact of generalization on reducing the discrepancy between the source and test data,
a classifier-induced divergence measure, ℋ-divergence, was determined with various classifiers. For a
given set of binary hypotheses ℋ ⊆ {ℎ: 𝑋 → {0,1}}, ℋ-divergence is twice the largest possible difference
between the probabilities of being classified as 1 under the source and test distributions [86, 87]. More formally, the
empirical ℋ-divergence can be written as:
𝑑_ℋ(𝐷_𝑆, 𝐷_𝑇) = 2 sup_{ℎ∈ℋ} |𝑃_{𝑥∼𝐷_𝑆}[ℎ(𝑥) = 1] − 𝑃_{𝑥∼𝐷_𝑇}[ℎ(𝑥) = 1]|

where 𝐷_𝑆 and 𝐷_𝑇 are the source and test data, respectively, and

𝑃_{𝑥∼𝐷}[ℎ(𝑥) = 1] = |{𝑥 : 𝑥 ∈ 𝐷, ℎ(𝑥) = 1}| / |𝐷|
As a proxy for ℋ for each augmentation method, all classifiers obtained by varying the augmentation rate
and the classification algorithm on the augmented training data were included in the set of binary hypotheses.
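Once a finite set of trained hypotheses is fixed, the empirical ℋ-divergence above can be computed directly; a minimal sketch, assuming each hypothesis is a function mapping a batch of samples to 0/1 labels:

```python
import numpy as np

def empirical_h_divergence(hypotheses, D_S, D_T):
    """Empirical H-divergence: 2 * sup_h |P_S[h(x)=1] - P_T[h(x)=1]|.

    `hypotheses` is a finite proxy for the hypothesis class H; each element
    maps an (n, d) sample array to an array of 0/1 labels.
    """
    gaps = [abs(float(np.mean(h(D_S))) - float(np.mean(h(D_T)))) for h in hypotheses]
    return 2.0 * max(gaps)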
Impact of multiple generators on the diversity of generated sequencing profiles
Wasserstein GAN may suffer less from mode collapse than the original GAN, which relies on Jensen-Shannon
divergence in its loss term [84]. However, a single Wasserstein GAN may not be able to produce all modes
of data, and it can be hypothesized that multiple Wasserstein GANs may increase the diversity of augmented
sequencing profiles. To evaluate the diversity of the augmented profiles generated with multiple
Wasserstein GANs, the adapted inception score is used. Originally, the inception score was introduced to
evaluate the quality and diversity of generated images based on the predicted class probability distributions
derived from a pre-trained Inception v3 model [88]. More recently, Gurumurthy et al. suggested a modified
inception score considering within-class diversity of the generated data [89], and this scoring method is
used in the current evaluation. Also, following the note that generators of non-ImageNet data should not be
evaluated with the Inception v3 classifier [90], the Inception model was replaced with the best-performing
baseline classifier trained only on source data. Consequently, the adapted inception score ranges from 1 to 2, and the higher
the score, the better the diversity and quality of the augmented profiles.
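For reference, the standard inception score computed from a classifier's predicted class probabilities can be sketched as below; note this is the original formulation, whereas the modified score of Gurumurthy et al. additionally accounts for within-class diversity and differs in detail. With a binary classifier this quantity also lies in [1, 2]:

```python
import numpy as np

def inception_style_score(probs, eps=1e-12):
    """Standard inception score exp(E_x[KL(p(y|x) || p(y))]) from predicted
    class probabilities (one row per generated sample).

    For a binary classifier the score ranges from 1 (unconfident or collapsed
    output) to 2 (confident predictions evenly spread over both classes).
    """
    marginal = probs.mean(axis=0)  # p(y), averaged over generated samples
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))
```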
t-SNE visualization of the augmented data
To visualize how augmented data is arranged in a high-dimensional space, the augmented data along with
source and test data was embedded into a 2-dimensional space using t-SNE. Also, a class-specific boundary
of the source data cluster facing the test data cluster in the embedded space was drawn with one or two
straight lines through the outermost data points of the source data cluster.
Progression-free survival analysis
The Kaplan-Meier plots were drawn to conduct progression-free survival analysis for predicted responder
and non-responder patients. For each classifier, a receiver operating characteristic (ROC) curve was used
to determine the cut-off value of predictions. The point on the ROC curve closest to (0, 1) was chosen,
identifying the threshold that best balances the true-positive and false-positive rates. The log-rank
test was used to validate statistical significance.
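Choosing the cut-off closest to (0, 1) on the ROC curve can be sketched with scikit-learn; the function name is hypothetical:

```python
import numpy as np
from sklearn.metrics import roc_curve

def closest_to_topleft_threshold(y_true, scores):
    """Pick the prediction cut-off at the ROC point closest to (0, 1),
    i.e., the point best balancing true-positive and false-positive rates."""
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    idx = np.argmin(np.hypot(fpr, 1.0 - tpr))  # Euclidean distance to (0, 1)
    return thresholds[idx]
```

Patients scoring at or above the returned threshold would be labeled predicted responders.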
Chapter 3: Deep generalized interpretable autoencoder elucidates gut
microbiota for better cancer immunotherapy
3.1 Introduction
Recent studies have found that the composition of the gut microbiome modulates the response to cancer
immunotherapies [91-93]. Immune checkpoint inhibitors (ICIs), which block immunosuppressive molecules
of tumor cells and thereby induce a host immune response, are highly effective for only a subset of patients
(~40%) [94]. The gut microbiome has been reported as a major extrinsic modulator of responses to ICIs
such as anti-PD-1. In mice, fecal microbiota transplantation (FMT) from responders to nonresponders
promotes the efficacy of anti-PD-1 therapy in nonresponders [91-93]. More recently, first-in-human clinical
trials observed the clinical benefit of responder-derived FMT in melanoma patients [95, 96]. Although a
favorable gut microbiome is associated with response to anti-PD-1 therapy, its composition and the specific
mechanisms affecting host immune response remain unclear [97].
Determining the key microbiota affecting individual responses to cancer treatment is crucial for advancing
precision oncology. However, this is challenging due to the limited available data sets and the consequent
lack of generalizability of statistical and machine learning models. For example, multiple studies on small
melanoma cohorts have reported gut bacteria associated with response to ICI therapy [91, 92, 98-100], but
unfortunately, there are discrepancies in the findings [97]. Many bacteria reported by those studies did not
appear in multiple studies at the species level except Faecalibacterium prausnitzii and Bacteroides
thetaiotaomicron. Also, a previous attempt to train machine learning classifiers on microbiome profiles
showed relatively low accuracy in predicting ICI response on unseen data [101]. This suggests the
need for curation of massive-scale studies to obtain statistical power to generalize microbial signatures to
unseen data.
Nevertheless, recent advances in artificial intelligence, especially deep learning models for domain
generalization may hold promise in generalizing microbial signatures. Domain generalization, also called
out-of-distribution generalization, aims at learning models that can be generalized to an unseen domain
without any foreknowledge [102]. Domain generalization techniques usually require data from multiple
domains or sufficient enough to simulate domain shifts, and the limited availability of microbiome data
often restricts the application of the techniques. However, more recent studies proposed data augmentation
approaches, circumventing the limitation. Especially, DeepBioGen showed promise in augmenting limited
sequencing data, including microbiome profiles, and improving the generalizability of classification models.
Well-generalized and accurate deep learning models have the potential to be a key part of clinical decision-
making in precision medicine [103, 104]. Despite their remarkable performance, deep learning models are
usually black boxes that are difficult to interpret, which hampers their adoption in clinical practice, as clinicians
and decision-makers prioritize the explainability of the predictions [105]. Also, interpretable models may
provide useful insight into the underlying mechanisms connecting gut microbiome and host immune
response.
In this study, DeepGeni, a deep generalized interpretable autoencoder, is proposed to unveil the gut
microbiome associated with ICI response (Figure 7). A previous study has shown that a deep autoencoder
can produce a highly effective representation of microbiome profiles [75]. Also, a flexible autoencoder
model has been developed for interpretable autoencoding without a significant loss of reconstruction
accuracy [106]. By augmenting microbiome profiles with DeepBioGen and by introducing explainable
links in the autoencoder, DeepGeni improved the generalizability and interpretability of the learned
representation of microbiome profiles. DeepGeni-based classifiers outperform a state-of-the-art classifier
in predicting ICI response using microbiome profiles. Also, interpretable links of DeepGeni reveal
important taxa for ICI response prediction, and the identified taxa are either associated with prolonged
progression-free survival in melanoma patients treated with ICI therapy or differentially abundant between
responders and non-responders.
Figure 7. Overview of DeepGeni analysis
3.2 Methods
Datasets
Gut microbiome data of melanoma patients treated with ICI therapy were collected from four shotgun
metagenomic studies [91, 92, 99, 107]. This study focused on samples gathered before ICI therapy and
excluded the other samples taken after ICI administration. Patients’ responsiveness to ICI therapy was
evaluated with RECIST 1.1 criteria where complete or partial responses are classified as responders and
stable or progressive disease states as non-responders [108]. Since Peters et al.’s data did not have an
explicit classification of responsiveness, patients with over 6 months of progression-free survival were
regarded as responders and the others as non-responders as suggested by Limeta et al. [101]. In total, 130
melanoma patients (66 responders and 64 non-responders) were used (Table 6).
Raw sequencing reads were filtered with fastp and processed with mOTUs2, a metagenomic operational
taxonomic unit (mOTU) profiler [109, 110]. Processed microbiome profiles containing read counts for each phylogenetic marker
gene and each patient were acquired from Limeta et al. [101]. Read counts were normalized by the total
number of reads for each patient, and then log2-transformed. In total, 7,727 mOTUs (features) were
considered in an initial input.
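The normalization step might be sketched as below; the pseudocount and the counts-per-million-style scaling are assumptions for illustration and may differ from the exact processing used:

```python
import numpy as np

def normalize_counts(counts, pseudocount=1.0):
    """Normalize read counts per patient (row) by total reads, then log2-transform.

    The pseudocount and the CPM-style scaling are assumptions here; they keep
    the log defined for zero-count markers.
    """
    rel = counts / counts.sum(axis=1, keepdims=True)  # relative abundance per patient
    return np.log2(rel * 1e6 + pseudocount)           # counts-per-million-like scale
```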
Table 6. Summary of gut microbiome datasets derived from shotgun metagenomic sequencing

Dataset name | Total samples | Responders | Non-responders | Published year | ICI therapy | Reference
Gopalakrishnan | 25 | 14 | 11 | 2018 | Anti-PD-1 | [91]
Matson | 39 | 15 | 24 | 2018 | Anti-PD-1 | [92]
Frankel | 39 | 19 | 20 | 2017 | Anti-PD-1, Anti-CTLA-4, and both | [99]
Peters | 27 | 18 | 9 | 2019 | Anti-PD-1, Anti-CTLA-4, and both | [107]
Microbiome profile augmentation with DeepBioGen
DeepGeni utilizes DeepBioGen, a sequencing profile augmentation procedure that generalizes the
subsequent trainable models with the augmented data (Figure 7a). Visual patterns of source microbiome
profiles are established with feature selection followed by feature-wise clustering. Wasserstein generative
adversarial network (GAN) equipped with convolutional layers capturing the visual patterns generates
realistic profiles and augments source data. The augmented training data can enhance the generalizability
of subsequent models, such as machine learning classifiers, to unseen data. In this study, DeepBioGen
parameters were set to their default values or otherwise configured following the guideline described in
the original paper. Test data was excluded from any estimation of the parameters. Out of 7,727 mOTU features, 256
features were selected by fitting extremely randomized trees on source data [83]. The number of feature-
wise clusters and the number of GAN models were estimated by calculating the within-cluster sum of
squared errors in source data with reduced features.
Generalized autoencoder with interpretable links
An autoencoder consists of encoder and decoder functions that are approximated by neural networks. The
encoder maps the input data points into latent space and the decoder reconstructs the input from the mapped
latent representations. During training, the autoencoder tries to minimize the gap between the input and the
reconstruction by adjusting the weights of the neural networks based on signals back-propagated from the
reconstruction loss term. Formally, the reconstruction loss can be written as
𝐿(𝑥, 𝑥′) = ‖𝑥 − 𝑥′‖² = ‖𝑥 − 𝑓′_𝜃(𝑓_𝜙(𝑥))‖²,
where 𝑥 and 𝑥′ are the input and the reconstruction, 𝑓𝜙(∙) and 𝑓′𝜃(∙) are encoder and decoder functions in
which 𝜙 and 𝜃 are their weights, respectively. The latent representation usually has a smaller dimension
than the original input but it contains concentrated information that can be used to reconstruct the original
input with minimal error. Although the latent representation may hold essential information in a condensed
form, it is not directly interpretable because of the non-linear relationship between latent and original
features.
Svensson et al. suggested a flexible autoencoder model that removes the non-linearity in the decoder function,
opening up the possibility of retaining interpretability without ruining reconstruction quality [106]. The non-linearity of
the autoencoder comes from a non-linear activation function applied to the weighted sum of the preceding
inputs. By removing the activation function in the decoder part, direct linear links from the latent layer to
the output layer can be obtained. In this study, simple autoencoder architectures composed of three dense
layers were utilized: an input layer, a latent layer, and an output layer. The input and output layers have as many
nodes as there are input features. Four different latent layer sizes were examined: 128, 64, 32, and
16. The augmented training data consisting of source and augmented data was used to train the autoencoder.
After training, the encoder part was used to produce latent representations of the augmented training data.
Test data was isolated from any steps of autoencoder training.
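A minimal NumPy sketch of such an autoencoder, with a non-linear encoder and an activation-free (linear) decoder trained by plain gradient descent, is given below; the class name, learning rate, and initialization are illustrative assumptions rather than the DeepGeni implementation:

```python
import numpy as np

class InterpretableAE:
    """Autoencoder with a ReLU encoder and a purely linear decoder, so each
    latent variable connects to every output feature through a single weight
    that can be ranked and interpreted directly."""

    def __init__(self, n_in, n_latent, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0, 0.1, (n_in, n_latent))
        self.b1 = np.zeros(n_latent)
        self.W2 = rng.normal(0, 0.1, (n_latent, n_in))  # interpretable linear links
        self.b2 = np.zeros(n_in)

    def encode(self, X):
        return np.maximum(X @ self.W1 + self.b1, 0.0)  # ReLU encoder

    def decode(self, H):
        return H @ self.W2 + self.b2  # no activation: direct linear links

    def fit(self, X, epochs=200, lr=0.05):
        n = X.shape[0]
        for _ in range(epochs):
            H = self.encode(X)
            err = self.decode(H) - X           # d(MSE)/d(reconstruction), up to a constant
            gW2, gb2 = H.T @ err / n, err.mean(axis=0)
            dH = (err @ self.W2.T) * (H > 0)   # back-prop through the ReLU encoder
            gW1, gb1 = X.T @ dH / n, dH.mean(axis=0)
            self.W1 -= lr * gW1; self.b1 -= lr * gb1
            self.W2 -= lr * gW2; self.b2 -= lr * gb2
        return self

    def loss(self, X):
        return float(np.mean((self.decode(self.encode(X)) - X) ** 2))
```

After training, the rows of W2 play the role of the interpretable links from latent variables to output features described in the next section.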
Generalized latent representations for predicting ICI responses
To estimate the usefulness of the latent representations derived from the generalized autoencoder,
prediction models classifying ICI responses were built on the representations (Figure 7b). Three machine
learning algorithms were used to train the models: support vector machine (SVM), random forest (RF), and
feedforward neural network (NN), i.e., a multi-layer perceptron. Prediction performance was evaluated
in two different validation settings. The first, following the suggestion of Limeta et al., uses the
most recent data set (Peters) as test data and the integration of the rest as source data. The other setting is
cross-study validation that iterates over datasets, leaves one dataset as test data, uses the rest as source data,
and averages over results. In both settings, five-fold cross-validation on the learned representation of source
data was conducted to optimize hyper-parameters of the classification algorithms. Hyper-parameter space
was explored with grid search and the parameter grid is described in Appendix C Table C1. With the best
hyper-parameters, classifiers were trained on representations of the entire source data and evaluated on test
data. The area under the receiver operating characteristic curve (AUC) was used to assess the prediction
performance.
Extracting informative microbiota from interpretable autoencoder
To interpret the latent representations that improve the prediction of ICI response, the most informative
latent variables were selected based on feature importance estimated by extremely randomized trees [83].
The informative signals of the selected latent variables were propagated through direct links in the decoder
network (Figure 7c). Out of 128 latent variables, ten of the most informative variables were considered for
further analysis. For each variable, the links were ranked by the absolute value of their weights and, out of
256 links, the top 20 were selected. After the corresponding output nodes connected to the top 20 links were
mapped to mOTUs in a one-to-one manner, the specified 20 mOTUs were listed into a set of candidates.
By iterating over the ten latent variables, the ten sets of candidates were merged into a unique set of
candidates. For better generalizability, the whole process was repeated four times by dropping one data set
at a time and using the rest. The final list was obtained by taking the intersection of the four candidate sets
and contains 14 mOTUs.
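The link-propagation step can be sketched as follows, given a decoder weight matrix (latent × output) and per-latent-variable importances; the function name and shapes are illustrative:

```python
import numpy as np

def informative_features(decoder_W, latent_importance, n_latent=10, n_links=20):
    """Propagate signals through the linear decoder links: take the most
    important latent variables, rank each one's outgoing links by absolute
    weight, and pool the connected output features (mOTUs) into one set."""
    top_latent = np.argsort(latent_importance)[::-1][:n_latent]
    candidates = set()
    for i in top_latent:
        top_links = np.argsort(np.abs(decoder_W[i]))[::-1][:n_links]
        candidates.update(int(j) for j in top_links)
    return candidates
```

Repeating this over leave-one-dataset-out runs and intersecting the resulting sets would yield the final list described above.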
Statistical Analysis
To assess the impact of the identified informative mOTUs on ICI responsiveness, progression-free survival
analysis, a primary endpoint of clinical oncology studies, was conducted. The Peters et al. cohort (N=27), which
had continuous follow-up of progression-free survival, was used in the analysis. For each
mOTU, the second quartile (median) was used as a cut-off for high abundance. The Kaplan-Meier plot was
drawn and the log-rank test was conducted to validate statistical significance. Differentially abundant taxa
were tested with the Wilcoxon rank-sum (Mann-Whitney U) test.
3.3 Results
Improved prediction of ICI response with generalized interpretable autoencoder
We evaluated the prediction performance of machine learning classifiers utilizing DeepGeni, a deep
generalized interpretable autoencoder. The classifiers were trained to predict the binary response to ICI treatment
(responder vs. non-responder) based on the latent representation of microbiome profiles. Test data was
excluded from the whole process of generalizing and training the autoencoder, whose encoder part
produces the latent representation. DeepGeni-based classifiers were compared to classifiers trained on three
different settings without augmentation: 1) Initial data of 7,727 mOTU features without feature selection
or latent encoding, 2) Feature selected data (256 mOTU features) without latent encoding, 3) Feature
selected data with latent encoding. For each approach, out of three classification algorithms (SVM, RF, and
NN), the best performing one was selected. Also, the state-of-the-art approach that selects differentially
abundant mOTU features and applies a random forest classification algorithm was included in the
comparison. As an independent validation setting, the most recent study’s data (Peters) was used as test
data and the rest as source data for training classifiers.
Figure 8. Receiver operating characteristics (ROC) curves of the best classifier for each method
Remarkably, the DeepGeni-based NN classifier surpasses not only the state-of-the-art classifier (Limeta et
al.) but also the best classifiers of the other approaches (Figure 8). In addition, the rest of the DeepGeni-based
classifiers (SVM and RF) show better performance than the classifiers of other approaches (Appendix C
Table C2). Also, the DeepGeni-based SVM classifier outperforms other classifiers in the cross-study
validation setting, displaying the highest generalizability across different studies (Table 7).
Table 7. Averaged AUC in cross-study validation setting

Approach | SVM | RF | NN
No FS | 0.520 (0.156) | 0.522 (0.074) | 0.556 (0.070)
FS only | 0.564 (0.107) | 0.551 (0.103) | 0.585 (0.080)
FS + AE | 0.602 (0.060) | 0.570 (0.053) | 0.598 (0.045)
DeepGeni (FS + DBG + AE) | 0.626 (0.209) | 0.579 (0.090) | 0.609 (0.221)
- Values are mean AUC (standard deviation). FS: feature selection; AE: autoencoder; DBG: DeepBioGen
Key microbiota relevant to ICI response extracted from generalized interpretable autoencoder
The final list of ICI-response-relevant key microbiota was identified by propagating informative signals
through the interpretable links from the latent variables that play a major role in inducing the superior ICI
response prediction. The final list, consisting of fourteen mOTUs in seven families, was validated with
previous literature and statistical tests. Previous studies have reported twelve of the fourteen at higher
taxonomic levels; the list thus generally identifies microbiota associated with ICI therapy at a higher
taxonomic resolution (Table 8). Interestingly, two novel ICI-therapy-relevant gut bacteria, Eggerthella lenta
and an unknown Lactobacillales, were identified that were not detected in previous studies. It is worth noting
that the genus Subdoligranulum is closely related to the Faecalibacterium genus. Furthermore, five species,
including Lactobacillus plantarum, an unknown Ruminococcaceae, and three unknown Clostridiales,
displayed statistical significance in differential abundance testing (unadjusted, Wilcoxon's rank-sum test).
In addition, a high abundance of an unknown Eubacterium species was significantly associated with
prolonged progression-free survival in ICI-treated melanoma patients (Figure 9).
Table 8. The final list of ICI-response-relevant key microbiota

mOTU_v2 ID | Consensus taxonomy | Order | Family | Genus | Specified level | Prev. level | H-Res | P-val
ref_mOTU_v2_0036 | Enterobacteriaceae sp. | Enterobacteriales | Enterobacteriaceae | Escherichia/Shigella | Species | Species [92] | - |
ref_mOTU_v2_0154 | Lactobacillus plantarum | Lactobacillales | Lactobacillaceae | Lactobacillus | Species | Family [92] | Yes | *
meta_mOTU_v2_6288 | unknown Lactobacillales | Lactobacillales | unknown | unknown | Family | - | - |
ref_mOTU_v2_0642 | Eggerthella lenta | Eggerthellales | Eggerthellaceae | Eggerthella | Species | - | - |
ref_mOTU_v2_0884 | Anaerotruncus colihominis | Clostridiales | Ruminococcaceae | Anaerotruncus | Species | Family [91] | Yes |
ref_mOTU_v2_4738 | Subdoligranulum sp. | Clostridiales | Ruminococcaceae | Subdoligranulum | Species | Family [91] | Yes |
ref_mOTU_v2_0281 | Ruminococcus lactaris | Clostridiales | Ruminococcaceae | Ruminococcus | Species | Genus [91, 98] | Yes |
meta_mOTU_v2_6557 | unknown Ruminococcaceae | Clostridiales | Ruminococcaceae | unknown | Genus | Family [91] | Yes | **
meta_mOTU_v2_6657 | unknown Eubacterium | Clostridiales | Eubacteriaceae | Eubacterium | Species | Genus [91, 98] | Yes | #
meta_mOTU_v2_5411 | unknown Clostridiales | Clostridiales | unknown | unknown | Family | Order [91] | Yes |
meta_mOTU_v2_5669 | unknown Clostridiales | Clostridiales | unknown | unknown | Family | Order [91] | Yes | *
meta_mOTU_v2_6760 | unknown Clostridiales | Clostridiales | unknown | unknown | Family | Order [91] | Yes | *
meta_mOTU_v2_6795 | unknown Clostridiales | Clostridiales | unknown | unknown | Family | Order [91] | Yes | *
meta_mOTU_v2_7550 | unknown Clostridiales | Clostridiales | unknown | unknown | Family | Order [91] | Yes |

*: p < 0.05, Wilcoxon's rank-sum test on differential abundance; **: p < 0.01, Wilcoxon's rank-sum test; #: p < 0.05, log-rank test on progression-free survival distribution difference. H-Res indicates whether the specified taxonomic level is at a higher resolution than the level previously specified in other studies.
Figure 9. Kaplan-Meier plot of progression-free survival by relative abundance of unknown Eubacterium
species
3.4 Discussion
DeepGeni is a generalized interpretable autoencoder that not only boosts ICI response prediction accuracy
on an independent study but also provides interpretable links to identify informative taxa contributing to
the modulation of ICI response. The improved generalizability of DeepGeni is attributed to the augmented
microbiome data generated by DeepBioGen, a GAN-based data augmentation procedure. The latent
representation learned by the generalized autoencoder with the augmented data enables training classifiers
that are more resilient to unseen data distributions. Also, DeepGeni extracted microbial species informative
for predicting ICI response at a higher taxonomic resolution than other studies. The specified species could
be a helpful basis for establishing ICI-promoting FMT guidelines specifying donors and recipients.
Moreover, the identified species may offer a possibility to develop prebiotics or probiotics targeting
improved outcomes of ICI therapy.
Although this study produces a generalized list of key ICI-response-relevant microbial taxa over the available datasets, the ability to statistically validate the identified taxa is bounded by the size of the available data. Some of the key taxa were identified by taking advantage of out-of-distribution augmented data, and it may not be appropriate to use augmented data for statistical validation. However, these taxa may still be validated in larger data sets once such data become available.
DeepGeni was applied specifically to examine how the microbiome modulates ICI response in this study, but it is readily extensible to other microbiome-driven human phenotypes and even to other types of biological and ecological data, such as genome and metagenome profiles.
Conclusion
In this thesis, various deep learning models were developed to address the limitations of utilizing omics data to promote precision medicine. These models were trained to produce effective secondary data that improves classification performance and the interpretability of predicted outcomes. These achievements may facilitate the adoption of novel classification techniques and, therefore, the establishment of a standard clinical decision-making process in precision medicine. The main deliverables and prospects of this thesis are listed as follows:
1. DeepMicro is publicly available software that offers cutting-edge deep learning techniques for learning meaningful representations of the given data. Researchers can apply DeepMicro to their high-dimensional microbiome data to obtain a robust low-dimensional representation for subsequent supervised or unsupervised learning. For problems such as drug response prediction, forensic human identification, and food allergy prediction using microbiome data, deep representation learning might be useful for boosting model performance. It might also be worthwhile to use the learned representation for clustering analysis: data points in the latent space can be clustered, which may help capture shared characteristics within groups that are not clear in the original data space. DeepMicro has been used for microbiome data but can be extended to various omics data such as genome and proteome data.
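The representation-learning step described above can be illustrated with a minimal sketch: a linear shallow autoencoder trained by plain gradient descent on synthetic data. This is a conceptual stand-in, not DeepMicro's implementation, which supports deeper and nonlinear architectures:

```python
import numpy as np

def train_shallow_autoencoder(X, latent_dim=16, lr=0.02, epochs=300, seed=0):
    """Linear shallow autoencoder trained by full-batch gradient descent.
    Returns encoder/decoder weights and the reconstruction-loss history."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w_enc = rng.normal(0.0, 0.1, (d, latent_dim))
    w_dec = rng.normal(0.0, 0.1, (latent_dim, d))
    losses = []
    for _ in range(epochs):
        z = X @ w_enc                       # encode into the latent space
        x_hat = z @ w_dec                   # decode back to the input space
        err = x_hat - X
        losses.append(float((err ** 2).mean()))
        g_dec = (z.T @ err) / n             # descent directions (up to a constant)
        g_enc = (X.T @ (err @ w_dec.T)) / n
        w_dec -= lr * g_dec
        w_enc -= lr * g_enc
    return w_enc, w_dec, losses

# Stand-in for a high-dimensional profile matrix (samples x features).
X = np.random.default_rng(1).normal(size=(60, 200))
w_enc, w_dec, losses = train_shallow_autoencoder(X)
Z = X @ w_enc  # low-dimensional representation for a downstream SVM/RF/MLP
```

The latent matrix Z then replaces the raw high-dimensional profile as input to the downstream classifier, which is the core idea behind DeepMicro's pipeline.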
2. DeepBioGen provides a framework for effective data augmentation of sequencing profiles that can be used to expand the training data and improve the performance of prediction models on unseen data. It adversarially learns multiple generative models that capture visual signals from the source data. With multiple generators, DeepBioGen generates realistic augmented data beyond the boundary of the source domain. The augmented data can be used to amplify the training data and to train classifiers resilient to unknown domain shifts. Consequently, DeepBioGen can improve the transferability and reproducibility of prediction models without any knowledge of the unseen data. In future work, it is envisioned that the process of forming visual patterns from sequencing profiles can itself be learned with cutting-edge machine learning models, toward better formation of machine-understandable patterns.
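The ensemble-of-generators augmentation scheme can be sketched structurally as follows. Each conditional Wasserstein GAN is replaced here by a naive per-class Gaussian sampler — a toy stand-in, not DeepBioGen's actual generator — to show how samples drawn from multiple generators are pooled with the source data:

```python
import numpy as np

def fit_gaussian_generator(X_class, seed):
    """Toy stand-in for one conditional generator: an independent-Gaussian
    sampler fit to one class of the source data."""
    mu, sd = X_class.mean(axis=0), X_class.std(axis=0) + 1e-6
    rng = np.random.default_rng(seed)
    return lambda n: rng.normal(mu, sd, size=(n, X_class.shape[1]))

def augment(X, y, n_generators=3, n_per_generator=20):
    """Pool samples from an ensemble of per-class generators with the source."""
    y = np.asarray(y)
    xs, ys = [X], [y]
    for label in np.unique(y):
        X_class = X[y == label]
        for g in range(n_generators):       # one ensemble member per seed
            sample = fit_gaussian_generator(X_class, seed=g)
            xs.append(sample(n_per_generator))
            ys.append(np.full(n_per_generator, label))
    return np.vstack(xs), np.concatenate(ys)

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 10))
y = np.repeat([0, 1], 20)
X_aug, y_aug = augment(X, y)  # source plus class-conditional generated samples
```

The amplified training set (X_aug, y_aug) is then fed to the classifier; in DeepBioGen, the per-class samplers are adversarially trained conditional Wasserstein GANs rather than Gaussians.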
3. DeepGeni is a generalized, interpretable autoencoder that not only boosts ICI response prediction accuracy on an independent study but also provides interpretable links that identify informative taxa contributing to the modulation of ICI response. DeepGeni was applied specifically to examine how the microbiome modulates ICI response in this study, but it is readily extensible to other microbiome-driven human phenotypes and even to other types of biological and ecological data, such as genome and metagenome profiles. For these profiles as well, DeepGeni can provide a reasonable explanation of black-box models through interpretable links. In the future, it is envisioned that the interpretable links will be extended into the subsequent classification models and into individualized explanations for each predicted outcome.
Appendix A
Contents
• Figure A1. Performance evaluation scheme
• Figure A2. Disease prediction performance for abundance profile-based models
• Figure A3. Disease prediction performance for different autoencoders based on abundance
profile (assessed with AUC)
• Figure A4. Disease prediction performance of multi-layer perceptron without representation
learning based on marker profile
• Figure A5. Disease prediction performance of multi-layer perceptron without representation
learning based on abundance profile
• Figure A6. Impact of introducing negative samples into the training set on AUC
• Figure A7. Prediction performance changes over the increasing data points in the training set
• Figure A8. Disease prediction performance for marker profile-based models (fixed scale)
• Table A1. The best representation learning model structures for each dataset
• Table A2. Hyper-parameters used in grid search
• Table A3. Performance evaluation with area under precision-recall curve for IBD dataset
Figure A1. Performance evaluation scheme
Figure A2. Disease prediction performance for abundance profile-based models. Prediction performance of various methods built on abundance profile has been assessed with AUC. MetAML utilizes support vector machine (SVM) and random forest (RF), and the superior model is presented (green). Principal component analysis (PCA; blue) and Gaussian random projection (RP; yellow) have been applied to reduce the dimensions of the datasets before classification. DeepMicro (red) applies a shallow autoencoder (SAE), deep autoencoder (DAE), variational autoencoder (VAE), and convolutional autoencoder (CAE) for dimensionality reduction. Then SVM, RF, and multi-layer perceptron (MLP) classification algorithms have been used.
Figure A3. Disease prediction performance for different autoencoders based on abundance profile
(assessed with AUC). Classifiers used: support vector machine (SVM), random forest (RF), and multi-layer
perceptron (MLP); Autoencoders used: shallow autoencoder (SAE), deep autoencoder (DAE), variational
autoencoder (VAE), and convolutional autoencoder (CAE)
Figure A4. Disease prediction performance of multi-layer perceptron without representation learning
based on marker profile
Figure A5. Disease prediction performance of multi-layer perceptron without representation learning
based on abundance profile
Figure A6. Impact of introducing negative samples into the training set on AUC
Figure A7. Prediction performance changes over the increasing data points in the training set
Figure A8. Disease prediction performance for marker profile-based models (fixed scale).
Table A1. The best representation learning model structures for each dataset
Microbiome profile type | Dataset | Size of original dim# | Representation learning model | Encoder structure* | Size of latent dim | Classifier | Averaged AUC (Standard Error) | Averaged Accuracy (Standard Error)**
Strain-level marker profile
IBD 91,756
SAE 64 64 SVM 0.955 (0.013) 0.773 (0.000)
DAE 512-256-128 128 RF 0.911 (0.046) 0.855 (0.027)
VAE 128-4 4 MLP 0.899 (0.039) 0.818 (0.014)
CAE 8-4 1,936 RF 0.929 (0.010) 0.882 (0.011)
EW-T2D 83,456
SAE 256 256 RF 0.899 (0.046) 0.800 (0.047)
DAE 256-128-64 64 RF 0.840 (0.029) 0.730 (0.041)
VAE 256-16 16 SVM 0.853 (0.041) 0.600 (0.039)
CAE 8-4 1,764 SVM 0.796 (0.014) 0.670 (0.030)
C-T2D 119,792
SAE 512 512 SVM 0.762 (0.008) 0.664 (0.021)
DAE 256-128 128 RF 0.702 (0.029) 0.649 (0.019)
VAE 128-16 16 SVM 0.719 (0.019) 0.664 (0.022)
CAE 4-2 968 MLP 0.763 (0.014) 0.710 (0.008)
Obesity 99,568
SAE 512 512 MLP 0.658 (0.045) 0.624 (0.027)
DAE 256-128 128 RF 0.659 (0.034) 0.635 (0.012)
VAE 512-8 8 RF 0.599 (0.014) 0.639 (0.013)
CAE 64-32 16,928 RF 0.622 (0.012) 0.655 (0.008)
Cirrhosis 120,553
SAE 256 256 SVM 0.928 (0.006) 0.821 (0.020)
DAE 512-256-128 128 SVM 0.903 (0.011) 0.809 (0.012)
VAE 256-8 8 SVM 0.891 (0.016) 0.792 (0.029)
CAE 16-8 3,872 SVM 0.940 (0.006) 0.864 (0.008)
Colorectal 108,034
SAE 32 32 MLP 0.799 (0.058) 0.752 (0.039)
DAE 512-256-128 128 MLP 0.803 (0.072) 0.728 (0.046)
VAE 256-8 8 RF 0.737 (0.068) 0.696 (0.037)
CAE 4-2-1 441 MLP 0.789 (0.044) 0.744 (0.033)
Species-level relative abundance profile
IBD 443
SAE 512 512 MLP 0.817 (0.031) 0.782 (0.017)
DAE 512-256 256 MLP 0.779 (0.039) 0.791 (0.037)
VAE 32-8 8 RF 0.779 (0.032) 0.782 (0.017)
CAE 32-16-8 3,872 RF 0.873 (0.030) 0.809 (0.017)
EW-T2D 381
SAE 256 256 SVM 0.640 (0.033) 0.630 (0.037)
DAE 1024-512 512 SVM 0.612 (0.060) 0.580 (0.026)
VAE 64-8 8 RF 0.640 (0.051) 0.570 (0.047)
CAE 16-8 3,200 RF 0.829 (0.039) 0.740 (0.037)
C-T2D 572
SAE 64 64 SVM 0.715 (0.023) 0.635 (0.030)
DAE 128-64 64 SVM 0.711 (0.026) 0.649 (0.026)
VAE 512-16 16 SVM 0.715 (0.031) 0.652 (0.031)
CAE 4-2-1 576 RF 0.725 (0.025) 0.644 (0.025)
Obesity 465
SAE 128 128 MLP 0.645 (0.030) 0.659 (0.017)
DAE 1024-512 512 MLP 0.631 (0.051) 0.612 (0.020)
VAE 256-4 4 MLP 0.600 (0.030) 0.635 (0.012)
CAE 4-2 968 RF 0.674 (0.034) 0.655 (0.013)
Cirrhosis 542
SAE 32 32 SVM 0.801 (0.035) 0.723 (0.050)
DAE 1024-512 512 MLP 0.806 (0.017) 0.706 (0.030)
VAE 512-8 8 SVM 0.781 (0.021) 0.711 (0.035)
CAE 16-8-4 1,461 RF 0.888 (0.011) 0.830 (0.029)
Colorectal 503
SAE 256 256 SVM 0.712 (0.052) 0.672 (0.037)
DAE 256-128 128 SVM 0.728 (0.056) 0.648 (0.046)
VAE 512-8 8 SVM 0.739 (0.070) 0.632 (0.037)
CAE 8-4 2,116 RF 0.809 (0.046) 0.704 (0.020)
#Dim: Dimension; SAE: Shallow Autoencoder; DAE: Deep Autoencoder; VAE: Variational Autoencoder; CAE: Convolutional Autoencoder; SVM: Support Vector Machine; RF: Random Forest; MLP: Multi-layer Perceptron
*The number of units for SAE, DAE, and VAE; the number of filters for CAE; layers are separated by the delimiter "-"
**Note that the models are optimized for AUC, not accuracy; to compare accuracy directly with your models, re-train these models optimizing for accuracy.
Table A2. Hyper-parameters used in grid search
Purpose | Method | Hyper-parameter tuned with grid search | Used values
Learning representation | SAE | Size of latent layer | 32, 64, 128, 256, 512
Learning representation | DAE | Size of latent layer | 32, 64, 128, 256, 512
Learning representation | DAE | # of hidden layers in both encoder and decoder | 1, 2
Learning representation | VAE | Size of latent layer | 4, 8, 16
Learning representation | VAE | # of hidden units in the hidden layers | 32, 64, 128, 256, 512
Learning representation | CAE | # of convolutional layers | 2, 3
Learning representation | CAE | # of filters in the first convolutional layer | 4, 8, 16, 32, 64
Learning classifier | SVM | Penalty parameter C | 2^-5, 2^-3, 2^-1, 2^1, 2^3, 2^5
Learning classifier | SVM | RBF kernel coefficient | 2^-15, 2^-13, 2^-11, 2^-9, 2^-7, 2^-5, 2^-3, 2^-1, 2^1, 2^3
Learning classifier | RF | # of trees (estimators) | 100, 300, 500, 700, 900
Learning classifier | RF | Minimum # of samples in a leaf node | 1, 2, 3, 4, 5
Learning classifier | RF | Split criteria | Gini impurity, information gain
Learning classifier | MLP | # of hidden layers | 1, 2, 3
Learning classifier | MLP | # of hidden units in the first layer | 10, 30, 50, 100
Learning classifier | MLP | Dropout rate | 0.1, 0.3
Learning classifier | MLP | # of epochs | 30, 50, 100, 200, 300
# SAE: Shallow Autoencoder; DAE: Deep Autoencoder; VAE: Variational Autoencoder; CAE: Convolutional Autoencoder; SVM: Support Vector Machine; RF: Random Forest; MLP: Multi-layer Perceptron
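The classifier grids in Table A2 map directly onto scikit-learn's GridSearchCV. The sketch below runs the SVM row on a synthetic stand-in dataset (the thesis tunes on learned microbiome representations, not on this toy data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for a learned low-dimensional representation.
X, y = make_classification(n_samples=120, n_features=20, random_state=0)

# SVM grid from Table A2: C over odd powers of two in [-5, 5],
# RBF gamma over odd powers of two in [-15, 3].
param_grid = {
    "C": [2.0 ** k for k in (-5, -3, -1, 1, 3, 5)],
    "gamma": [2.0 ** k for k in (-15, -13, -11, -9, -7, -5, -3, -1, 1, 3)],
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="roc_auc")
search.fit(X, y)
best = search.best_params_  # the (C, gamma) pair with the highest CV AUC
```

Optimizing with scoring="roc_auc" matches the thesis's choice of AUC as the model-selection criterion.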
Table A3. Performance evaluation with area under precision-recall curve for IBD dataset
Microbiome profile type | Methods | Representation learning | Classifier | AUC* (Standard Error) | AUPRC** (Standard Error)
Strain-level marker profile | DeepMicro | SAE | SVM | 0.9553 (0.013) | 0.8653 (0.035)
Strain-level marker profile | MetAML | - | RF | 0.8918 (0.033) | 0.6770 (0.102)
Strain-level marker profile | PCA-based | PCA | MLP | 0.9223 (0.024) | 0.7965 (0.059)
Strain-level marker profile | RP-based | RP | RF | 0.7882 (0.044) | 0.5461 (0.079)
Species-level abundance profile | DeepMicro | CAE | RF | 0.8659 (0.033) | 0.7020 (0.064)
Species-level abundance profile | MetAML | - | RF | 0.9153 (0.037) | 0.7915 (0.076)
Species-level abundance profile | PCA-based | PCA | RF | 0.8247 (0.034) | 0.6220 (0.021)
Species-level abundance profile | RP-based | RP | RF | 0.7365 (0.052) | 0.4980 (0.075)
# SAE: Shallow Autoencoder; CAE: Convolutional Autoencoder; PCA: Principal Component Analysis; RP: Random Projection; SVM: Support Vector Machine; RF: Random Forest; MLP: Multi-layer Perceptron
*AUC: Area Under the receiver operating characteristic (ROC) Curve
**AUPRC: Area Under the Precision-Recall Curve
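Both metrics defined in the footnotes are available in scikit-learn; average_precision_score is the usual estimator of AUPRC. A sketch on synthetic labels and scores (not the thesis's evaluation code):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=300)                 # binary disease labels
y_score = y_true + rng.normal(0.0, 1.0, size=300)     # informative, noisy scores

auroc = roc_auc_score(y_true, y_score)                # area under the ROC curve
auprc = average_precision_score(y_true, y_score)      # area under the PR curve
```

AUPRC is the more informative of the two when classes are imbalanced, which is why Table A3 reports it alongside AUC for the IBD dataset.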
Appendix B
Contents
• Figure B1. Visualization of tumor gene expression profiles of melanoma patients.
• Figure B2. Visualization of microbiome marker profiles of diabetic patients and healthy controls.
• Figure B3. Prediction performance on unseen data (AUPRC).
• Figure B4. Prediction performance on unseen data with Random forest (RF) classifier.
• Figure B5. Anti-PD1 therapy response prediction performance (AUROC) on unseen data by
varying the number of visual clusters and that of GAN models.
• Figure B6. Anti-PD1 therapy response prediction performance (AUPRC) on unseen data by
varying the number of visual clusters and that of GAN models.
• Figure B7. Type 2 diabetes prediction performance (AUROC) on unseen data by varying the
number of visual clusters and that of GAN models.
• Figure B8. Type 2 diabetes prediction performance (AUPRC) on unseen data by varying the
number of visual clusters and that of GAN models.
• Figure B9. t-SNE visualization of augmented tumor expression profiles derived from Random
augmentation along with the source and test (unseen) data of melanoma patients treated with anti-
PD1 therapy.
• Figure B10. t-SNE visualization of augmented tumor expression profiles derived from GMM
along with the source and test (unseen) data of melanoma patients treated with anti-PD1 therapy.
• Figure B11. t-SNE visualization of augmented tumor expression profiles derived from SMOTE
along with the source and test (unseen) data of melanoma patients treated with anti-PD1 therapy.
• Figure B12. t-SNE visualization of augmented microbiome profiles derived from DeepBioGen
along with the source and test (unseen) data of diabetic patients and healthy controls.
• Figure B13. t-SNE visualization of augmented microbiome profiles derived from Random
augmentation along with the source and test (unseen) data of diabetic patients and healthy
controls.
• Figure B14. t-SNE visualization of augmented microbiome profiles derived from GMM along
with the source and test (unseen) data of diabetic patients and healthy controls.
• Figure B15. t-SNE visualization of augmented microbiome profiles derived from SMOTE along
with the source and test (unseen) data of diabetic patients and healthy controls.
• Figure B16. Feature-wise WSS by the number of clusters.
• Figure B17. Conditional Wasserstein GAN architecture in DeepBioGen.
• Figure B18. Sample-wise WSS by the number of GANs.
• Table B1. Summary of sequencing data sets.
• Table B2. Hyper-parameter grid for optimizing classifiers.
• Table B3. Modified inception scores of generated sequencing profiles varying the number of
conditional Wasserstein GANs.
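Several of the figures listed above are t-SNE projections of high-dimensional profiles into two dimensions. A minimal sketch with scikit-learn, using synthetic stand-ins for the source and unseen profiles (the actual figures use the real expression and microbiome data):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
source = rng.normal(0.0, 1.0, size=(60, 50))   # stand-in for source profiles
unseen = rng.normal(0.5, 1.0, size=(60, 50))   # stand-in for test (unseen) profiles

X = np.vstack([source, unseen])
# Perplexity must be smaller than the number of samples.
embedding = TSNE(n_components=2, perplexity=20.0, random_state=0).fit_transform(X)
```

Plotting the first 60 rows of the embedding in one color and the rest in another reproduces the source-versus-unseen comparison underlying Figures B9-B15.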
Figure B1. Visualization of tumor gene expression profiles of melanoma patients. a-b, The columns are
unordered genes before pre-processing, and each row indicates the profile of responder (a) or non-
responder (b). c-d, The columns are re-ordered genes with 4 clusters derived from feature-wise clustering.
e-f, The augmented profiles generated by DeepBioGen.
Figure B2. Visualization of microbiome marker profiles of diabetic patients and healthy controls. a-b, The columns are unordered marker features before pre-processing, and each row indicates the profile of a healthy control (a) or a type 2 diabetes patient (b). c-d, The columns are re-ordered marker features with 4 clusters derived from feature-wise clustering. e-f, The augmented profiles generated by DeepBioGen.
Figure B3. Prediction performance on unseen data (AUPRC). a-b, Results of anti-PD1 therapy response prediction on unseen data by the state-of-the-art and baseline classifiers (gray) and by classifiers generalized with DeepBioGen (red), SMOTE (green), GMM (yellow), and Random augmentation (blue); Classification algorithms: Support Vector Machine (SVM) and Neural Network (NN), a multi-layer perceptron; Evaluation metric: area under the precision-recall curve (AUPRC). c-d, Results of type 2 diabetes prediction on unseen data.
Figure B4. Prediction performance on unseen data with the Random Forest (RF) classifier. a-b, Results of anti-PD1 therapy response prediction on unseen data by the state-of-the-art and baseline classifiers (gray) and by the classifier generalized with DeepBioGen (red), SMOTE (green), GMM (yellow), and Random augmentation (blue); Evaluation metrics: area under the receiver operating characteristic curve (AUROC) and area under the precision-recall curve (AUPRC). c-d, Results of type 2 diabetes prediction on unseen data; Evaluation metrics: AUROC and AUPRC.
Figure B5. Anti-PD1 therapy response prediction performance (AUROC) on unseen data by varying the
number of visual clusters and that of GAN models. a-c, Varying the number of visual clusters while
fixing the number of GANs as five; Yellow star denotes the estimated number of visual clusters. d-f,
Varying the number of GANs while fixing the number of visual clusters as four; Yellow star denotes the estimated number of GANs.
Figure B6. Anti-PD1 therapy response prediction performance (AUPRC) on unseen data by varying the
number of visual clusters and that of GAN models. a-c, Varying the number of visual clusters while
fixing the number of GANs as five; Yellow star denotes the estimated number of visual clusters. d-f,
Varying the number of GANs while fixing the number of visual clusters as four; Yellow star denotes the estimated number of GANs.
Figure B7. Type 2 diabetes prediction performance (AUROC) on unseen data by varying the number of
visual clusters and that of GAN models. a-c, Varying the number of visual clusters while fixing the
number of GANs as eight; Yellow star denotes the estimated number of visual clusters. d-f, Varying the
number of GANs while fixing the number of visual clusters as six; Yellow star denotes the estimated number of GANs.
Figure B8. Type 2 diabetes prediction performance (AUPRC) on unseen data by varying the number of
visual clusters and that of GAN models. a-c, Varying the number of visual clusters while fixing the
number of GANs as eight; Yellow star denotes the estimated number of visual clusters. d-f, Varying the
number of GANs while fixing the number of visual clusters as six; Yellow star denotes the estimated number of GANs.
Figure B9. t-SNE visualization of augmented tumor expression profiles derived from Random
augmentation along with the source and test (unseen) data of melanoma patients treated with anti-PD1
therapy. a, The source (gray) and test data (red). b, The source, test, and augmented data (green). c,
Responders of the source and test data; An empirical boundary of responders of source data (red dotted
line). d, Responders of the source, test, and augmented data. e, Non-responders of the source and test
data; An empirical boundary of non-responders of source data (red dotted line). f, Non-responders of the
source, test, and augmented data.
Figure B10. t-SNE visualization of augmented tumor expression profiles derived from GMM along with
the source and test (unseen) data of melanoma patients treated with anti-PD1 therapy. a, The source
(gray) and test data (red). b, The source, test, and augmented data (green). c, Responders of the source
and test data; An empirical boundary of responders of source data (red dotted line). d, Responders of the
source, test, and augmented data. e, Non-responders of the source and test data; An empirical boundary of
non-responders of source data (red dotted line). f, Non-responders of the source, test, and augmented data.
Figure B11. t-SNE visualization of augmented tumor expression profiles derived from SMOTE along
with the source and test (unseen) data of melanoma patients treated with anti-PD1 therapy. a, The source
(gray) and test data (red). b, The source, test, and augmented data (green). c, Responders of the source
and test data; An empirical boundary of responders of source data (red dotted line). d, Responders of the
source, test, and augmented data. e, Non-responders of the source and test data; An empirical boundary of
non-responders of source data (red dotted line). f, Non-responders of the source, test, and augmented data.
Figure B12. t-SNE visualization of augmented microbiome profiles derived from DeepBioGen along
with the source and test (unseen) data of diabetic patients and healthy controls. a, The source (gray) and
test data (red). b, The source, test, and augmented data (green). c, Healthy controls of the source and test
data; An empirical boundary of healthy controls of source data (red dotted line). d, Healthy controls of the
source, test, and augmented data. e, Type 2 diabetes patients of the source and test data; An empirical
boundary of type 2 diabetes patients of source data (red dotted line). f, Type 2 diabetes patients of the
source, test, and augmented data.
Figure B13. t-SNE visualization of augmented microbiome profiles derived from Random augmentation
along with the source and test (unseen) data of diabetic patients and healthy controls. a, The source (gray)
and test data (red). b, The source, test, and augmented data (green). c, Healthy controls of the source and
test data; An empirical boundary of healthy controls of source data (red dotted line). d, Healthy controls
of the source, test, and augmented data. e, Type 2 diabetes patients of the source and test data; An
empirical boundary of type 2 diabetes patients of source data (red dotted line). f, Type 2 diabetes patients
of the source, test, and augmented data.
Figure B14. t-SNE visualization of augmented microbiome profiles derived from GMM along with the
source and test (unseen) data of diabetic patients and healthy controls. a, The source (gray) and test data
(red). b, The source, test, and augmented data (green). c, Healthy controls of the source and test data; An
empirical boundary of healthy controls of source data (red dotted line). d, Healthy controls of the source,
test, and augmented data. e, Type 2 diabetes patients of the source and test data; An empirical boundary
of type 2 diabetes patients of source data (red dotted line). f, Type 2 diabetes patients of the source, test,
and augmented data.
Figure B15. t-SNE visualization of augmented microbiome profiles derived from SMOTE along with the
source and test (unseen) data of diabetic patients and healthy controls. a, The source (gray) and test data
(red). b, The source, test, and augmented data (green). c, Healthy controls of the source and test data; An
empirical boundary of healthy controls of source data (red dotted line). d, Healthy controls of the source,
test, and augmented data. e, Type 2 diabetes patients of the source and test data; An empirical boundary
of type 2 diabetes patients of source data (red dotted line). f, Type 2 diabetes patients of the source, test,
and augmented data.
Figure B16. Feature-wise WSS by the number of clusters. a, RNA-seq tumor expression profiles
(optimum: 4). b, WGS human gut microbiome marker profile (optimum: 6).
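The WSS-versus-k curve in Figure B16 is the elbow method: k-means within-cluster sum of squares (inertia) is computed for increasing k, and the knee of the curve is taken as the cluster-number estimate. A sketch on synthetic vectors with four planted clusters (not the thesis's feature data):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Four well-separated synthetic clusters of 40 points each in 5 dimensions.
X = np.vstack([rng.normal(c, 0.3, size=(40, 5)) for c in (0.0, 2.0, 4.0, 6.0)])

# Within-cluster sum of squares (WSS) for k = 1..8.
wss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in range(1, 9)]
```

The curve drops steeply until k reaches the true cluster count (four here) and flattens afterward, which is the knee that Figures B16 and B18 locate.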
Figure B17. Conditional Wasserstein GAN architecture in DeepBioGen. a, A generator network generates realistic profiles from random noise. b, A critic network distinguishes generated profiles from real ones.
Figure B18. Sample-wise WSS by the number of GANs. a, RNA-seq tumor expression profiles (optimum: 5). b, WGS human gut microbiome marker profile (optimum: 8).
Table B1. Summary of sequencing data sets.
Data type | Role | Year | # of samples | # of class 0* | # of class 1** | Sequencing platform | Reference
RNA-seq tumor expression profile | Source | 2016 | 28 | 15 | 13 | Illumina HiSeq 2000 | Hugo et al. [81]
RNA-seq tumor expression profile | Source | 2017 | 98 | 54 | 44 | Illumina HiSeq 2000/2500 | Riaz et al. [82]
RNA-seq tumor expression profile | Test | 2019 | 50 | 30 | 20 | Illumina HiSeq 2500 | Gide et al. [78]
WGS human gut microbiome profile | Source | 2012 | 344 | 174 | 170 | Illumina Genome Analyzer II | Qin et al. [32]
WGS human gut microbiome profile | Test | 2014 | 96 | 43 | 53 | Illumina HiSeq 2000 | Karlsson et al. [31]
*Responders of anti-PD1 therapy for tumor expression profile, or healthy controls for microbiome profile
**Non-responders of anti-PD1 therapy for tumor expression profile, or type 2 diabetes for microbiome profile
Table B2. Hyper-parameter grid for optimizing classifiers.
Classification algorithm | Hyper-parameter | Parameter grid
SVM | Kernel | Linear and radial basis function (RBF)
SVM | Regularization penalty C | 2^-4, 2^-3, 2^-2, 2^-1, 2^0, 2^1, 2^2, and 2^4
SVM | Gamma | 'scale' (= 1 / (n_features * X.var())) and 'auto' (= 1 / n_features)
RF | # of estimators | 2^7, 2^8, 2^9, and 2^10
RF | Maximum # of features for the best split | Square root and log2 of n_features
RF | Split criterion | Gini impurity and information gain
NN | Hidden layers (hidden units) | 3 layers (128, 64, 32), 4 layers (128, 64, 32, 16), and 5 layers (128, 64, 32, 16, 8)
NN | Learning rate | Constant (0.001); invscaling (0.001 / pow(t, power_t), where t is the time step); adaptive (keep the learning rate while the training loss is decreasing, otherwise divide the current learning rate by 5)
NN | Alpha (L2 penalty) | 0.0001, 0.001, 0.01, and 0.1
Table B3. Modified inception scores of generated sequencing profiles varying the number of conditional
Wasserstein GANs.
# of GANs | RNA-seq tumor expression profile | WGS human gut microbiome profile
1 1.0764 1.1923
2 1.0745 1.2046
3 1.0745 1.2079
4 1.0746 1.2048
5 1.0779 1.2060
6 1.0780 1.2061
7 1.0771 1.2071
8 1.0779 1.2042
9 1.0778 1.2031
10 1.0761 1.2035
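For reference, the standard inception score is exp(E_x[KL(p(y|x) || p(y))]), computed from a classifier's posteriors over generated samples; the modified variant used in Table B3 may differ in its details, so the sketch below shows only the standard formula:

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Standard inception score exp(E_x[KL(p(y|x) || p(y))]) from an
    (n_samples, n_classes) matrix of classifier posteriors."""
    p_y = probs.mean(axis=0, keepdims=True)  # marginal label distribution
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))

uniform = np.full((100, 2), 0.5)          # uninformative posteriors -> score 1
confident = np.tile(np.eye(2), (50, 1))   # confident and diverse -> score = #classes
```

Higher scores indicate samples that are individually classified confidently while covering both classes, which is why the score is used to compare generator ensembles of different sizes.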
Appendix C
Contents
• Table C1. Hyper-parameter grid for optimizing classifiers
• Table C2. AUC of the classifiers trained with different approaches
Table C1. Hyper-parameter grid for optimizing classifiers
Classification algorithm | Hyper-parameter | Parameter grid
SVM | Kernel | Linear and radial basis function (RBF)
SVM | Regularization penalty C | 2^-4, 2^-3, 2^-2, 2^-1, 2^0, 2^1, 2^2, and 2^4
SVM | Gamma | 'scale' (= 1 / (n_features * X.var())) and 'auto' (= 1 / n_features)
RF | # of estimators | 2^7, 2^8, 2^9, and 2^10
RF | Maximum # of features for the best split | Square root and log2 of n_features
RF | Split criterion | Gini impurity and information gain
NN | Hidden layers (hidden units) | 3 layers (128, 64, 32), 4 layers (128, 64, 32, 16), and 5 layers (128, 64, 32, 16, 8)
NN | Learning rate | Constant (0.001); invscaling (0.001 / pow(t, power_t), where t is the time step); adaptive (keep the learning rate while the training loss is decreasing, otherwise divide the current learning rate by 5)
NN | Alpha (L2 penalty) | 0.0001, 0.001, 0.01, and 0.1
- SVM: support vector machine; RF: random forest; NN: feedforward neural network
Table C2. AUC of the classifiers trained with different approaches
Approach | SVM | RF | NN
Limeta et al. | - | 0.624 | -
No FS | 0.667 | 0.543 | 0.531
FS only | 0.673 | 0.574 | 0.679
FS + AE | 0.698 | 0.673 | 0.605
DeepGeni (FS + DBG + AE) | 0.744 | 0.673 | 0.772
- FS: feature selection; AE: autoencoder; DBG: DeepBioGen
References
1. Schork, N.J., Personalized medicine: time for one-person trials. Nature, 2015. 520(7549): p. 609-
611.
2. Collins, F.S. and H. Varmus, A new initiative on precision medicine. New England journal of
medicine, 2015. 372(9): p. 793-795.
3. Council, N.R., Toward precision medicine: building a knowledge network for biomedical research
and a new taxonomy of disease. 2011: National Academies Press.
4. Ashley, E.A., et al., Clinical assessment incorporating a personal genome. The Lancet, 2010.
375(9725): p. 1525-1535.
5. Worthey, E.A., et al., Making a definitive diagnosis: successful clinical application of whole exome
sequencing in a child with intractable inflammatory bowel disease. Genetics in Medicine, 2011.
13(3): p. 255-262.
6. Ashley, E.A., Towards precision medicine. Nature Reviews Genetics, 2016. 17(9): p. 507.
7. Lin, E. and H.-Y. Lane, Machine learning and systems genomics approaches for multi-omics data.
Biomarker research, 2017. 5(1): p. 2.
8. Xie, B., et al., MOBCdb: a comprehensive database integrating multi-omics data on breast cancer
for precision medicine. Breast cancer research and treatment, 2018. 169(3): p. 625-632.
9. LeCun, Y., Y. Bengio, and G. Hinton, Deep learning. nature, 2015. 521(7553): p. 436-444.
10. LeCun, Y. and M. Ranzato. Deep learning tutorial. in Tutorials in International Conference on
Machine Learning (ICML’13). 2013. Citeseer.
11. Min, S., B. Lee, and S. Yoon, Deep learning in bioinformatics. Briefings in bioinformatics, 2017.
18(5): p. 851-869.
12. Miotto, R., et al., Deep learning for healthcare: review, opportunities and challenges. Briefings in
bioinformatics, 2018. 19(6): p. 1236-1246.
13. Ravì, D., et al., Deep learning for health informatics. IEEE journal of biomedical and health
informatics, 2016. 21(1): p. 4-21.
14. Russakovsky, O., et al., Imagenet large scale visual recognition challenge. International journal of
computer vision, 2015. 115(3): p. 211-252.
15. Hannouf, M., et al., Cost-effectiveness of using a gene expression profiling test to aid in identifying
the primary tumour in patients with cancer of unknown primary. The pharmacogenomics journal,
2017. 17(3): p. 286-300.
16. Street, W., Cancer Facts & Figures 2019. Am. Cancer Soc, 2018. 76.
17. Cho, I. and M.J. Blaser, The human microbiome: at the interface of health and disease. Nature
Reviews Genetics, 2012. 13(4): p. 260.
18. Huttenhower, C., et al., Structure, function and diversity of the healthy human microbiome. nature,
2012. 486(7402): p. 207.
19. McQuade, J.L., et al., Modulating the microbiome to improve therapeutic response in cancer. The
Lancet Oncology, 2019. 20(2): p. e77-e91.
20. Eloe-Fadrosh, E.A. and D.A. Rasko, The human microbiome: from symbiosis to pathogenesis.
Annual review of medicine, 2013. 64: p. 145-163.
21. Hamady, M. and R. Knight, Microbial community profiling for human microbiome projects: tools,
techniques, and challenges. Genome research, 2009. 19(7): p. 1141-1152.
22. Scholz, M., et al., Strain-level microbial epidemiology and population genomics from shotgun
metagenomics. Nature methods, 2016. 13(5): p. 435.
23. Truong, D.T., et al., MetaPhlAn2 for enhanced metagenomic taxonomic profiling. Nature methods,
2015. 12(10): p. 902-903.
24. Kramer, M.A., Nonlinear principal component analysis using autoassociative neural networks.
AIChE journal, 1991. 37(2): p. 233-243.
25. Nguyen, T.H., et al., Deep learning for metagenomic data: using 2d embeddings and convolutional
neural networks. arXiv preprint arXiv:1712.00244, 2017.
26. Nguyen, T.H., et al., Disease classification in metagenomics with 2d embeddings and deep learning.
arXiv preprint arXiv:1806.09046, 2018.
27. Pasolli, E., et al., Machine learning meta-analysis of large metagenomic datasets: tools and
biological insights. PLoS computational biology, 2016. 12(7): p. e1004977.
28. Cawley, G.C. and N.L. Talbot, On over-fitting in model selection and subsequent selection bias in
performance evaluation. Journal of Machine Learning Research, 2010. 11(Jul): p. 2079-2107.
29. Varma, S. and R. Simon, Bias in error estimation when using cross-validation for model selection.
BMC bioinformatics, 2006. 7(1): p. 91.
30. Qin, J., et al., A human gut microbial gene catalogue established by metagenomic sequencing.
nature, 2010. 464(7285): p. 59.
31. Karlsson, F.H., et al., Gut metagenome in European women with normal, impaired and diabetic
glucose control. Nature, 2013. 498(7452): p. 99-103.
32. Qin, J., et al., A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature,
2012. 490(7418): p. 55-60.
33. Le Chatelier, E., et al., Richness of human gut microbiome correlates with metabolic markers.
Nature, 2013. 500(7464): p. 541.
34. Qin, N., et al., Alterations of the human gut microbiome in liver cirrhosis. Nature, 2014. 513(7516):
p. 59.
35. Zeller, G., et al., Potential of fecal microbiota for early‐stage detection of colorectal cancer.
Molecular systems biology, 2014. 10(11): p. 766.
36. Glorot, X. and Y. Bengio. Understanding the difficulty of training deep feedforward neural
networks. in Proceedings of the thirteenth international conference on artificial intelligence and
statistics. 2010.
37. Kingma, D.P. and M. Welling, Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114,
2013.
38. Li, F., H. Qiao, and B. Zhang, Discriminatively boosted image clustering with fully convolutional
auto-encoders. Pattern Recognition, 2018. 83: p. 161-173.
39. Krizhevsky, A., I. Sutskever, and G.E. Hinton. Imagenet classification with deep convolutional
neural networks. in Advances in neural information processing systems. 2012.
40. Kingma, D.P. and J. Ba, Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980, 2014.
41. Cortes, C. and V. Vapnik, Support-vector networks. Machine learning, 1995. 20(3): p. 273-297.
42. Pearson, K., LIII. On lines and planes of closest fit to systems of points in space. The London,
Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 1901. 2(11): p. 559-572.
43. Bingham, E. and H. Mannila. Random projection in dimensionality reduction: applications to
image and text data. in Proceedings of the seventh ACM SIGKDD international conference on
Knowledge discovery and data mining. 2001. ACM.
44. Dasgupta, S., Experiments with random projection. arXiv preprint arXiv:1301.3849, 2013.
45. Dasgupta, S. and A. Gupta, An elementary proof of the Johnson-Lindenstrauss lemma. International
Computer Science Institute, Technical Report, 1999. 22(1): p. 1-5.
46. Saito, T. and M. Rehmsmeier, The precision-recall plot is more informative than the ROC plot
when evaluating binary classifiers on imbalanced datasets. PLoS one, 2015. 10(3): p. e0118432.
47. Mazurowski, M.A., et al., Training neural network classifiers for medical decision making: The
effects of imbalanced datasets on classification performance. Neural networks, 2008. 21(2-3): p.
427-436.
48. Baker, M., 1,500 scientists lift the lid on reproducibility. Nature, 2016. 533(7604).
49. Bernau, C., et al., Cross-study validation for the assessment of prediction algorithms.
Bioinformatics, 2014. 30(12): p. i105-i112.
50. Castaldi, P.J., I.J. Dahabreh, and J.P. Ioannidis, An empirical assessment of validation practices for
molecular classifiers. Briefings in bioinformatics, 2011. 12(3): p. 189-202.
51. Collins, F.S. and L.A. Tabak, Policy: NIH plans to enhance reproducibility. Nature, 2014.
505(7485): p. 612-613.
52. Mattsson-Carlgren, N., et al., Increasing the reproducibility of fluid biomarker studies in
neurodegenerative studies. Nature communications, 2020. 11(1): p. 1-11.
53. Leek, J.T., et al., Tackling the widespread and critical impact of batch effects in high-throughput
data. Nature Reviews Genetics, 2010. 11(10): p. 733-739.
54. Ganin, Y., et al., Domain-adversarial training of neural networks. The Journal of Machine
Learning Research, 2016. 17(1): p. 2096-2030.
55. Hoffman, J., et al. Cycada: Cycle-consistent adversarial domain adaptation. in Proceedings of
the International Conference on Machine Learning. 2018. ICML.
56. Saenko, K., et al. Adapting visual category models to new domains. in Proceedings of the European
Conference on Computer Vision. 2010. ECCV.
57. Li, H., et al. Domain generalization with adversarial feature learning. in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition. 2018. CVPR.
58. Li, Y., et al. Deep domain generalization via conditional invariant adversarial networks. in
Proceedings of the European Conference on Computer Vision. 2018. ECCV.
59. Matsuura, T. and T. Harada. Domain Generalization Using a Mixture of Multiple Latent Domains.
in Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence. 2020. AAAI.
60. Carlucci, F.M., et al. Domain generalization by solving jigsaw puzzles. in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition. 2019. CVPR.
61. Li, D., et al. Learning to generalize: Meta-learning for domain generalization. in Proceedings of
the Thirty-Second AAAI Conference on Artificial Intelligence. 2018. AAAI.
62. Shankar, S., et al. Generalizing Across Domains via Cross-Gradient Training. in Proceedings of
the International Conference on Learning Representations. 2018. ICLR.
63. Volpi, R., et al. Generalizing to unseen domains via adversarial data augmentation. in Proceedings
of the 32nd International Conference on Neural Information Processing Systems. 2018.
64. Antoniou, A., A. Storkey, and H. Edwards, Data augmentation generative adversarial networks.
arXiv preprint arXiv:1711.04340, 2017.
65. Wong, S.C., et al. Understanding data augmentation for classification: when to warp? in
Proceedings of the International Conference on Digital Image Computing: techniques and
applications. 2016. IEEE DICTA.
66. Zhang, X., et al. Dada: Deep adversarial data augmentation for extremely low data regime
classification. in Proceedings of the International Conference on Acoustics, Speech and Signal
Processing. 2019. IEEE ICASSP.
67. Goodfellow, I., et al., Generative adversarial nets. Advances in neural information processing
systems, 2014. 27: p. 2672-2680.
68. Calimeri, F., et al. Biomedical data augmentation using generative adversarial neural networks. in
International conference on artificial neural networks. 2017. Springer.
69. Sandfort, V., et al., Data augmentation using generative adversarial networks (CycleGAN) to
improve generalizability in CT segmentation tasks. Scientific reports, 2019. 9(1): p. 1-9.
70. Madani, A., et al. Chest x-ray generation and data augmentation for cardiovascular abnormality
classification. in Proceedings of the International Society for Optics and Photonics. 2018.
71. Marouf, M., et al., Realistic in silico generation and augmentation of single-cell RNA-seq data
using generative adversarial networks. Nature communications, 2020. 11(1): p. 1-12.
72. Emilsson, V., et al., Genetics of gene expression and its effect on disease. Nature, 2008. 452(7186):
p. 423-428.
73. Jiang, P., et al., Signatures of T cell dysfunction and exclusion predict cancer immunotherapy
response. Nature medicine, 2018. 24(10): p. 1550-1558.
74. Auslander, N., et al., Robust prediction of response to immune checkpoint blockade therapy in
metastatic melanoma. Nature medicine, 2018. 24(10): p. 1545-1549.
75. Oh, M. and L. Zhang, DeepMicro: deep representation learning for disease prediction based on
microbiome data. Scientific reports, 2020. 10(1): p. 1-9.
76. Reynolds, D.A., T.F. Quatieri, and R.B. Dunn, Speaker verification using adapted Gaussian
mixture models. Digital signal processing, 2000. 10(1-3): p. 19-41.
77. Chawla, N.V., et al., SMOTE: synthetic minority over-sampling technique. Journal of artificial
intelligence research, 2002. 16: p. 321-357.
78. Gide, T.N., et al., Distinct immune cell populations define response to anti-PD-1 monotherapy and
anti-PD-1/anti-CTLA-4 combined therapy. Cancer cell, 2019. 35(2): p. 238-255.e6.
79. Thorndike, R.L., Who belongs in the family? Psychometrika, 1953. 18(4): p. 267-276.
80. Maaten, L.v.d. and G. Hinton, Visualizing data using t-SNE. Journal of Machine Learning
Research, 2008. 9(Nov): p. 2579-2605.
81. Hugo, W., et al., Genomic and transcriptomic features of response to anti-PD-1 therapy in
metastatic melanoma. Cell, 2016. 165(1): p. 35-44.
82. Riaz, N., et al., Tumor and microenvironment evolution during immunotherapy with nivolumab.
Cell, 2017. 171(4): p. 934-949.e16.
83. Geurts, P., D. Ernst, and L. Wehenkel, Extremely randomized trees. Machine learning, 2006. 63(1):
p. 3-42.
84. Arjovsky, M., S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. in
International conference on machine learning. 2017. PMLR.
85. Gulrajani, I., et al., Improved training of wasserstein gans. arXiv preprint arXiv:1704.00028, 2017.
86. Ben-David, S., et al., Analysis of representations for domain adaptation. Advances in neural
information processing systems, 2007. 19: p. 137.
87. Kifer, D., S. Ben-David, and J. Gehrke. Detecting change in data streams. in VLDB. 2004. Toronto,
Canada.
88. Salimans, T., et al. Improved techniques for training GANs. in Proceedings of the 30th
International Conference on Neural Information Processing Systems. 2016.
89. Gurumurthy, S., R. Kiran Sarvadevabhatla, and R. Venkatesh Babu. Deligan: Generative
adversarial networks for diverse and limited data. in Proceedings of the IEEE conference on
computer vision and pattern recognition. 2017.
90. Barratt, S. and R. Sharma, A note on the inception score. arXiv preprint arXiv:1801.01973, 2018.
91. Gopalakrishnan, V., et al., Gut microbiome modulates response to anti–PD-1 immunotherapy in
melanoma patients. Science, 2018. 359(6371): p. 97-103.
92. Matson, V., et al., The commensal microbiome is associated with anti–PD-1 efficacy in metastatic
melanoma patients. Science, 2018. 359(6371): p. 104-108.
93. Routy, B., et al., Gut microbiome influences efficacy of PD-1–based immunotherapy against
epithelial tumors. Science, 2018. 359(6371): p. 91-97.
94. Marcus, L., et al., FDA approval summary: pembrolizumab for the treatment of microsatellite
instability-high solid tumors. Clinical Cancer Research, 2019. 25(13): p. 3753-3758.
95. Baruch, E.N., et al., Fecal microbiota transplant promotes response in immunotherapy-refractory
melanoma patients. Science, 2021. 371(6529): p. 602-609.
96. Davar, D., et al., Fecal microbiota transplant overcomes resistance to anti–PD-1 therapy in
melanoma patients. Science, 2021. 371(6529): p. 595-602.
97. Shaikh, F.Y., J.J. Gills, and C.L. Sears, Impact of the microbiome on checkpoint inhibitor treatment
in patients with non-small cell lung cancer and melanoma. EBioMedicine, 2019. 48: p. 642-647.
98. Chaput, N., et al., Baseline gut microbiota predicts clinical response and colitis in metastatic
melanoma patients treated with ipilimumab. Annals of Oncology, 2017. 28(6): p. 1368-1379.
99. Frankel, A.E., et al., Metagenomic shotgun sequencing and unbiased metabolomic profiling
identify specific human gut microbiota and metabolites associated with immune checkpoint therapy
efficacy in melanoma patients. Neoplasia, 2017. 19(10): p. 848-855.
100. Vétizou, M., et al., Anticancer immunotherapy by CTLA-4 blockade relies on the gut microbiota.
Science, 2015. 350(6264): p. 1079-1084.
101. Limeta, A., et al., Meta-analysis of the gut microbiota in predicting response to cancer
immunotherapy in metastatic melanoma. JCI insight, 2020. 5(23).
102. Wang, J., et al., Generalizing to Unseen Domains: A Survey on Domain Generalization. arXiv
preprint arXiv:2103.03097, 2021.
103. Cammarota, G., et al., Gut microbiome, big data and machine learning to promote precision
medicine for cancer. Nature Reviews Gastroenterology & Hepatology, 2020. 17(10): p. 635-648.
104. Wilkinson, J., et al., Time to reality check the promises of machine learning-powered precision
medicine. The Lancet Digital Health, 2020.
105. Wang, F., R. Kaushal, and D. Khullar, Should health care demand interpretable artificial
intelligence or accept “black box” medicine? 2020, American College of Physicians.
106. Svensson, V., et al., Interpretable factor models of single-cell RNA-seq via variational
autoencoders. Bioinformatics, 2020. 36(11): p. 3418-3421.
107. Peters, B.A., et al., Relating the gut metagenome and metatranscriptome to immunotherapy
responses in melanoma patients. Genome medicine, 2019. 11(1): p. 1-14.
108. Eisenhauer, E.A., et al., New response evaluation criteria in solid tumours: revised RECIST
guideline (version 1.1). European journal of cancer, 2009. 45(2): p. 228-247.
109. Chen, S., et al., fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics, 2018. 34(17):
p. i884-i890.
110. Milanese, A., et al., Microbial abundance, activity and population genomic profiling with mOTUs2.
Nature communications, 2019. 10(1): p. 1-11.