Deep Learning for Enhancing Precision Medicine
Min Oh
Dissertation submitted to the faculty of the Virginia Polytechnic Institute and State University in
partial fulfillment of the requirements for the degree of
Doctor of Philosophy
In
Computer Science and Applications
Liqing Zhang (Chair)
Bert Huang
B. Aditya Prakash
Zhi Sheng
Youngmi Yoon
May 10th, 2021
Blacksburg, Virginia
Keywords: Deep Learning, Precision Medicine, Omics data
© 2021, Min Oh CC BY
ABSTRACT
Most medical treatments have been developed aiming at the best-on-average efficacy for large
populations, resulting in treatments successful for some patients but not for others. This necessitates
precision medicine, which tailors medical treatment to individual patients. Omics data
holds comprehensive genetic information on individual variability at the molecular level and hence
the potential to be translated into personalized therapy. However, the attempts to transform omics
data-driven insights into clinically actionable models for individual patients have been limited.
Meanwhile, advances in deep learning, one of the most promising branches of artificial intelligence,
have produced unprecedented performance in various fields. Although several deep learning-based
methods have been proposed to predict individual phenotypes, they have not established the state
of the practice, due to instability of selected or learned features derived from extremely high
dimensional data with low sample sizes, which often results in overfitted models with high
variance. To overcome the limitation of omics data, recent advances in deep learning models,
including representation learning models, generative models, and interpretable models, can be
considered. The goal of the proposed work is to develop deep learning models that can overcome
the limitation of omics data to enhance the prediction of personalized medical decisions. To
achieve this, three key challenges should be addressed: 1) effectively reducing dimensions of
omics data, 2) systematically augmenting omics data, and 3) improving the interpretability of
omics data.
GENERAL AUDIENCE ABSTRACT
Most medical treatments have been developed aiming at the best-on-average efficacy for large
populations, resulting in treatments successful for some patients but not for others. This necessitates
precision medicine, which tailors medical treatment to individual patients. Biological
data such as DNA sequences and snapshots of genetic activities hold comprehensive information
on individual variability and hence the potential to accelerate personalized therapy. However, the
attempts to transform data-driven insights into clinical models for individual patients have been
limited. Meanwhile, advances in deep learning, one of the most promising branches of artificial
intelligence, have produced unprecedented performance in various fields. Although several deep
learning-based methods have been proposed to predict individual treatment or outcome, they have
not established the state of the practice, due to the complexity of biological data and limited
availability, which often result in overfitted models that may work on training data but not on test
data or unseen data. To overcome the limitation of biological data, recent advances in deep learning
models, including representation learning models, generative models, and interpretable models,
can be considered. The goal of the proposed work is to develop deep learning models that can
overcome the limitation of omics data to enhance the prediction of personalized medical decisions.
To achieve this, three key challenges should be addressed: 1) effectively reducing the complexity
of biological data, 2) generating realistic biological data, and 3) improving the interpretability of
biological data.
Acknowledgments
I found happiness and pleasure in the accomplishments of my Ph.D. program, but from time to time
it was like walking down an endless tunnel, with a chain of frustrations arising from
uncertainties. I was able to get it done because so many people supported and helped me. Above all, I would
like to express my sincere appreciation to my wife, Boram Choi, for her devoted love and support.
Without her support, it would have been impossible to finish my degree. I also express my sincere gratitude
to my mother, Jin-Hyang Kang, for letting me dream big and praying for blessings. It was my great
privilege and pleasure to have Professor Liqing Zhang as my advisor. I appreciate her support and
advice, and especially her being on my side in many uncertain situations. I would like to thank my
committee members, Dr. Youngmi Yoon, Dr. Bert Huang, Dr. B. Aditya Prakash, and Dr. Zhi
Sheng. In particular, Dr. Yoon encouraged me to dream of pursuing a Ph.D. overseas and was
continuously dedicated to inspiring me to succeed. I was lucky enough to have four internships at
Microsoft, where I met smart and warm mentors and managers who helped me grow. Specifically, I
appreciate Dr. Erdal Coşgun, Dr. Alexandra Savelieva, Santhanagopalan Raghavan, Manuel
Schröder, and Rouslan Beletski.
Table of contents
Introduction ......................................................................................................................... 1
Objectives ........................................................................................................................... 4
Chapter 1: Deep representation learning for disease prediction based on microbiome data ............. 6
1.1 Introduction ............................................................................................................... 6
1.2 Methods ..................................................................................................................... 7
1.3 Results ..................................................................................................................... 14
1.4 Discussion ................................................................................................................ 17
Chapter 2: Generalizing predictions to unseen sequencing profiles via visual data
augmentation .................................................................................................................... 19
2.1 Introduction ............................................................................................................. 19
2.2 Results ..................................................................................................................... 20
2.3 Discussion ................................................................................................................ 28
2.4 Methods ................................................................................................................... 28
Chapter 3: Deep generalized interpretable autoencoder elucidates gut microbiota for better
cancer immunotherapy ...................................................................................................... 34
3.1 Introduction ............................................................................................................. 34
3.2 Methods ................................................................................................................... 36
3.3 Results ..................................................................................................................... 40
3.4 Discussion ................................................................................................................ 43
Conclusion ........................................................................................................................ 45
Appendix A ....................................................................................................................... 47
Appendix B ....................................................................................................................... 59
Appendix C ....................................................................................................................... 86
References ......................................................................................................................... 89
Introduction
Most medical treatments have been developed aiming at the best-on-average efficacy for large populations,
resulting in treatments successful for some patients but not for others [1]. This necessitates more
precise medical treatment and prevention strategies that take individual variability into account [2].
Precision medicine refers to the customization of medical treatment tailored to the individual characteristics
of each patient [3]. With a better understanding of the patient’s genetic information, medical decisions can
be personalized and more effective [4, 5]. As a basis enabling precision medicine, omics data, including
genomic, transcriptomic, and metagenomic data, is increasingly being studied [6-8]. Omics data holds
comprehensive genetic information presenting individual variability at the molecular level and hence the
potential to be translated into personalized therapy. However, the attempts to transform omics data-driven
insights into clinically actionable models for an individual patient have been limited.
Meanwhile, advances in artificial intelligence have enabled smarter data-driven approaches for biomedical
research. In particular, deep learning, one of the most promising branches of artificial intelligence, has
produced unprecedented performance in various fields [9]. The major advancements have been in image
and speech recognition as well as natural language processing and language translation. The successes of
deep learning originate from how it learns hierarchical representations of data by increasing the level of
abstraction [10]. Several deep learning-based solutions have been proposed to predict individual phenotypes,
by engineering omics data features using state-of-the-art deep architectures [11-13]. These deep models
outperform traditional machine learning models in terms of accuracy; however, they have not established
the state of the practice, due to instability of selected or learned features derived from extremely high
dimensional data with low sample sizes. Generally, omics data is high-dimensional compared to the number
of samples in most studies. For example, a typical gene expression study measures the activity of tens of
thousands of genes per person, while only a few hundred patients and healthy controls are examined. The
sample-to-dimension ratio can be even much lower in metagenomic data, which contains hundreds of
thousands of strain-level gene markers for only a few hundred or fewer samples. In contrast to typical
deep models trained on a massive amount of data in relatively low-dimensional space in fields such as
image recognition [14], models trained on the limited omics data in high-dimensional space are often
overfitted with high variance. The high-dimensional omics data with a low sample size entails the sparsity
of the data in feature space, limiting generalization of the learned model. As a result, most deep models
based on omics data come with little or no guarantees for reliable decisions on unseen data.
One naïve solution to the problem of high-dimensional data with low sample sizes might be to
collect many more samples than the number of dimensions. However, in general practice, it is nearly
impossible to secure that many samples regardless of the type of omics data. For example, collecting a
gene expression profile with the FDA-approved 2000-GEP test to diagnose a tumor site
costs about $3,300 per person [15]. This test measures the expression levels of over two thousand
genes per person, so approximately 7 million dollars would be needed to acquire 2,000 samples. Ideally, at
least 178,000 samples would be required to achieve the same sample-to-dimension ratio as the MNIST handwritten
digit data set that has enabled successful deep learning models, at a cost of over 587 million US dollars. Even
if the cost issue were resolved, it would be highly unrealistic to recruit that many patients who share the same
medical condition and consent to the use of their samples. For instance, in 2019, about 30,000 cases in the US
were diagnosed with small cell lung cancer, which has the lowest 5-year survival rate (6%) among the types of
lung cancer, the leading cause of cancer death [16]. Even if collecting data from all
the survivors were possible, the number of available cases would be far less than that required.
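The back-of-the-envelope figures above can be checked with a few lines of arithmetic, assuming MNIST's roughly 70,000 images of 28 × 28 = 784 pixels (an assumption on my part; the text does not state MNIST's dimensions):

```python
# Sample-to-dimension ratio of MNIST: ~70,000 images, 28 x 28 = 784 pixels.
mnist_ratio = 70_000 / 784            # ~89.3 samples per dimension

# Matching that ratio for a ~2,000-gene expression test:
n_genes = 2_000
required_samples = mnist_ratio * n_genes   # ~178,600 samples

# At ~$3,300 per profile:
total_cost = required_samples * 3_300
print(round(required_samples), round(total_cost / 1e6))
```

Rounding the sample count down to 178,000 and multiplying by $3,300 gives the figure of just over 587 million dollars quoted above.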
As a feasible solution, traditionally, dimensionality reduction algorithms have been utilized to overcome
the high dimensionality. Dimensionality reduction algorithms aim at mapping original samples in high-
dimensional space into a low-dimensional space. Linear embedding techniques such as principal
component analysis have been widely used, although they usually incur significant information loss, as the
mapped variables explain only a limited amount of the variance in the omics data. Another potential
solution is generating synthetic omics data. Data augmentation techniques have been used to amplify
training data and can regularize prediction models to some extent. However, the regularized models
tend to underperform in external validation, which implies that current data augmentation techniques may not
generalize to an unseen domain whose data distribution is significantly shifted.
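To see the limited-variance issue concretely, here is a minimal sketch, with random data standing in for an omics matrix and PCA computed via SVD (the toy sizes and the use of noise are assumptions for illustration only):

```python
import numpy as np

# Stand-in for an omics matrix: 100 samples x 10,000 features of noise
# (real profiles are structured, but the sample-to-dimension imbalance is similar).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10_000))

# PCA via SVD of the centered matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Embed the samples on the top-k principal components.
k = 10
Z = Xc @ Vt[:k].T

# Fraction of total variance explained by the top-k components:
# with n_samples << n_features, it stays small.
explained = float((S[:k] ** 2).sum() / (S ** 2).sum())
print(Z.shape, round(explained, 3))
```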
Hence, to overcome the limitation of omics data, a more reasonable way should be examined. Currently,
although recent advances in deep learning models have been made in particular fields such as computer
vision and natural language processing, little effort has been devoted to applying them to omics data.
Consequently, most prediction models perform well only on carefully controlled omics data
sets. Thus, the goal of the proposed work is to develop deep learning models that can overcome the limitation
of omics data to enhance the prediction of personalized medical decisions. To achieve this, three key
challenges should be addressed: 1) effectively reducing dimensions of omics data, 2) systematically
augmenting omics data, and 3) improving the interpretability of omics data. The key factor enabling deep
learning to address the challenges is the combination of lessons derived from deep representation learning,
deep generative models, and interpretable deep learning:
• Deep neural networks have been successfully utilized to extract representations better suited
for secondary analysis in various domains, including image recognition and audio translation,
compared to traditional representation learning algorithms. This deep representation learning may
allow us to capture the underlying hidden structure of complex omics data in a low-dimensional
space. By improving the quality of embedding in biologically relevant latent space, it might be
possible to get a better interpretation of data in much lower-dimensional space.
• Deep generative models have been applied to augment image data and have improved image
classification performance. The success in data augmentation with deep generative models could
be transferred to the omics data by learning the probability distribution of omics data and
amplifying training data with synthetic data.
• Deep learning models are usually black-box and their outcomes are difficult to interpret. The lack
of interpretability may prevent the prediction models from being adopted in clinical practice as
clinicians and decision-makers prioritize the explainability of the predictions. Interpretable deep
learning has the potential to provide useful insights for understanding prediction derived from
omics data.
Objectives
The goal of the proposed work is to develop deep learning models that can overcome the limitation of omics
data derived from its high dimensionality to enhance the prediction of personalized medical decisions.
Objective 1: Deep Representation Learning for Disease Prediction Based on Microbiome Data
Human microbiota plays a key role in human health and growing evidence supports the potential use of
microbiome as a predictor of various diseases. However, the high dimensionality of microbiome data, often
in the order of hundreds of thousands of features, combined with low sample sizes, poses a great challenge for
machine learning-based prediction algorithms. This imbalance makes the data highly sparse, preventing models
from learning good predictions. Also, there has been little work on deep learning applications to microbiome data
with a rigorous evaluation scheme. To address these challenges, we propose DeepMicro, a deep
representation learning framework allowing for the effective representation of microbiome profiles.
DeepMicro transforms high-dimensional microbiome data into a robust low-dimensional representation
using various autoencoders and applies machine learning classification algorithms on the learned
representation.
Objective 2: Generalizing Predictions to Unseen Sequencing Profiles via Visual Data Augmentation
Predictive models trained on sequencing profiles often fail to achieve expected performance when
externally validated on unseen profiles. While many factors such as batch effects, small data sets, and
technical errors contribute to the gap between source and unseen data distributions, it is a challenging
problem to generalize the predictive models across studies without any prior knowledge of the unseen data
distribution. This study proposes DeepBioGen, a sequencing profile augmentation procedure that
characterizes visual patterns of sequencing profiles, generates realistic profiles based on a deep generative
model capturing the patterns, and generalizes the subsequent classifiers.
Objective 3: Deep Generalized Interpretable Autoencoder Elucidating Gut Microbiota for Better
Cancer Immunotherapy
Recent studies revealed that gut microbiota modulates the response to cancer immunotherapy and fecal
microbiota transplantation has clinical benefit in melanoma patients during the treatment. Understanding
the microbiota that affect individual responses is crucial to advancing precision oncology. However, it is
challenging to identify the key microbial taxa with limited data as statistical and machine learning models
often lose their generalizability. In this study, DeepGeni, a deep generalized interpretable autoencoder, is
proposed to improve the generalizability and interpretability of microbiome profiles by augmenting data
and by introducing interpretable links in the autoencoder.
Chapter 1: Deep Representation Learning for Disease Prediction Based on
Microbiome Data
1.1 Introduction
As our knowledge of microbiota grows, it becomes increasingly clear that the human microbiota plays a
key role in human health and diseases [17]. The microbial community, composed of trillions of microbes,
is a complex and diverse ecosystem living on and inside a human. These commensal microorganisms
benefit humans by allowing them to harvest inaccessible nutrients and maintain the integrity of mucosal
barriers and homeostasis. Especially, the human microbiota contributes to the host immune system
development, affecting multiple cellular processes such as metabolism and immune-related functions [17,
18]. They have been shown to be responsible for the carcinogenesis of certain cancers and to substantially affect
therapeutic response [19]. All this emerging evidence substantiates the potential use of microbiota as a
predictor of various diseases [20].
The development of high-throughput sequencing technologies has enabled researchers to capture a
comprehensive snapshot of the microbial community of interest. The most common components of the
human microbiome can be profiled with 16S rRNA gene sequencing technology in a cost-effective way
[21]. Comparatively, shotgun metagenomic sequencing technology can provide a deeper resolution profile
of the microbial community at the strain level [22, 23]. As the cost of shotgun metagenomic sequencing
keeps decreasing and the resolution increasing, it is likely that a growing role of the microbiome in human
health will be uncovered from the mounting metagenomic datasets.
Although novel technologies have dramatically increased our ability to characterize the human microbiome,
and there is evidence suggesting its potential use for predicting disease state,
effectively utilizing human microbiome data faces several key challenges. Firstly, effective
dimensionality reduction that preserves the intrinsic structure of the microbiome data is required to handle
the high dimensional data with low sample sizes, especially the microbiome data with strain-level
information that often contains hundreds of thousands of gene markers for only a few hundred or fewer
samples. With a low number of samples, a large number of features can cause the curse of dimensionality,
usually inducing sparsity of the data in the feature space. Along with traditional dimensionality reduction
algorithms, an autoencoder, which learns a low-dimensional representation by reconstructing the input [24], can
be applied to microbiome data. Secondly, given the rapidly accumulating metagenomic data, there has been
inadequate effort in adapting machine learning algorithms to predict disease state from microbiome
data. In particular, deep learning is a class of machine learning algorithms that builds on large multi-layer
neural networks, and that can potentially make effective use of metagenomic data. With the rapidly growing
attention from both academia and industry, deep learning has produced unprecedented performance in
various fields, including not only image and speech recognition, natural language processing, and language
translation but also biological and healthcare research [11]. A few studies have applied deep learning
approaches to abundance profiles of the human gut microbiome for disease prediction [25, 26]. However,
there has been no research utilizing strain-level profiles for this purpose. Strain-level profiles,
often containing information on hundreds of thousands of gene markers, should be more informative
for accurately classifying samples into patient and healthy control groups across different types of diseases
than abundance profiles, which usually contain abundance information for only a few hundred bacteria [27].
Lastly, to evaluate and compare the performance of machine learning models, it is necessary to introduce a
rigorous validation framework to estimate their performance on unseen data. Pasolli et al. [27], who
built classification models based on microbiome data, utilized a 10-fold cross-validation scheme that tunes
the hyper-parameters on the test set without using a validation set. This approach may overestimate
model performance, as it exposes the test set to the model during training [28, 29].
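The distinction matters in code. Below is a minimal sketch of the stricter scheme, with toy arrays standing in for microbiome profiles and arbitrary split sizes (none of these numbers come from the study):

```python
import numpy as np

# Toy data standing in for microbiome profiles and disease labels.
rng = np.random.default_rng(42)
X = rng.normal(size=(120, 50))
y = rng.integers(0, 2, size=120)

# Hold out the test set FIRST; hyper-parameters are tuned only on the
# training and validation portions, never on the test set.
idx = rng.permutation(len(X))
test_idx = idx[:24]           # 20% held-out test set
val_idx = idx[24:48]          # validation set for hyper-parameter tuning
train_idx = idx[48:]          # training set

X_train, X_val, X_test = X[train_idx], X[val_idx], X[test_idx]
y_train, y_val, y_test = y[train_idx], y[val_idx], y[test_idx]

# The three subsets are disjoint; test performance is reported once,
# after all tuning is finished.
assert set(test_idx).isdisjoint(val_idx) and set(test_idx).isdisjoint(train_idx)
print(len(X_train), len(X_val), len(X_test))
```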
To address these issues, we propose DeepMicro, a deep representation learning framework that deploys
various autoencoders to learn robust low-dimensional representations from high-dimensional microbiome
profiles and trains classification models based on the learned representation. We applied a thorough
validation scheme that excludes the test set from hyper-parameter optimization to ensure fairness of model
comparison. Our model surpasses the current best methods in disease state prediction for
inflammatory bowel disease, type 2 diabetes in both the Chinese and the European women cohorts, liver
cirrhosis, and obesity. DeepMicro is open-source, publicly available software that will benefit future research,
allowing researchers to obtain robust low-dimensional representations of microbiome profiles with user-
defined deep architectures and hyper-parameters.
1.2 Methods
Dataset and Extracting Microbiome Profiles
We considered publicly available human gut metagenomic samples of six different disease cohorts:
inflammatory bowel disease (IBD), type 2 diabetes in European women (EW-T2D), type 2 diabetes in
Chinese (C-T2D) cohort, obesity (Obesity), liver cirrhosis (Cirrhosis), and colorectal cancer (Colorectal).
All these samples were derived from whole-genome shotgun metagenomic studies that used Illumina
paired-end sequencing technology. Each cohort consists of healthy control and patient samples as shown in
Table 1. IBD cohort has 25 individuals with inflammatory bowel disease and 85 healthy controls [30]. EW-
T2D cohort has 53 European women with type 2 diabetes and 43 healthy European women [31]. C-T2D
cohort has 170 Chinese individuals with type 2 diabetes and 174 healthy Chinese controls [32]. Obesity
cohort has 164 obese patients and 89 non-obese controls [33]. Cirrhosis cohort has 118 patients with liver
cirrhosis and 114 healthy controls [34]. Colorectal cohort has 48 colorectal cancer patients and 73 healthy
controls [35]. In total, 1,156 human gut metagenomic samples, obtained from MetAML repository [27],
were used in our experiments.
Table 1. Human gut microbiome datasets used for disease state prediction

Disease                      Dataset name   # total samples   # healthy controls   # patient samples   Data source
Inflammatory Bowel Disease   IBD            110               85                   25                  [30]
Type 2 Diabetes              EW-T2D         96                43                   53                  [31]
Type 2 Diabetes              C-T2D          344               174                  170                 [32]
Obesity                      Obesity        253               89                   164                 [33]
Liver Cirrhosis              Cirrhosis      232               114                  118                 [34]
Colorectal Cancer            Colorectal     121               73                   48                  [35]
Two types of microbiome profiles were extracted from the metagenomic samples: 1) strain-level marker
profile and 2) species-level relative abundance profile. MetaPhlAn2 was utilized to extract these profiles
with default parameters [23]. We utilized MetAML to preprocess the abundance profile by selecting
species-level features and excluding sub-species-level features [27]. The strain-level marker profile consists
of binary values indicating the presence (1) or absence (0) of a certain strain. The species-level relative
abundance profile consists of real values in [0,1] indicating the percentages of the species in the total
observed species. The abundance profile has a few hundred dimensions, whereas the marker profile has a
much larger number of dimensions, up to over a hundred thousand in the current data (Table 2).
Table 2. The number of dimensions of the preprocessed microbiome profiles

Profile type        IBD      EW-T2D   C-T2D     Obesity   Cirrhosis   Colorectal
marker profile      91,756   83,456   119,792   99,568    120,553     108,034
abundance profile   443      381      572       465       542         503
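As a toy illustration of the two profile types (all values below are made up for clarity, not drawn from the datasets):

```python
import numpy as np

# Strain-level marker profile: binary presence (1) / absence (0) of gene markers.
marker_profile = np.array([1, 0, 0, 1, 1, 0, 1, 0])

# Species-level relative abundance profile: fractions in [0, 1] that sum
# to 1 over the observed species.
abundance_profile = np.array([0.45, 0.30, 0.15, 0.10])

print(int(marker_profile.sum()), float(abundance_profile.sum()))
```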
Deep Representation Learning
An autoencoder is a neural network reconstructing its input 𝑥. Internally, its general form consists of an
encoder function 𝑓𝜙(∙) and a decoder function 𝑓′𝜃(∙) where 𝜙 and 𝜃 are parameters of encoder and decoder
functions, respectively. An autoencoder is trained to minimize the difference between an input 𝑥 and a
reconstructed input 𝑥′, i.e., the reconstruction loss (e.g., squared error), which can be written as follows:
𝐿(𝑥, 𝑥′) = ‖𝑥 − 𝑥′‖² = ‖𝑥 − 𝑓′𝜃(𝑓𝜙(𝑥))‖².
After training an autoencoder, we are interested in obtaining a latent representation 𝑧 = 𝑓𝜙(𝑥) of the input
using the trained encoder. The latent representation, usually in a much lower-dimensional space than the
original input, contains sufficient information for reconstructing the original input as close as possible. We
utilized this representation to train classifiers for disease prediction.
For the DeepMicro framework, we incorporated various deep representation learning techniques, including
shallow autoencoder (SAE), deep autoencoder (DAE), variational autoencoder (VAE), and convolutional
autoencoder (CAE), to learn a low-dimensional embedding of microbiome profiles. Note that diverse
combinations of hyper-parameters defining the structure of the autoencoders (e.g., the number of units and
layers) have been explored in a grid fashion as described below; however, users are not limited to the tested
hyper-parameters and can use their own hyper-parameter grid fitted to their data.
Firstly, we utilized SAE, the simplest autoencoder structure composed of the encoder part where the input
layer is fully connected with the latent layer, and the decoder part where the output layer produces
reconstructed input 𝑥′ by taking weighted sums of outputs of the latent layer. We introduced a linear
activation function for the latent and output layer. Other options for the loss and activation functions are
available to users (such as binary cross-entropy and the sigmoid function). Weights and
biases were initialized with the Glorot uniform initializer [36]. We examined five different sizes of dimensions
for the latent representation (32, 64, 128, 256, and 512).
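As a minimal sketch of the SAE just described (linear activations, squared-error loss, Glorot-style initialization, plain gradient descent), here is a numpy toy version on random data; it is not the DeepMicro implementation, and the dimensions are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 64, 100, 32                 # samples, input dim, latent dim (toy sizes)
X = rng.normal(size=(n, d))

def glorot_uniform(fan_in, fan_out):
    # Glorot uniform: U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out)).
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

W_enc = glorot_uniform(d, k)          # encoder: input -> latent
W_dec = glorot_uniform(k, d)          # decoder: latent -> reconstruction

lr, losses = 0.5, []
for _ in range(300):
    Z = X @ W_enc                     # latent representation z = f_phi(x)
    X_hat = Z @ W_dec                 # reconstruction x'
    diff = X_hat - X
    losses.append(float((diff ** 2).mean()))
    g = 2.0 * diff / X.size           # dL/dX_hat for the mean squared error
    grad_dec = Z.T @ g
    grad_enc = X.T @ (g @ W_dec.T)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

# After training, X @ W_enc is the low-dimensional representation that
# downstream classifiers would consume.
print(losses[0] > losses[-1])
```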
In addition to the SAE model, we implemented the DAE model by introducing hidden layers between the
input and latent layers as well as between the latent and output layers. All of the additional hidden layers
were equipped with the Rectified Linear Unit (ReLU) activation function and the Glorot uniform initializer. The
same number of hidden layers (one layer or two layers) were inserted into both encoder and decoder parts.
Also, we gradually increased the number of hidden units: the number of hidden units in an added layer
was set to double that of the succeeding layer in the encoder part and double that of the preceding layer in
the decoder part. With this setting, model complexity is controlled by both the number of hidden units and
the number of hidden layers, maintaining structural symmetry of the model. For example, if the latent layer
has 512 hidden units and if two layers are inserted to the encoder and decoder parts, then the resulting
autoencoder has 5 hidden layers with 2048, 1024, 512, 1024, and 2048 hidden units, respectively. Similar
to SAE, we varied the number of hidden units in the latent layer over 32, 64, 128, 256, and 512; thus, in
total, we tested 10 different DAE architectures (Appendix A Table A2).
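The layer-sizing rule above can be stated in a few lines; `dae_layer_sizes` is a hypothetical helper for illustration, not part of DeepMicro:

```python
def dae_layer_sizes(latent_units, layers_per_side):
    """Hidden-unit counts for the symmetric DAE described above:
    each layer toward the input doubles the units of its neighbor."""
    encoder = [latent_units * 2 ** i for i in range(layers_per_side, 0, -1)]
    return encoder + [latent_units] + encoder[::-1]

# Example from the text: latent size 512 with two layers per side.
print(dae_layer_sizes(512, 2))   # [2048, 1024, 512, 1024, 2048]
```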
A variational autoencoder (VAE) learns probabilistic representations 𝑧 given input 𝑥 and then uses these
representations to reconstruct input 𝑥′ [37]. Using variational inference, the true posterior distribution of
latent embeddings (i.e., 𝑝(𝑧|𝑥)) can be approximated by the introduced posterior 𝑞𝜙(𝑧|𝑥) where 𝜙 are
parameters of an encoder network. Unlike the previous autoencoders learning an unconstrained
representation, VAE learns a generalized latent representation under the assumption that the posterior
approximation follows Gaussian distribution. The encoder network encodes the means and variances of the
multivariate Gaussian distribution. The latent representation 𝑧 can be sampled from the learned posterior
distribution 𝑞𝜙(𝑧|𝑥) ~ Ν(𝜇, Σ). Then the sampled latent representation is passed into the decoder network
to generate the reconstructed input 𝑥′ ~ 𝑔𝜃(𝑥|𝑧) where 𝜃 are the parameters of the decoder.
To approximate the true posterior, we need to minimize the Kullback-Leibler (KL) divergence between the
introduced posterior and the true posterior,
𝐾𝐿 (𝑞𝜙(𝑧|𝑥)||𝑝(𝑧|𝑥)) = −𝐸𝐿𝐵𝑂(𝜙, 𝜃; 𝑥) + log(𝑝(𝑥)),
rewritten as
log(𝑝(𝑥)) = 𝐸𝐿𝐵𝑂(𝜙, 𝜃; 𝑥) + 𝐾𝐿 (𝑞𝜙(𝑧|𝑥)||𝑝(𝑧|𝑥)),
where 𝐸𝐿𝐵𝑂(𝜙, 𝜃; 𝑥) is an evidence lower bound on the log probability of the data because the KL term
must be greater than or equal to zero. It is intractable to compute the KL term directly but minimizing the
KL divergence is equivalent to maximizing the lower bound, decomposed as follows:
𝐸𝐿𝐵𝑂(𝜙, 𝜃; 𝑥) = 𝔼𝑞𝜙(𝑧|𝑥)[log(𝑔𝜃(𝑥|𝑧))] − 𝐾𝐿 (𝑞𝜙(𝑧|𝑥)||𝑝(𝑧)).
The final objective function can be induced by converting the maximization problem to the minimization
problem.
𝐿(𝜙, 𝜃; 𝑥) = −𝔼𝑞𝜙(𝑧|𝑥)[log(𝑔𝜃(𝑥|𝑧))] + 𝐾𝐿 (𝑞𝜙(𝑧|𝑥)||𝑝(𝑧))
The first term can be viewed as a reconstruction term, as it forces the inferred latent representation to recover its corresponding input, and the second KL term can be considered a regularization term that constrains the posterior of the learned representation to be a Gaussian distribution. We used ReLU activation and the Glorot uniform initializer for the intermediate hidden layers in the encoder and decoder. One intermediate hidden layer was used, and the number of its hidden units was varied over 32, 64, 128, 256, and 512. The latent layer was set to 4, 8, or 16 units. Thus, altogether we tested 15 different model structures.
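As a concrete illustration, the objective above can be written in a few lines of numpy. This is a minimal sketch, not DeepMicro's implementation: it uses the standard closed-form KL divergence between a diagonal-Gaussian posterior and a standard-normal prior, and a squared-error reconstruction term; the function name and shapes are illustrative.

```python
import numpy as np

def vae_loss(x, x_recon, mu, logvar):
    """Per-sample VAE objective: reconstruction term plus KL regularizer.

    For a diagonal-Gaussian posterior q(z|x) = N(mu, diag(exp(logvar)))
    and a standard-normal prior p(z) = N(0, I), the KL term has the
    closed form 0.5 * sum(exp(logvar) + mu^2 - 1 - logvar).
    """
    recon = np.sum((x - x_recon) ** 2)  # squared-error reconstruction loss
    kl = 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)
    return recon + kl

# When the posterior equals the prior (mu = 0, logvar = 0) and the
# reconstruction is perfect, the loss vanishes.
x = np.array([1.0, 2.0])
print(vae_loss(x, x, np.zeros(4), np.zeros(4)))  # 0.0
```

Note how the KL term grows as the posterior drifts away from the prior, which is exactly the regularization effect discussed above.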
Instead of fully connected layers, a convolutional autoencoder (CAE) is equipped with convolutional layers
in which each unit is connected to only local regions of the previous layer [38]. A convolutional layer
consists of multiple filters (kernels) and each filter has a set of weights used to perform convolution
operation that computes dot products between a filter and a local region [39]. We used ReLU activation and the Glorot uniform initializer for convolutional layers. We did not use any pooling layer, as it may generalize too much for reconstructing an input. The 𝑛-dimensional input vector was reshaped into a square image of size 𝑑 × 𝑑 × 1, where 𝑑 = ⌊√𝑛⌋ + 1. As 𝑑² ≥ 𝑛, we padded the remainder of the reshaped input with zeros. To be flexible with respect to input size, the filter size of the first convolutional layer was set to 10% of the input width and height, respectively (i.e., ⌊0.1𝑑⌋ × ⌊0.1𝑑⌋). For the first convolutional layer, we used 25% of the filter size as the stride, which configures how far the filter slides. For the following convolutional layers in the encoder, we used 10% of the output size of the preceding layer as the filter size and 50% of this filter size as the stride. All units in the last convolutional layer of the encoder were flattened in a subsequent flatten layer, which is designated as the latent layer. We utilized convolutional transpose layers (deconvolutional layers) to make the decoder symmetric to the encoder. In
our experiment, the number of filters in a convolutional layer was set to half of that of the preceding layer
for the encoder part. For example, if the first convolutional layer has 64 filters and there are three
convolutional layers in the encoder, then the following two convolutional layers have 32 and 16 filters,
respectively. We varied the number of convolutional layers from 2 to 3 and tried five different numbers of
filters in the first convolutional layer (4, 8, 16, 32, and 64). In total, we tested 10 different CAE model
structures.
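The input-reshaping step described above is simple to sketch; `reshape_to_image` below is an illustrative helper, not DeepMicro's actual code.

```python
import numpy as np

def reshape_to_image(x):
    """Pad an n-dimensional profile with zeros and reshape it to a
    d x d x 1 'image' with d = floor(sqrt(n)) + 1, as described above."""
    n = x.shape[0]
    d = int(np.floor(np.sqrt(n))) + 1
    padded = np.zeros(d * d)
    padded[:n] = x  # zero-pad the remaining d*d - n entries
    return padded.reshape(d, d, 1)

img = reshape_to_image(np.arange(10, dtype=float))  # n = 10 -> d = 4
print(img.shape)  # (4, 4, 1)
```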
To train deep representation models, we split each dataset into a training set, a validation set, and a test set
(64% training set, 16% validation set, and 20% test set; Appendix A Figure A1). Note that the test set was
withheld from training the model. We used an early-stopping strategy: we trained the models on the training set, computed the reconstruction loss on the validation set after each epoch, stopped training if there was no improvement in validation loss for 20 consecutive epochs, and then selected the model with the lowest validation loss as the best model. We used mean squared error for the reconstruction loss and applied the adaptive
moment estimation (Adam) optimizer for gradient descent with default parameters (learning rate: 0.001,
epsilon: 1e-07) as provided in the original paper [40]. We utilized the encoder part of the best model to
produce a low-dimensional representation of the microbiome data for downstream disease prediction.
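The early-stopping rule can be sketched independently of any deep learning framework. `train_with_early_stopping` below is a hypothetical helper that operates on a precomputed list of per-epoch validation losses; in the actual pipeline the losses would come from evaluating the autoencoder after each training epoch.

```python
def train_with_early_stopping(val_losses, patience=20):
    """Early-stopping sketch: scan per-epoch validation losses, stop after
    `patience` epochs without improvement, return (best_epoch, best_loss)."""
    best_loss, best_epoch, wait = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                break  # no improvement for `patience` epochs
    return best_epoch, best_loss

# With patience 2, training stops before reaching the late dip at 0.4.
print(train_with_early_stopping([1.0, 0.5, 0.6, 0.7, 0.4], patience=2))  # (1, 0.5)
```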
Prediction of disease states based on the learned representation
We built classification models based on the encoded low-dimensional representations of microbiome
profiles (Figure 1). Three machine learning algorithms, support vector machine (SVM), random forest (RF),
and Multi-Layer Perceptron (MLP), were used. We explored hyper-parameter space with grid search. SVM
maximizes the margin between the supporting hyperplanes to optimize a decision boundary separating data
points of different classes [41]. In this study, we utilized both radial basis function (RBF) kernel and a linear
kernel function to compute decision margins in the transformed space to which the original data was
mapped. We varied penalty parameter C (2-5, 2-3, …, 25) for both kernels as well as kernel coefficient gamma
(2-15, 2-13, …, 23) for RBF kernel. In total, 60 different combinations of hyper-parameters were examined to
optimize SVM (Appendix A Table A2).
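For reference, the grids described above can be enumerated with a short sketch (the grid values follow the description; DeepMicro's actual grid-search code may differ):

```python
import itertools

# C in {2^-5, 2^-3, ..., 2^5}; gamma in {2^-15, 2^-13, ..., 2^3}
Cs = [2.0 ** e for e in range(-5, 6, 2)]       # 6 values of C
gammas = [2.0 ** e for e in range(-15, 4, 2)]  # 10 values of gamma

# The linear kernel varies only C; the RBF kernel varies both C and gamma,
# so the RBF grid alone accounts for 60 combinations.
linear_grid = [{"kernel": "linear", "C": C} for C in Cs]
rbf_grid = [{"kernel": "rbf", "C": C, "gamma": g}
            for C, g in itertools.product(Cs, gammas)]

print(len(linear_grid), len(rbf_grid))  # 6 60
```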
Figure 1. DeepMicro framework. An autoencoder is trained to map the input X to the low-dimensional
latent space with the encoder and to reconstruct X with the decoder. The encoder part is reused to produce
a latent representation of any new input X that is in turn fed into a classification algorithm to determine
whether the input is the positive or negative class.
RF builds multiple decision trees based on various sub-samples of the training data and merges them to
improve the prediction accuracy. The size of sub-samples is the same as that of training data but the samples
are drawn randomly with replacement from the training data. For the hyper-parameter grid of RF classifier,
the number of trees (estimators) was set to 100, 300, 500, 700, and 900, and the minimum number of
samples in a leaf node was altered from 1 to 5. Also, we tested two criteria, Gini impurity and information
gain, for selecting features to split a node in a decision tree. For the maximum number of features considered when searching for the best split, we used the square root of 𝑛 and the logarithm to base 2 of 𝑛 (where 𝑛 is the number of features). In total, we tested 100 combinations of hyper-parameters of RF.
MLP is an artificial neural network classifier that consists of an input layer, hidden layers, and an output
layer. All of the layers are fully connected to their successive layer. We used ReLU activations for all hidden layers and a sigmoid activation for the output layer, which has a single unit. The number of units in each hidden layer was set to half that of the preceding layer, except for the first hidden layer. We varied the number of
hidden layers (1, 2, and 3), the number of epochs (30, 50, 100, 200, and 300), the number of units in the
first hidden layer (10, 30, 50, 100), and dropout rate (0.1 and 0.3). In total, 120 hyper-parameter
combinations were tested in our experiment.
We implemented DeepMicro in Python 3.5.2 using machine learning and data analytics libraries, including
Numpy 1.16.2, Pandas 0.24.2, Scipy 1.2.1, Scikit-learn 0.20.3, Keras 2.2.4, and Tensorflow 1.13.1. Source
code is publicly available at the git repository (https://github.com/minoh0201/DeepMicro).
Performance Evaluation
To avoid an overestimation of prediction performance, we designed a thorough performance evaluation
scheme (Appendix A Figure A1). For a given dataset (e.g. Cirrhosis), we split it into training and test sets in the ratio of 8:2 with a given random partition seed, keeping the class ratio in both the training and test sets the same as that of the full dataset. Using only the training set, a representation learning model
was trained. Then, the learned representation model was applied to the training set and test set to obtain
dimensionality-reduced training and test sets. After the dimensionality had been reduced, we conducted 5-fold cross-validation on the training set by varying the hyper-parameters of the classifiers. The best hyper-
parameter combination for each classifier was selected by averaging an accuracy metric of the five different
results. The area under the receiver operating characteristics curve (AUC) was used for performance
evaluation. We trained a final classification model using the whole training set with the best combination
of hyper-parameters and tested it on the test set. This procedure was repeated five times by changing the
random partition seed at the beginning of the procedure. The resulting AUC scores were averaged and the
average was used to compare model performance.
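The class-ratio-preserving split at the heart of this scheme can be sketched as follows. This is a simplified stand-in (the actual implementation may rely on library utilities such as scikit-learn's stratified splitters): per class, indices are shuffled with a given seed and a fixed fraction is held out for the test set.

```python
import random

def stratified_split(labels, test_ratio=0.2, seed=0):
    """Class-ratio-preserving train/test split, returning index lists."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train, test = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        k = int(round(len(idx) * test_ratio))  # per-class hold-out size
        test.extend(idx[:k])
        train.extend(idx[k:])
    return sorted(train), sorted(test)

labels = [1] * 30 + [0] * 70
train, test = stratified_split(labels, 0.2, seed=42)
print(len(train), len(test))  # 80 20
```

Repeating the procedure with five different seeds and averaging the test AUCs, as described above, then amounts to looping this split over seed values.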
1.3 Results
We developed DeepMicro, a deep representation learning framework for predicting individual phenotype
based on microbiome profiles. Various autoencoders (SAE, DAE, VAE, and CAE) have been utilized to
learn a low-dimensional representation of the microbiome profiles. Then three classification models
including SVM, RF, and MLP were trained on the learned representation to discriminate between disease
and control sample groups. We tested our framework on six disease datasets (Table 1), including
inflammatory bowel disease (IBD), type 2 diabetes in European women (EW-T2D), type 2 diabetes in
Chinese (C-T2D), obesity (Obesity), liver cirrhosis (Cirrhosis), and colorectal cancer (Colorectal). For all
the datasets, two types of microbiome profiles, strain-level marker profile and species-level relative
abundance profile, have been extracted and tested (Table 2). Also, we devised a thorough performance
evaluation scheme that isolates the test set from the training and validation sets in the hyper-parameter
optimization phase to compare various models (See Methods and Appendix A Figure A1).
We compared our method to the current best approach (MetAML) that directly trained classifiers, such as
SVM and RF, on the original microbiome profile [27]. We utilized the same hyper-parameters grid used in
MetAML for each classification algorithm. In addition, we tested Principal Component Analysis (PCA)
and Gaussian Random Projection (RP), using them as the replacement of the representation learning to
observe how traditional dimensionality reduction algorithms behave. For PCA, we selected the principal
components explaining 99% of the variance in the data [42]. For RP, we set the number of components to
be automatically adjusted according to Johnson-Lindenstrauss lemma (eps parameter was set to 0.5) [43-
45].
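Selecting the principal components that explain 99% of the variance can be sketched via the SVD of the centered data matrix; this is an illustrative computation (the study presumably used scikit-learn's PCA, which supports the same variance-fraction criterion).

```python
import numpy as np

def n_components_for_variance(X, threshold=0.99):
    """Number of principal components needed to explain `threshold`
    of the variance, from the singular values of the centered data."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    ratios = (s ** 2) / np.sum(s ** 2)  # per-component explained variance
    return int(np.searchsorted(np.cumsum(ratios), threshold) + 1)

# Rank-1 data: a single component captures all the variance.
X = np.outer(np.arange(5.0), np.array([1.0, 2.0, 3.0]))
print(n_components_for_variance(X))  # 1
```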
We picked the best model for each approach in terms of prediction performance and compared the
approaches across the datasets. Figure 2 shows the results of DeepMicro and the other approaches for the
strain-level marker profile. DeepMicro outperforms the other approaches for five datasets, including IBD
(AUC = 0.955), EW-T2D (AUC = 0.899), C-T2D (AUC = 0.763), Obesity (AUC = 0.659), and Cirrhosis
(AUC = 0.940). For Colorectal dataset, DeepMicro has slightly lower performance than the best approach
(DeepMicro’s AUC = 0.803 vs. MetAML’s AUC = 0.811). The marker profile-based models generally
perform better than the abundance profile-based models (Appendix A Figure A8 and A2). The only
exception is Obesity dataset for which the abundance-based DeepMicro model shows better performance
(AUC = 0.674). Note that as AUC could be misleading in an imbalanced classification scenario [46], we
also evaluated the area under the precision-recall curve (AUPRC) for the imbalanced data set IBD and
observed the same trend between AUC and AUPRC (Appendix A Table A3).
Figure 2. Disease prediction performance for marker profile-based models. Prediction performance of
various methods built on marker profile has been assessed with AUC. MetAML utilizes support vector
machine (SVM) and random forest (RF), and the superior model is presented (green). Principal component
analysis (PCA; blue) and gaussian random projection (RP; yellow) have been applied to reduce dimensions
of datasets before classification. DeepMicro (red) applies shallow autoencoder (SAE), deep autoencoder
(DAE), variational autoencoder (VAE), and convolutional autoencoder (CAE) for dimensionality reduction.
Then SVM, RF, and multi-layer perceptron (MLP) classification algorithms have been used.
For marker profile, none of the autoencoders dominate across the datasets in terms of getting the best
representation for classification. Also, the best classification algorithm varied according to the learned
representation and to the dataset (Figure 3). For abundance profile, CAE dominates over the other
autoencoders with RF classifier across all the datasets (Appendix A Figure A3).
Figure 3. Disease prediction performance for different autoencoders based on marker profile (assessed with
AUC). Classifiers used: support vector machine (SVM), random forest (RF), and multi-layer perceptron
(MLP); Autoencoders used: shallow autoencoder (SAE), deep autoencoder (DAE), variational autoencoder
(VAE), and convolutional autoencoder (CAE)
We also directly trained MLP on the dataset without representation learning and compared the prediction
performance with that of the traditional approach (the best between SVM and RF). It is shown that MLP
performs better than MetAML in three datasets, EW-T2D, C-T2D, and Obesity, when marker profile is
used (Appendix A Figure A4). However, when abundance profile is used, the performance of MLP was
worse than that of the traditional approach across all the datasets (Appendix A Figure A5).
Furthermore, we compared running time of DeepMicro on marker profiles with a basic approach not using
representation learning. For comparison, we tracked both training time and representation learning time.
For each dataset, we tested the best performing representation learning model producing the highest AUC
score (i.e. SAE for IBD and EW-T2D, DAE for Obesity and Colorectal, and CAE for C-T2D and Cirrhosis;
Appendix A Table A1). We fixed the seed for random partitioning of the data, and applied the formerly
used performance evaluation procedure where 5-fold cross-validation is conducted on the training set to
obtain the best hyper-parameter with which the best model is trained on the whole training set and is
evaluated on the test set (See Methods). The computing machine we used for timestamping is running on
Ubuntu 18.04 and equipped with an Intel Core i9-9820X CPU (10 cores), 64 GB Memory, and a GPU of
NVIDIA GTX 1080 Ti. We note that our implementation utilizes GPU when it learns representations and
switches to CPU mode to exhaustively use multiple cores in a parallel way to find best hyper-parameters
of the classifiers. Table 3 shows the benchmarking result on marker profile. It is worth noting that
DeepMicro is 8 to 30 times faster than the basic approach (17 times faster on average). Even if MLP is excluded from the benchmarking because it requires heavy computation, DeepMicro is up to 5 times faster than the basic approach (2 times faster on average).
Table 3. Time benchmark for DeepMicro and basic approaches without representation learning (in sec)

Method                      IBD     EW-T2D   C-T2D    Obesity   Cirrhosis   Colorectal
Basic approach
  SVM                       126     85       1705     711       777         187
  RF                        42      41       99       79        72          50
  MLP                       3,776   2,449    12,057   8,186     8,593       4,508
  Total elapsed             3,943   2,575    13,861   8,976     9,442       4,745
DeepMicro
  RL*                       74      194      554      113       521         215
  SVM                       2       2        8        8         17          2
  RF                        28      28       47       33        40          30
  MLP                       103     93       188      137       287         105
  Total elapsed             207     317      798      291       864         352

*RL: Representation Learning; SVM: Support Vector Machine; RF: Random Forest; MLP: Multi-layer Perceptron
1.4 Discussion
We developed a deep learning framework transforming a high-dimensional microbiome profile into a low-
dimensional representation and building classification models based on the learned representation. At the
beginning of this study, the main goal was to reduce dimensions as strain-level marker profile has too many
dimensions to handle, expecting that noisy and unnecessary information fades out and the refined
representation becomes tractable for downstream prediction. First, we tested PCA on the marker profile; it showed a slight improvement in prediction performance for C-T2D and Obesity but not for the others.
The preliminary result indicates that either some of the meaningful information was dropped or noisy
information still remains. To learn meaningful feature representations, we trained various autoencoders on
microbiome profiles. Our intuition behind the autoencoders was that the learned representation should keep
essential information in a condensed way because autoencoders are forced to prioritize which properties of
the input should be encoded during the learning process. We found that although the most appropriate
autoencoder usually allows for better representation that in turn results in better prediction performance,
what kind of autoencoder is appropriate highly depends on problem complexity and intrinsic properties of
the data.
In the previous study, it has been shown that adding healthy controls of the other datasets could improve
prediction performance assessed by AUC [27]. To check if this finding can be reproduced, for each dataset,
we added control samples of the other datasets only into the training set and kept the test set the same as
before. Appendix A Figure A6 shows the difference between the best performing models built with and
without additional controls. In general, prediction performance dropped (on average by 0.037) once negative (control) samples were introduced to the training set, across the datasets and in almost all approaches except a few cases (Appendix A Figure A6). In contrast to the previous study, the result indicates that
the insertion of only negative samples into the training set may not help to improve the classification models,
and a possible explanation might be that changes in the models rarely contribute to improving the
classification of positive samples [47]. Interestingly, if we added negative samples into the whole dataset before splitting it into training and test sets, we usually observed improvements in prediction performance.
However, we found that these improvements are trivial because introducing negative samples into the test
set easily reduces false positive rate (as the denominator of false positive rate formula is increased),
resulting in higher AUC scores.
Even though adding negative samples might not be helpful for a better model, it does not mean that
additional samples are meaningless. We argue that more samples can improve prediction performance,
especially when a well-balanced set of samples is augmented. To test this argument, we gradually increased
the proportion of the training set and observed how prediction performance changed over the training sets
of different sizes. Generally, improved prediction performance has been observed as more data of both
positive and negative samples are included (Appendix A Figure A7). With the continued availability of
large samples of microbiome data, the deep representation learning framework is expected to become
increasingly effective for both condensed representation of the original data and also downstream prediction
based on the deep representation.
Chapter 2: Generalizing predictions to unseen sequencing profiles via visual
data augmentation
2.1 Introduction
Predictive models relying on genomic signatures and biomarkers often suffer significantly inferior performance in independent validation on external data sets in biomedical research areas such as disease diagnostics, prognostics, drug discovery, and precision medicine, contributing to the reproducibility crisis [48-51]. Irreproducible models can lead to not only invalid conclusions misleading
subsequent studies but also a substantial waste of time and effort for researchers trying to commercialize
the models to benefit patients [52]. A major factor behind these failures is the lack of generalizability across studies, in each of which the number of heterogeneous data points is insufficient to obtain the statistical power to overcome the generalization barrier. In addition to the small sample size, there is usually a significant
gap between source data that are used to train classifiers and target data that are used to evaluate the
classifiers. One possible cause of the gap is the batch effect such as different sample cohorts, different lab
environments, and differences in experimental protocols across studies [51, 53], which violates the
assumption that source and target data are drawn from the same distribution.
In many real-world applications, trained systems fail to produce accurate predictions for unseen data with
the shifted distribution. For example, illumination or viewpoint changes in data acquisition for an object
detection system and noisier environments for a speech-to-text translation system could easily disrupt the
desired outcome. To address this issue, domain adaptation algorithms have been proposed to better align
source and target data in a domain-invariant feature space when knowledge of target domains is available
during the training phase [54-56]. However, in practice, it is common that no clue on the target domain is
provided. As a more ambitious goal, domain generalization studies focus on training a model generalizing
to the unseen domain without any foreknowledge of the unseen domain. Recent studies proposed different
ways of domain generalization such as extracting domain-invariant features [57-59], leveraging self-
supervised tasks to guide and learn robust representation [60], simulating domain shift in meta-learning
and adding perturbed samples [62, 63]. Although these methods achieved promising performance on benchmark data sets, their requirements, such as having datasets from multiple source domains or datasets large enough for splitting and simulating domain shift, are often not satisfied in biomedical research, where only a limited number of heterogeneous data points from a single source domain is available.
Data augmentation techniques in the computer vision field show promising potential in improving
classifiers by reducing overfitting to source data [64-66]. Especially, recent advances in deep generative
models such as generative adversarial networks (GAN) [67] allow generating visual contents that are
indistinguishable from real ones and also augmenting image data to guide in finding better decision
boundaries [64, 66]. More recently, generative models have been utilized to augment medical images,
including Magnetic Resonance Images (MRI) [68], computed tomography (CT) [69], and X-ray images
[70]. However, there has been little effort in transferring the success in computer vision to biomedical
sequencing data [71]. Furthermore, it is unclear whether augmentation of sequencing data could overcome
the generalization barrier across different studies.
In this study, we propose DeepBioGen, a data augmentation procedure that establishes visual patterns from sequencing profiles and, based on a conditional Wasserstein GAN, generates new sequencing profiles capturing those patterns, to enhance the generalizability of prediction models to unseen data.
DeepBioGen outperforms other augmentation methods in generalizing classifiers to unseen data. Also, the
classifiers generalized by DeepBioGen surpass state-of-the-art classifiers that are designed to work on
unseen profiles when tested on two scenarios: devising a prediction model for immune checkpoint blockade
(anti-PD1) responsiveness in melanoma patients based on RNA sequencing (RNA-seq) data and building
a diagnostic model for type 2 diabetes based on whole-genome metagenomic sequencing data. DeepBioGen
source code is free and available at https://anonymous.4open.science/r/dda7fadf-514e-41b9-a578-
9de25edb4a70/.
2.2 Results
Formation and augmentation of visual patterns of sequencing profiles
Sequencing profiles, such as RNA-seq measurements of gene expression levels, consist of numerical values
that indicate the activity of thousands of genes in different samples or patients. While many statistical
methods such as multivariate linear regression assume that variables are independent of one another, in
reality, genes’ activities are highly correlated [72]. In DeepBioGen, to take into account and visually
formalize the interactivity of related genes, similar features in the profiles were clustered together,
presenting visible patterns after converting numerical values to colors (Figure 4a). Subsequently, a
conditional Wasserstein GAN equipped with convolutional layers to capture the local visual patterns was
implemented to augment the sequencing profiles conditioned on class labels. During the augmentation
phase, multiple GANs were initialized and trained with different random seeds to promote diversity in the
augmented data points (Figure 4b).
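The feature-wise clustering step can be sketched with a small k-means on the feature (column) vectors, after which same-cluster columns are placed side by side so they form visible blocks once values are mapped to colors. The choice of k-means and the parameters here are illustrative stand-ins for DeepBioGen's actual procedure.

```python
import numpy as np

def cluster_features(X, k=3, iters=20, seed=0):
    """Cluster the columns of X with k-means and reorder them so that
    same-cluster features are adjacent. Returns (reordered X, labels)."""
    rng = np.random.default_rng(seed)
    F = X.T  # one row per feature
    centers = F[rng.choice(len(F), size=k, replace=False)]
    for _ in range(iters):
        # assign each feature to its nearest center, then update centers
        labels = np.argmin(((F[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = F[labels == j].mean(axis=0)
    order = np.argsort(labels, kind="stable")  # same-cluster columns adjacent
    return X[:, order], labels[order]

# Three groups of duplicated features end up side by side.
X = np.hstack([np.ones((5, 2)), np.zeros((5, 2)), np.full((5, 2), 5.0)])
Xs, lab = cluster_features(X, k=3)
print(Xs.shape)  # (5, 6)
```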
To inspect the visual quality of augmented data, two different sequencing profiles were used to train the
generative models: one is RNA-seq expression profiles of melanoma patients, and the other is gut
microbiome profiles of type 2 diabetes patients. Visual assessment showed that the augmented profiles
preserved the boundaries of the clustered features and within-cluster color patterns in the same manner as
source data. It is also difficult to distinguish an augmented profile from source data without the original tag
(Appendix B Figure B1 and B2).
Figure 4. DeepBioGen, a sequencing profile augmentation procedure that generalizes classifiers to enhance
prediction performance on unseen data. a, Feature-wise clustering of sequencing profiles to form
perceptible visual patterns. b, Training multiple conditional Wasserstein GANs equipped with up-
convolutional and convolutional layers. c, Generating augmented data from the multiple generators of GAN
models and learning classifiers based on the augmented data along with source data to predict unseen data.
d-e, Results of anti-PD1 therapy response prediction on unseen data by the state-of-the-art and baseline
classifiers (gray) and by classifiers generalized with DeepBioGen (red), SMOTE (green), GMM (yellow),
and Random augmentation (blue); Classification algorithms: Support Vector Machine (SVM) and Neural
network (NN) which is a multi-layer perceptron; Evaluation metric: Area under the receiver operating
characteristics (AUROC). f-g, Results of type 2 diabetes prediction on unseen data.
Generalized classification on unseen sequencing profiles
The augmented data derived from the multiple generators of GANs were injected into training data along
with the source data. The training data was used to train three machine learning classifiers, support vector
machine (SVM), an artificial neural network (NN), and random forest (RF) (Figure 4c). The classifiers
were trained to predict non-responders of cancer immunotherapy (anti-PD1) based on RNA-seq gene
expression profiles or type 2 diabetes based on human gut microbiome profile.
To validate the generalizability of the classifiers, test (unseen) data were secured from studies that are
independent of the source studies. Classification performances on test data were evaluated using an area
under the receiver operating characteristics (AUROC) and an area under the precision-recall curve
(AUPRC). State-of-the-art predictors, TIDE [73] and IMPRES [74] for predicting patient response to anti-
PD1 therapy, and DeepMicro [75] for using deep representations of microbiome data to predict disease
states, were compared to DeepBioGen. Besides, widely-used data augmentation techniques, such as
Gaussian Mixture Model (GMM) [76] and Synthetic Minority Over-sampling Technique (SMOTE) [77],
were used to generate augmented data for comparison. The classifiers trained only on source data were used
as the baseline comparison.
Remarkably, DeepBioGen-based classifiers surpass not only state-of-the-art classifiers but also classifiers
that are trained on augmented data generated by different augmentation methods in both immunotherapy
response (Figure 4d-e and Appendix B Figure B3) and diabetes predictions (Figure 4f-g and Appendix B
Figure B3). Notably, even though the DeepBioGen-based classifiers have no clue about the test data, they outperform Gide et al.'s immune marker classifier (AUROC=0.77), which directly leverages the test data through
differential expression analysis [78]. Especially, DeepBioGen provides a stable performance boost to SVM
and NN classifiers for both problems as the augmentation rate increases. RF classifiers partially benefit
from DeepBioGen, showing generally worse performance than SVM and NN classifiers (Appendix B
Figure B4). Consistently, DeepBioGen reduces ℋ-divergence between the source data and the test data
more than other augmentation methods (Table 4).
Table 4. ℋ-divergence between source and test data

Data type                           DeepBioGen   SMOTE   GMM     Random
RNA-seq tumor expression profile    0.368        0.688   0.512   0.888
WGS human gut microbiome profile    0.268        0.288   0.352   0.858
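A common way to estimate such a divergence empirically is the proxy A-distance: train a domain classifier to separate source from test samples and convert its error into a divergence score, with low values meaning the two domains are hard to tell apart. The sketch below uses a nearest-centroid domain classifier purely for illustration; the estimator used in the study may differ.

```python
import numpy as np

def proxy_h_divergence(source, target):
    """Proxy divergence 2 * (1 - 2 * error) of a simple domain classifier.

    Nearest-centroid classification, trained and evaluated on the same
    points, is used here only to keep the sketch self-contained."""
    X = np.vstack([source, target])
    y = np.array([0] * len(source) + [1] * len(target))
    c0, c1 = source.mean(axis=0), target.mean(axis=0)
    pred = (np.linalg.norm(X - c1, axis=1) < np.linalg.norm(X - c0, axis=1)).astype(int)
    err = np.mean(pred != y)
    return 2.0 * (1.0 - 2.0 * err)

# Perfectly separable domains give the maximal score of 2.0.
print(proxy_h_divergence(np.zeros((10, 2)), np.full((10, 2), 10.0)))  # 2.0
```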
Impact of visual clusters and multiple generators
DeepBioGen uses the elbow method [79] to estimate the optimal number of visual clusters and GANs. To
assess the ability of the approach in inferring the ideal parameters based on source data only, DeepBioGen
models with a varying number of visual clusters or GANs were used to generate the augmented data for
training classifiers. The classification results of unseen data show that the elbow method elicits an optimal
or nearly optimal number of clusters and GANs in both immunotherapy response and diabetes prediction
problems (Appendix B Figure B5-B8).
Notably, the number of clusters has more impact on classification performance than the number of GANs,
suggesting that how sequencing data are clustered and thus presented visually plays a major role in
improving the generalizability of DeepBioGen (Appendix B Figure B5-B8). Results also show that diverse
generators of multiple Wasserstein GANs are more effective in diversifying the augmented sequencing data
than a single generator, thus leading to better generalizability (Appendix B Table B3).
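The elbow heuristic can be sketched as picking the point of greatest change in slope (largest second difference) on a within-cluster-SSE-versus-k curve; this is one common formulation of the method, not necessarily the exact rule used by DeepBioGen.

```python
def elbow_point(ks, sse):
    """Return the k at the elbow of an SSE-vs-k curve, taken as the
    interior point with the largest discrete second difference."""
    curvature = [sse[i - 1] - 2 * sse[i] + sse[i + 1]
                 for i in range(1, len(sse) - 1)]
    return ks[1 + max(range(len(curvature)), key=curvature.__getitem__)]

# SSE drops sharply up to k = 2, then flattens: the elbow is at 2.
ks = [1, 2, 3, 4, 5, 6]
sse = [100.0, 40.0, 15.0, 12.0, 10.0, 9.0]
print(elbow_point(ks, sse))  # 2
```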
Augmentations beyond the boundary of source data
To visualize how DeepBioGen augmented data to generalize classifiers, the source, augmented and test
data were embedded to 2-dimensional space with t-SNE algorithm [80]. In melanoma patient profiles, the
source and test data are placed distantly, while within-cluster data points with different anti-PD1 responses
are located closely in both data clusters (Figure 5a). The data embeddings were plotted separately for two
classes, and an empirical outer boundary of the source data based on the outermost data points heading
toward the test data was drawn with a red dotted line (Figure 5c and 5e). Interestingly, DeepBioGen
generated data points beyond the outer boundaries of the source data cluster (Figure 5d and 5f), whereas
other augmentation methods rarely produced data points that cross the boundaries (Appendix B Figure B7-
B9).
In microbiome profiles of healthy controls and diabetic patients, the test data cluster resides in the side
region of the source data cluster, thus depicting a moderately shifted distribution (Appendix B Figure B12).
DeepBioGen produced augmented microbiome profiles across boundaries of the source data cluster.
Particularly, the outermost augmented data points beyond the source boundaries are closely placed with
test data points that cross the border (Appendix B Figure B12), while other methods rarely generate data
points overpassing the boundaries (Appendix B Figure B13-B15).
Figure 5. t-SNE visualization of augmented tumor expression profiles derived from DeepBioGen along
with the source (grey), augmented (green), and test (unseen, red) data of melanoma patients treated with
anti-PD1 therapy. a, The source and test data. b, The source, test, and augmented data. c, Responders of the
source and test data; An empirical boundary of responders of source data (red dotted line). d, Responders
of the source, test, and augmented data. e, Non-responders of the source and test data; An empirical
boundary of non-responders of source data (red dotted line). f, Non-responders of the source, test, and
augmented data.
Progression-free survival analysis of predicted anti-PD1 treatment responders
For the predicted responder (PR) and non-responder (PNR) patients to anti-PD1 treatment determined by
DeepBioGen-supported SVM classifier, progression-free survival analysis was conducted to estimate the
clinical outcome. For comparison, state-of-the-art classifiers based on genomic signatures, IMPRES and
TIDE, were evaluated with the same analysis. With the DeepBioGen classifier or IMPRES, the PR group
has significantly longer progression-free survival than the PNR group (Figure 6a and 6b),
whereas the two TIDE-predicted groups do not show a significant difference.
Importantly, the median survival time of PRs classified by the DeepBioGen classifier was 755 days (95%
CI [335, N/A]), compared to 440 days (95% CI [125, N/A]) for the IMPRES-classified PRs. Also, the
DeepBioGen classifier tends to be more sensitive in predicting responders than IMPRES, likely posing a
lower risk of unnecessary treatment suggestions, which are often accompanied by side effects (Figure 6
and Table 5).
Figure 6. Kaplan-Meier plots of progression-free survival for predicted responder (PR) and non-responder
(PNR) patients determined by three classifiers. a, generalized SVM classifier with DeepBioGen
augmentations. b, IMPRES. c, TIDE.
Table 5. Summary statistics for progression-free survival analysis

Classifier | Prediction | N | Median survival time (days) | 95% CI | MR* | HR** | 95% CI (HR) | P-value
DeepBioGen-SVM | PR | 27 | 755 | [335, NA] | 9.21 | 3.72 | [1.88, 7.36] | < 0.001
DeepBioGen-SVM | PNR | 23 | 82 | [76, 125] | | | |
IMPRES | PR | 40 | 440 | [125, NA] | 5.71 | 3.47 | [1.66, 7.49] | 0.002
IMPRES | PNR | 10 | 77 | [58, NA] | | | |
TIDE | PR | 40 | 231 | [82, 870] | 0.76 | 0.99 | [0.45, 2.17] | > 0.9
TIDE | PNR | 10 | 303 | [96, NA] | | | |
*Median ratio; **Hazard ratio
2.3 Discussion
DeepBioGen is unique in that it takes input sequencing profiles in a machine-understandable visual form,
whereas visualization of sequencing data (e.g., a heatmap of differentially expressed genes) has typically
been used to present findings in a human-understandable manner. One potential advantage of feeding
DeepBioGen visually recognizable data is that visual patterns difficult to identify with the human eye may
be captured and characterized in the embedding space.
Even with a limited amount of source data, DeepBioGen can alleviate batch effects across independent
studies, without requiring details needed for batch correction such as sample cohorts, lab environments, and
experimental protocols, by reducing the gap between the source and unseen data. DeepBioGen is also highly
extensible to other biological data whose feature dependency is not negligible.
2.4 Methods
Sequencing profiles and pre-processing
Clinical genomic data containing RNA-seq tumor expression profiles of melanoma patients and their
responsiveness to anti-PD1 therapy were obtained from three independent studies [78, 81, 82] (Appendix B
Table B1). Fifty samples from the most recent study were used as test data and the rest were used as source
data. RNA-seq read counts were normalized to transcripts per million (TPM) and then log2-transformed.
To focus on genes related to primary mechanisms of tumor immune evasion, recently identified T cell
signature genes [73], such as regulators of T cell dysfunction and suppressors of T cell infiltration into the
tumor, were selected out of 18,570 common genes across the studies. In total, 702 genes were considered
as features of initial inputs.
Human gut metagenomic sequencing reads of type 2 diabetic patients and healthy controls were acquired
from two independent studies: one on the Chinese cohort [32] and the other on the European women cohort
[31] (Appendix B Table B1). Using MetaPhlAn2 [23], strain-level marker profiles were extracted from the
metagenomic samples. In total, the number of common strain-level markers that are considered as initial
features was 74,240. The European samples in the more recent study were used as test data and Chinese
samples as source data.
Formation of visual patterns from sequencing profiles
Each measurement in source data was standardized by subtracting the mean and dividing by the standard
deviation. The same standardization was applied to test data using the mean and standard deviation of
source data. To meet the dimensional requirement of the pre-defined input layer, the extremely randomized
trees [83] feature selection algorithm was applied to the source data to select 256 features. The k-means
clustering algorithm was used to cluster features. Based on the elbow point where the decrease in the
within-cluster sum of squared errors (WSS) begins to level off, the optimal number of clusters was
determined to be 4 for RNA-seq tumor expression profiles and 6 for human gut microbiome profiles
(Appendix B Figure B16). The selected features were then sorted and rearranged by cluster label so that
similar features are placed nearby. The features of the test data were rearranged in the same order.
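The pre-processing steps above can be sketched with scikit-learn as follows; the function name, the forest size, and the small epsilon guarding against zero variance are illustrative assumptions, not the original implementation:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.cluster import KMeans

def form_visual_patterns(X_src, y_src, X_test, n_features=256, n_clusters=4, seed=0):
    """Standardize, select features, and place similar features side by side."""
    # Standardize with source statistics only; apply the same transform to test data.
    mu, sd = X_src.mean(axis=0), X_src.std(axis=0) + 1e-8
    X_src = (X_src - mu) / sd
    X_test = (X_test - mu) / sd

    # Select the top features by extremely-randomized-trees importance.
    forest = ExtraTreesClassifier(n_estimators=100, random_state=seed).fit(X_src, y_src)
    top = np.argsort(forest.feature_importances_)[::-1][:n_features]

    # Cluster the selected features (k-means on feature vectors, i.e., transposed data)
    # and reorder them so that features in the same cluster are adjacent.
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(X_src[:, top].T)
    order = top[np.argsort(labels, kind="stable")]
    return X_src[:, order], X_test[:, order]
```

The number of clusters would be chosen beforehand via the WSS elbow, as described above.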
Augmentation of sequencing profiles based on their visual patterns
DeepBioGen captures local visual patterns of sequencing profiles by training conditional Wasserstein GAN,
whose generator and critic networks are composed of up-convolutional and convolutional layers,
respectively. The generator tries to generate realistic images enough to fool the critic, whereas the critic
tries to assign higher values for real images than for generated images. During training, the generator and
the critic progressively become better at their jobs by competing against each other. This adversarial
training can be conducted by optimizing a minimax objective. The Wasserstein distance (or Earth Mover's
distance), formulated via Kantorovich-Rubinstein duality, is used in the objective to better reach a Nash
equilibrium [84]. Also, a gradient penalty is applied to the objective function to enforce the Lipschitz
constraint, alleviating potential instability in the critic [85]. Generator function 𝐺 and critic function 𝐶 are
conditioned on the class label 𝑦 and the final objective function of conditional Wasserstein GAN is as
follows:
min_𝐺 max_𝐶  𝔼_{𝑧∼𝑝(𝑧)}[𝐶(𝐺(𝑧|𝑦))] − 𝔼_{𝑥∼𝑃_𝑟}[𝐶(𝑥|𝑦)] − 𝔼_{𝑥̂∼𝑃_𝑥̂}[(‖∇_𝑥̂ 𝐶(𝑥̂|𝑦)‖₂ − 1)²]
where 𝑧 denotes a random noise vector drawn from the noise distribution 𝑝(𝑧), 𝑥 a real profile drawn
from the real data distribution 𝑃_𝑟, and 𝑥̂ ∼ 𝑃_𝑥̂ a point sampled uniformly along straight lines connecting the
real data distribution 𝑃_𝑟 and the output distribution of the generator 𝑃_𝑔 = 𝐺(𝑧|𝑦). The gradient penalty term
directly constrains the norm of the critic's gradient with respect to its input, enforcing the Lipschitz constraint
along the straight lines.
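As a minimal numerical illustration of the gradient penalty term, the sketch below evaluates it for a toy linear critic C(x) = w·x, for which the gradient at every interpolated point x̂ is simply w; the function and variable names are hypothetical and this is not the DeepBioGen implementation:

```python
import numpy as np

def gradient_penalty_linear(w, x_real, x_fake, rng):
    """Gradient penalty E[(||grad C(x_hat)||_2 - 1)^2] for a linear critic C(x) = w.x.

    x_hat is sampled uniformly on straight lines between real and fake points;
    for a linear critic the gradient at any point is w, so the penalty reduces
    to (||w||_2 - 1)^2 regardless of the interpolation.
    """
    eps = rng.uniform(size=(x_real.shape[0], 1))
    x_hat = eps * x_real + (1.0 - eps) * x_fake  # points on connecting lines
    grads = np.tile(w, (x_hat.shape[0], 1))      # dC/dx = w everywhere
    norms = np.linalg.norm(grads, axis=1)
    return float(np.mean((norms - 1.0) ** 2))
```

A critic with a unit-norm weight vector incurs zero penalty, matching the 1-Lipschitz target; in a real GAN the gradient would be obtained by automatic differentiation rather than in closed form.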
The architecture of neural networks that approximate generator function 𝐺 and critic function 𝐶 is
illustrated in Appendix B Figure B17. The generator begins with two input layers, one for receiving a
random noise vector and the other for a class label, followed by dense and embedding layers. Embedded
random noise vector and label vector are reshaped and concatenated. Subsequently, two up-convolutional
blocks, composed of an up-convolutional layer, batch normalization layer, and Leaky ReLU activation layer,
perform inverse convolution operations. Lastly, the final up-convolutional layer produces the generated
sequencing profile. Note that each sequencing profile is treated as a 1×256-pixel image in a single
channel. Similarly, the critic has two input layers, one for sequencing profile and the other for a class label,
which is embedded, reshaped, and concatenated onto the sequencing profile vector. The two consecutive
convolutional blocks, each of which consists of a convolutional layer, Leaky ReLU activation, and dropout
layer, are followed by the output layer with a single unit. Across the generator and critic, the alpha value of
Leaky ReLU is set to 0.3, and the dropout rate is set to 0.3.
To achieve better generalization, multiple clones of the GAN are trained in the same way except for initial
weights in the neural networks. The number of desired GANs is estimated by approximating modes of
samples with the elbow method under the assumption that most modes are generated if the number of
generators is at least as many as the number of modes in source data (Appendix B Figure B18). Individual
generators produce the same number of augmented data points.
Generalized predictions on unseen sequencing profiles
To generalize classifiers predicting clinical outcomes or disease states to unseen data, three classifiers, SVM,
NN, and RF, were built on training data composed of source and augmented sequencing profiles. Hyper-
parameters of the classifiers were optimized based only on source data with a 5-fold cross-validation
scheme. Grid search was applied to explore hyper-parameter space (see details in Appendix B Table B2).
With the best hyper-parameters, prediction models were trained on the pooled source and augmented data.
The generalizability and performance of the prediction models were evaluated on the unseen test data using
AUROC and AUPRC. The performance evaluation was repeated while gradually changing the augmentation
rate, i.e., the ratio of the size of the augmented data to that of the source data.
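The evaluation protocol above can be sketched as follows, assuming scikit-learn and an SVM classifier; the function name, forest of hyper-parameters, and synthetic data in the usage note are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

def train_generalized_classifier(X_src, y_src, X_aug, y_aug, X_test, y_test, param_grid):
    """Tune on source data only, train on pooled source + augmented data,
    and evaluate generalizability on unseen test data."""
    # Hyper-parameter search with 5-fold cross-validation on the source data.
    search = GridSearchCV(SVC(probability=True), param_grid, cv=5, scoring="roc_auc")
    search.fit(X_src, y_src)
    # Refit the best model on the pooled source and augmented data.
    model = search.best_estimator_.fit(np.vstack([X_src, X_aug]),
                                       np.hstack([y_src, y_aug]))
    # Evaluate on the unseen test set.
    scores = model.predict_proba(X_test)[:, 1]
    return model, roc_auc_score(y_test, scores)
```

In the actual study this loop would be repeated over augmentation rates and over the SVM, NN, and RF algorithms.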
For comparison, state-of-the-art classifiers designed to work on unseen data, including TIDE [73], IMPRES
[74], and DeepMicro [75], were evaluated on test data. TIDE predicts anti-PD1 responsiveness of
melanoma patients based on genome-wide expression signatures of T cell dysfunction and exclusion. To
satisfy its requirement, the test data without filtering out any genes from the original data was submitted to
the TIDE response prediction web service. IMPRES is a predictor of anti-PD1 response in melanoma
patients, which is a rule-based classifier manually built based on gene expression relationships between
immune checkpoint gene pairs. Its source code was utilized to evaluate the performance of IMPRES on the
test data. DeepMicro is a deep representation learning framework for improving predictors based on
microbiome profiles. The source data was utilized to learn a low-dimensional representation of the
microbiome data, and classifiers were then trained on the representation and evaluated on the test data.
Furthermore, as an alternative to DeepBioGen, widely-used data augmentation approaches, including
GMM [76] and SMOTE [77], as well as statistics-based random augmentation were evaluated. An
independent GMM model was fitted for each class label, and the optimal number of components in the
GMM model was estimated with the Bayesian information criterion (BIC). SMOTE derives the generated
samples from linear combinations of nearest neighboring samples. Random augmentation draws data points
from the normal distribution whose mean and standard deviation are the same as those of the source data,
assigning an arbitrary class label. Also, as a baseline comparison, machine learning classifiers that are
trained only on source data (i.e., no augmented data) were evaluated on test data.
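A minimal sketch of the statistics-based random augmentation baseline described above (feature-wise normal sampling with arbitrary class labels); the function and argument names are hypothetical:

```python
import numpy as np

def random_augmentation(X_src, class_labels, n_new, rng):
    """Draw synthetic profiles feature-wise from N(mean, std) of the source data
    and assign each an arbitrary class label."""
    mu, sd = X_src.mean(axis=0), X_src.std(axis=0)
    X_new = rng.normal(loc=mu, scale=sd, size=(n_new, X_src.shape[1]))
    y_new = rng.choice(class_labels, size=n_new)  # arbitrary labels
    return X_new, y_new
```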
To understand the impact of generalization on reducing the discrepancy between the source and test data,
a classifier-induced divergence measure, ℋ-divergence, was determined with various classifiers. For a
given set of binary hypotheses ℋ ⊆ {ℎ: 𝑋 → {0,1}}, ℋ-divergence is twice the largest possible difference
between the probabilities of being classified as 1 under the source and test distributions [86, 87]. More formally, the
empirical ℋ-divergence can be written as:
𝑑_ℋ(𝐷_𝑆, 𝐷_𝑇) = 2 sup_{ℎ∈ℋ} |𝑃_{𝑥∼𝐷_𝑆}[ℎ(𝑥) = 1] − 𝑃_{𝑥∼𝐷_𝑇}[ℎ(𝑥) = 1]|

where 𝐷_𝑆 and 𝐷_𝑇 are the source and test data, respectively, and

𝑃_{𝑥∼𝐷}[ℎ(𝑥) = 1] = |{𝑥 : 𝑥 ∈ 𝐷, ℎ(𝑥) = 1}| / |𝐷|
As a proxy for ℋ for each augmentation method, all classifiers obtained by varying the augmentation rate
and the classification algorithm on the augmented training data were included in the set of binary hypotheses.
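Once a finite set of trained hypotheses is fixed, the empirical ℋ-divergence above can be computed directly; a minimal sketch, assuming each hypothesis is a function mapping a batch of samples to 0/1 labels:

```python
import numpy as np

def empirical_h_divergence(hypotheses, D_S, D_T):
    """Empirical H-divergence: 2 * sup_h |P_S[h(x)=1] - P_T[h(x)=1]|.

    `hypotheses` is a finite proxy for the hypothesis class H; each element
    maps an (n, d) sample array to an array of 0/1 labels.
    """
    gaps = [abs(float(np.mean(h(D_S))) - float(np.mean(h(D_T)))) for h in hypotheses]
    return 2.0 * max(gaps)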
Impact of multiple generators on the diversity of generated sequencing profiles
Wasserstein GAN may suffer less from mode collapse than the original GAN, which relies on Jensen-Shannon
divergence in its loss term [84]. However, a single Wasserstein GAN may not be able to produce all modes
of data, and it can be hypothesized that multiple Wasserstein GANs may increase the diversity of augmented
sequencing profiles. To evaluate the diversity of the augmented profiles generated with multiple
Wasserstein GANs, the adapted inception score is used. Originally, the inception score was introduced to
evaluate the quality and diversity of generated images based on the predicted class probability distributions
derived from a pre-trained Inception v3 model [88]. More recently, Gurumurthy et al. suggested a modified
inception score considering within-class diversity of the generated data [89], and this scoring method is
used in the current evaluation. Also, following the note that generators of non-ImageNet data should not be
evaluated with the Inception v3 classifier [90], the Inception model was replaced with the best-performing
baseline classifier trained only on source data. Consequently, the adapted inception score ranges from 1 to 2, and the higher
the score, the better the diversity and quality of the augmented profiles.
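For reference, the standard inception score computed from a classifier's predicted class probabilities can be sketched as below; note this is the original formulation, whereas the modified score of Gurumurthy et al. additionally accounts for within-class diversity and differs in detail. With a binary classifier this quantity also lies in [1, 2]:

```python
import numpy as np

def inception_style_score(probs, eps=1e-12):
    """Standard inception score exp(E_x[KL(p(y|x) || p(y))]) from predicted
    class probabilities (one row per generated sample).

    For a binary classifier the score ranges from 1 (unconfident or collapsed
    output) to 2 (confident predictions evenly spread over both classes).
    """
    marginal = probs.mean(axis=0)  # p(y), averaged over generated samples
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))
```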
t-SNE visualization of the augmented data
To visualize how augmented data is arranged in a high-dimensional space, the augmented data along with
source and test data was embedded into a 2-dimensional space using t-SNE. Also, a class-specific boundary
of the source data cluster facing the test data cluster in the embedded space was drawn with one or two
straight lines through the outermost data points of the source data cluster.
Progression-free survival analysis
The Kaplan-Meier plots were drawn to conduct progression-free survival analysis for predicted responder
and non-responder patients. For each classifier, a receiver operating characteristic (ROC) curve was used
to determine the cut-off value of predictions. The point on the ROC curve closest to (0, 1) was chosen,
identifying the threshold that best balances the true-positive and false-positive rates. The log-rank
test was used to validate statistical significance.
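Choosing the cut-off closest to (0, 1) on the ROC curve can be sketched with scikit-learn; the function name is hypothetical:

```python
import numpy as np
from sklearn.metrics import roc_curve

def closest_to_topleft_threshold(y_true, scores):
    """Pick the prediction cut-off at the ROC point closest to (0, 1),
    i.e., the point best balancing true-positive and false-positive rates."""
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    idx = np.argmin(np.hypot(fpr, 1.0 - tpr))  # Euclidean distance to (0, 1)
    return thresholds[idx]
```

Patients scoring at or above the returned threshold would be labeled predicted responders.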
Chapter 3: Deep generalized interpretable autoencoder elucidates gut
microbiota for better cancer immunotherapy
3.1 Introduction
Recent studies have found that the composition of the gut microbiome modulates the response to cancer
immunotherapies [91-93]. Immune checkpoint inhibitors (ICIs), which block immunosuppressive molecules
of tumor cells and thereby induce a host immune response, are highly effective for only a subset of patients
(~40%) [94]. The gut microbiome has been reported as a major extrinsic modulator of responses to ICIs
such as anti-PD-1. In mice, fecal microbiota transplantation (FMT) from responders to nonresponders
promotes the efficacy of anti-PD-1 therapy in nonresponders [91-93]. More recently, first-in-human clinical
trials observed the clinical benefit of responder-derived FMT in melanoma patients [95, 96]. Although a
favorable gut microbiome is associated with response to anti-PD-1 therapy, its composition and the specific
mechanisms affecting host immune response remain unclear [97].
Determining the key microbiota affecting individual responses to cancer treatment is crucial for advancing
precision oncology. However, this is challenging due to the limited available data sets and the consequent
lack of generalizability of statistical and machine learning models. For example, multiple studies on small
melanoma cohorts have reported gut bacteria associated with response to ICI therapy [91, 92, 98-100], but
unfortunately, there are discrepancies in the findings [97]. Many bacteria reported by those studies did not
appear in multiple studies at the species level except Faecalibacterium prausnitzii and Bacteroides
thetaiotaomicron. Also, a previous attempt to train machine learning classifiers on microbiome profiles
showed relatively low accuracy in predicting ICI response on unseen data [101]. This suggests the
need for curation of massive-scale studies to obtain statistical power to generalize microbial signatures to
unseen data.
Nevertheless, recent advances in artificial intelligence, especially deep learning models for domain
generalization may hold promise in generalizing microbial signatures. Domain generalization, also called
out-of-distribution generalization, aims at learning models that can be generalized to an unseen domain
without any foreknowledge [102]. Domain generalization techniques usually require data from multiple
domains or sufficient enough to simulate domain shifts, and the limited availability of microbiome data
often restricts the application of the techniques. However, more recent studies proposed data augmentation
approaches, circumventing the limitation. Especially, DeepBioGen showed promise in augmenting limited
sequencing data, including microbiome profiles, and improving the generalizability of classification models.
Well-generalized and accurate deep learning models have the potential to be a key part of clinical decision-
making in precision medicine [103, 104]. Despite their remarkable performance, deep learning models are
usually black boxes that are difficult to interpret, which hampers their adoption in clinical practice, as clinicians
and decision-makers prioritize the explainability of the predictions [105]. Also, interpretable models may
provide useful insight into the underlying mechanisms connecting gut microbiome and host immune
response.
In this study, DeepGeni, a deep generalized interpretable autoencoder, is proposed to unveil the gut
microbiome associated with ICI response (Figure 7). A previous study has shown that a deep autoencoder
can produce a highly effective representation of microbiome profiles [75]. Also, a flexible autoencoder
model has been developed for interpretable autoencoding without a significant loss of reconstruction
accuracy [106]. By augmenting microbiome profiles with DeepBioGen and by introducing explainable
links in the autoencoder, DeepGeni improved the generalizability and interpretability of the learned
representation of microbiome profiles. DeepGeni-based classifiers outperform a state-of-the-art classifier
in predicting ICI response using microbiome profiles. Also, interpretable links of DeepGeni reveal
important taxa for ICI response prediction, and the identified taxa are either associated with prolonged
progression-free survival in melanoma patients treated with ICI therapy or differentially abundant between
responders and non-responders.
Figure 7. Overview of DeepGeni analysis
3.2 Methods
Datasets
Gut microbiome data of melanoma patients treated with ICI therapy were collected from four shotgun
metagenomic studies [91, 92, 99, 107]. This study focused on samples gathered before ICI therapy and
excluded the other samples taken after ICI administration. Patients’ responsiveness to ICI therapy was
evaluated with RECIST 1.1 criteria where complete or partial responses are classified as responders and
stable or progressive disease states as non-responders [108]. Since Peters et al.’s data did not have an
explicit classification of responsiveness, patients with over 6 months of progression-free survival were
regarded as responders and the others as non-responders as suggested by Limeta et al. [101]. In total, 130
melanoma patients (66 responders and 64 non-responders) were used (Table 6).
Raw sequencing reads were filtered with fastp and processed with mOTUs2, a metagenomic operational
taxonomic unit (mOTU) profiler [109, 110]. Processed microbiome profiles containing read counts for each phylogenetic marker
gene and each patient were acquired from Limeta et al. [101]. Read counts were normalized by the total
number of reads for each patient, and then log2-transformed. In total, 7,727 mOTUs (features) were
considered in an initial input.
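The normalization step might be sketched as below; the pseudocount and the counts-per-million-style scaling are assumptions for illustration and may differ from the exact processing used:

```python
import numpy as np

def normalize_counts(counts, pseudocount=1.0):
    """Normalize read counts per patient (row) by total reads, then log2-transform.

    The pseudocount and the CPM-style scaling are assumptions here; they keep
    the log defined for zero-count markers.
    """
    rel = counts / counts.sum(axis=1, keepdims=True)  # relative abundance per patient
    return np.log2(rel * 1e6 + pseudocount)           # counts-per-million-like scale
```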
Table 6. Summary of gut microbiome datasets derived from shotgun metagenomic sequencing

Dataset name | Total samples | Responders | Non-responders | Published year | ICI therapy | Reference
Gopalakrishnan | 25 | 14 | 11 | 2018 | Anti-PD-1 | [91]
Matson | 39 | 15 | 24 | 2018 | Anti-PD-1 | [92]
Frankel | 39 | 19 | 20 | 2017 | Anti-PD-1, Anti-CTLA-4, and both | [99]
Peters | 27 | 18 | 9 | 2019 | Anti-PD-1, Anti-CTLA-4, and both | [107]
Microbiome profile augmentation with DeepBioGen
DeepGeni utilizes DeepBioGen, a sequencing profile augmentation procedure that generalizes the
subsequent trainable models with the augmented data (Figure 7a). Visual patterns of source microbiome
profiles are established with feature selection followed by feature-wise clustering. Wasserstein generative
adversarial network (GAN) equipped with convolutional layers capturing the visual patterns generates
realistic profiles and augments source data. The augmented training data can enhance the generalizability
of subsequent models, such as machine learning classifiers, to unseen data. In this study, DeepBioGen
parameters were set to their default values or otherwise configured following the guideline described in
the original paper. Test data was excluded from any estimation of the parameters. Out of 7,727 mOTU features, 256
features were selected by fitting extremely randomized trees on source data [83]. The number of feature-
wise clusters and the number of GAN models were estimated by calculating the within-cluster sum of
squared errors in source data with reduced features.
Generalized autoencoder with interpretable links
An autoencoder consists of encoder and decoder functions that are approximated by neural networks. The
encoder maps the input data points into latent space and the decoder reconstructs the input from the mapped
latent representations. During training, the autoencoder tries to minimize the gap between the input and the
reconstruction by adjusting the weights of the neural networks based on signals back-propagated from the
reconstruction loss term. Formally, the reconstruction loss can be written as
𝐿(𝑥, 𝑥′) = ‖𝑥 − 𝑥′‖² = ‖𝑥 − 𝑓′_𝜃(𝑓_𝜙(𝑥))‖²,
where 𝑥 and 𝑥′ are the input and the reconstruction, 𝑓𝜙(∙) and 𝑓′𝜃(∙) are encoder and decoder functions in
which 𝜙 and 𝜃 are their weights, respectively. The latent representation usually has a smaller dimension
than the original input but it contains concentrated information that can be used to reconstruct the original
input with minimal error. Although the latent representation may hold essential information in a condensed
form, it is not directly interpretable because of the non-linear relationship between latent and original
features.
Svensson et al. suggested a flexible autoencoder model that removes the non-linearity in the decoder function,
opening up the possibility of retaining interpretability without ruining reconstruction quality [106]. The non-linearity of
the autoencoder comes from a non-linear activation function applied to the weighted sum of the preceding
inputs. By removing the activation function in the decoder part, direct linear links from the latent layer to
the output layer can be obtained. In this study, simple autoencoder architectures composed of three dense
layers were utilized: an input layer, a latent layer, and an output layer. The input and output layers have as many
nodes as there are input features. Four different latent layer sizes were examined: 128, 64, 32, and
16. The augmented training data consisting of source and augmented data was used to train the autoencoder.
After training, the encoder part was used to produce latent representations of the augmented training data.
Test data was isolated from any steps of autoencoder training.
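A minimal NumPy sketch of such an autoencoder, with a non-linear encoder and an activation-free (linear) decoder trained by plain gradient descent, is given below; the class name, learning rate, and initialization are illustrative assumptions rather than the DeepGeni implementation:

```python
import numpy as np

class InterpretableAE:
    """Autoencoder with a ReLU encoder and a purely linear decoder, so each
    latent variable connects to every output feature through a single weight
    that can be ranked and interpreted directly."""

    def __init__(self, n_in, n_latent, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0, 0.1, (n_in, n_latent))
        self.b1 = np.zeros(n_latent)
        self.W2 = rng.normal(0, 0.1, (n_latent, n_in))  # interpretable linear links
        self.b2 = np.zeros(n_in)

    def encode(self, X):
        return np.maximum(X @ self.W1 + self.b1, 0.0)  # ReLU encoder

    def decode(self, H):
        return H @ self.W2 + self.b2  # no activation: direct linear links

    def fit(self, X, epochs=200, lr=0.05):
        n = X.shape[0]
        for _ in range(epochs):
            H = self.encode(X)
            err = self.decode(H) - X           # d(MSE)/d(reconstruction), up to a constant
            gW2, gb2 = H.T @ err / n, err.mean(axis=0)
            dH = (err @ self.W2.T) * (H > 0)   # back-prop through the ReLU encoder
            gW1, gb1 = X.T @ dH / n, dH.mean(axis=0)
            self.W1 -= lr * gW1; self.b1 -= lr * gb1
            self.W2 -= lr * gW2; self.b2 -= lr * gb2
        return self

    def loss(self, X):
        return float(np.mean((self.decode(self.encode(X)) - X) ** 2))
```

After training, the rows of W2 play the role of the interpretable links from latent variables to output features described in the next section.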
Generalized latent representations for predicting ICI responses
To estimate the usefulness of the latent representations derived from the generalized autoencoder,
prediction models classifying ICI responses were built on the representations (Figure 7b). Three machine
learning algorithms were used to train the models: support vector machine (SVM), random forest (RF), and
feedforward neural network (NN), i.e., a multi-layer perceptron. Prediction performance was evaluated
in two different validation settings. The first, following the suggestion of Limeta et al., uses the
most recent data set (Peters) as test data and the integration of the rest as source data. The other setting is
cross-study validation that iterates over datasets, leaves one dataset as test data, uses the rest as source data,
and averages over results. In both settings, five-fold cross-validation on the learned representation of source
data was conducted to optimize hyper-parameters of the classification algorithms. Hyper-parameter space
was explored with grid search and the parameter grid is described in Appendix C Table C1. With the best
hyper-parameters, classifiers were trained on representations of the entire source data and evaluated on test
data. The area under the receiver operating characteristic curve (AUC) was used to assess the prediction
performance.
Extracting informative microbiota from interpretable autoencoder
To interpret the latent representations that improve the prediction of ICI response, the most informative
latent variables were selected based on feature importance estimated by extremely randomized trees [83].
The informative signals of the selected latent variables were propagated through direct links in the decoder
network (Figure 7c). Out of 128 latent variables, ten of the most informative variables were considered for
further analysis. For each variable, the links were ranked by the absolute value of their weights and, out of
256 links, the top 20 were selected. After the corresponding output nodes connected to the top 20 links were
mapped to mOTUs in a one-to-one manner, the specified 20 mOTUs were listed into a set of candidates.
By iterating over the ten latent variables, the ten sets of candidates were merged into a unique set of
candidates. For better generalizability, the whole process was repeated four times by dropping one data set
at a time and using the rest. The final list was obtained by taking the intersection of the four candidate sets
and contains 14 mOTUs.
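The link-propagation step can be sketched as follows, given a decoder weight matrix (latent × output) and per-latent-variable importances; the function name and shapes are illustrative:

```python
import numpy as np

def informative_features(decoder_W, latent_importance, n_latent=10, n_links=20):
    """Propagate signals through the linear decoder links: take the most
    important latent variables, rank each one's outgoing links by absolute
    weight, and pool the connected output features (mOTUs) into one set."""
    top_latent = np.argsort(latent_importance)[::-1][:n_latent]
    candidates = set()
    for i in top_latent:
        top_links = np.argsort(np.abs(decoder_W[i]))[::-1][:n_links]
        candidates.update(int(j) for j in top_links)
    return candidates
```

Repeating this over leave-one-dataset-out runs and intersecting the resulting sets would yield the final list described above.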
Statistical Analysis
To assess the impact of the identified informative mOTUs on ICI responsiveness, progression-free survival
analysis, a primary endpoint of clinical oncology studies, was conducted. The Peters et al. cohort (N=27), which
had continuous follow-up of progression-free survival, was used in the analysis. For each
mOTU, the second quartile (median) was used as a cut-off for high abundance. The Kaplan-Meier plot was
drawn and the log-rank test was conducted to validate statistical significance. Differentially abundant taxa
were tested with the Wilcoxon rank-sum (Mann-Whitney U) test.
3.3 Results
Improved prediction of ICI response with generalized interpretable autoencoder
We evaluated the prediction performance of machine learning classifiers utilizing DeepGeni, a deep
generalized interpretable autoencoder. The classifiers were trained to predict the binary response to ICI treatment
(responder vs. non-responder) based on the latent representation of microbiome profiles. Test data was
excluded from the whole process of generalizing and training the autoencoder, whose encoder part
produces the latent representation. DeepGeni-based classifiers were compared to classifiers trained on three
different settings without augmentation: 1) Initial data of 7,727 mOTU features without feature selection
or latent encoding, 2) Feature selected data (256 mOTU features) without latent encoding, 3) Feature
selected data with latent encoding. For each approach, out of three classification algorithms (SVM, RF, and
NN), the best performing one was selected. Also, the state-of-the-art approach that selects differentially
abundant mOTU features and applies a random forest classification algorithm was included in the
comparison. As an independent validation setting, the most recent study’s data (Peters) was used as test
data and the rest as source data for training classifiers.
Figure 8. Receiver operating characteristics (ROC) curves of the best classifier for each method
Remarkably, the DeepGeni-based NN classifier surpasses not only the state-of-the-art classifier (Limeta et
al.) but also the best classifiers of the other approaches (Figure 8). In addition, the rest of the DeepGeni-based
classifiers (SVM and RF) show better performance than the classifiers of other approaches (Appendix C
Table C2). Also, the DeepGeni-based SVM classifier outperforms other classifiers in the cross-study
validation setting, displaying the highest generalizability across different studies (Table 7).
Table 7. Averaged AUC in cross-study validation setting

Approach | SVM | RF | NN
No FS | 0.520 (0.156) | 0.522 (0.074) | 0.556 (0.070)
FS only | 0.564 (0.107) | 0.551 (0.103) | 0.585 (0.080)
FS + AE | 0.602 (0.060) | 0.570 (0.053) | 0.598 (0.045)
DeepGeni (FS + DBG + AE) | 0.626 (0.209) | 0.579 (0.090) | 0.609 (0.221)
- Values are mean AUC (standard deviation). FS: feature selection; AE: autoencoder; DBG: DeepBioGen
Key microbiota relevant to ICI response extracted from generalized interpretable autoencoder
The final list of ICI-response-relevant key microbiota was identified by propagating informative signals
through the interpretable links from the latent variables that play a major role in inducing the superior ICI
response prediction. The final list, consisting of fourteen mOTUs in seven families, was validated with
previous literature and statistical tests. Previous studies have reported twelve of the fourteen at higher
taxonomic levels; the list thus generally identifies microbiota associated with ICI therapy at a higher
taxonomic resolution (Table 8). Interestingly, two novel ICI-therapy-relevant gut bacteria, Eggerthella lenta
and an unknown Lactobacillales, were identified that were not detected in previous studies. It is worth noting
that the genus Subdoligranulum is closely related to the Faecalibacterium genus. Furthermore, five species,
including Lactobacillus plantarum, an unknown Ruminococcaceae, and three unknown Clostridiales,
displayed statistical significance in differential abundance testing (unadjusted, Wilcoxon's rank-sum test).
In addition, a high abundance of an unknown Eubacterium species was significantly associated with
prolonged progression-free survival in ICI-treated melanoma patients (Figure 9).
Table 8. The final list of ICI-response-relevant key microbiota

mOTU_v2 ID | Consensus taxonomy | Order | Family | Genus | Specified level | Prev. level | H-Res | P-val
ref_mOTU_v2_0036 | Enterobacteriaceae sp. | Enterobacteriales | Enterobacteriaceae | Escherichia/Shigella | Species | Species [92] | - |
ref_mOTU_v2_0154 | Lactobacillus plantarum | Lactobacillales | Lactobacillaceae | Lactobacillus | Species | Family [92] | Yes | *
meta_mOTU_v2_6288 | unknown Lactobacillales | Lactobacillales | unknown | unknown | Family | - | - |
ref_mOTU_v2_0642 | Eggerthella lenta | Eggerthellales | Eggerthellaceae | Eggerthella | Species | - | - |
ref_mOTU_v2_0884 | Anaerotruncus colihominis | Clostridiales | Ruminococcaceae | Anaerotruncus | Species | Family [91] | Yes |
ref_mOTU_v2_4738 | Subdoligranulum sp. | Clostridiales | Ruminococcaceae | Subdoligranulum | Species | Family [91] | Yes |
ref_mOTU_v2_0281 | Ruminococcus lactaris | Clostridiales | Ruminococcaceae | Ruminococcus | Species | Genus [91, 98] | Yes |
meta_mOTU_v2_6557 | unknown Ruminococcaceae | Clostridiales | Ruminococcaceae | unknown | Genus | Family [91] | Yes | **
meta_mOTU_v2_6657 | unknown Eubacterium | Clostridiales | Eubacteriaceae | Eubacterium | Species | Genus [91, 98] | Yes | #
meta_mOTU_v2_5411 | unknown Clostridiales | Clostridiales | unknown | unknown | Family | Order [91] | Yes |
meta_mOTU_v2_5669 | unknown Clostridiales | Clostridiales | unknown | unknown | Family | Order [91] | Yes | *
meta_mOTU_v2_6760 | unknown Clostridiales | Clostridiales | unknown | unknown | Family | Order [91] | Yes | *
meta_mOTU_v2_6795 | unknown Clostridiales | Clostridiales | unknown | unknown | Family | Order [91] | Yes | *
meta_mOTU_v2_7550 | unknown Clostridiales | Clostridiales | unknown | unknown | Family | Order [91] | Yes |

*: p < 0.05, Wilcoxon's rank-sum test on differential abundance; **: p < 0.01, Wilcoxon's rank-sum test; #: p < 0.05, log-rank test on progression-free survival distribution difference. H-Res indicates whether the specified taxonomic level is at a higher resolution than the level previously specified in other studies.
Figure 9. Kaplan-Meier plot of progression-free survival by relative abundance of unknown Eubacterium
species
3.4 Discussion
DeepGeni is a generalized interpretable autoencoder that not only boosts ICI response prediction accuracy
on an independent study but also provides interpretable links to identify informative taxa contributing to
the modulation of ICI response. The improved generalizability of DeepGeni is attributed to the augmented
microbiome data generated by DeepBioGen, a GAN-based data augmentation procedure. The latent
representation learned by the generalized autoencoder with the augmented data enables training classifiers
that are more resilient to unseen data distributions. Also, DeepGeni extracted microbial species informative
for predicting ICI response at a higher taxonomic resolution than other studies. The specified species could
be a helpful basis for establishing ICI-promoting FMT guidelines specifying donors and recipients.
Moreover, the identified species may offer a possibility to develop prebiotics or probiotics targeting
improved outcomes of ICI therapy.
Although this study produces a generalized list of key ICI-response-relevant microbial taxa over the available datasets, the ability to statistically validate the identified taxa is bounded by the size of the available data. Some of the key taxa were identified by taking advantage of out-of-distribution augmented data, and it may not be appropriate to use augmented data for statistical validation. However, these taxa may still be validated in larger data sets once such data become available.
DeepGeni was applied specifically to examine how the microbiome modulates ICI response in this study, but it is readily extensible to other microbiome-driven human phenotypes and even to other types of biological and ecological data, such as genome and metagenome profiles.
Conclusion
In this thesis, various deep learning models were developed to address the limitations of utilizing omics data to promote precision medicine. These models were trained to produce effective secondary data that improves classification performance and the interpretability of predicted outcomes. These achievements may facilitate the adoption of novel classification techniques and, therefore, the establishment of a standard clinical decision-making process in precision medicine. The main deliverables and prospects of this thesis are listed as follows:
1. DeepMicro is publicly available software that offers cutting-edge deep learning techniques for learning meaningful representations of the given data. Researchers can apply DeepMicro to their high-dimensional microbiome data to obtain a robust low-dimensional representation for subsequent supervised or unsupervised learning. For problems such as drug response prediction, forensic human identification, and food allergy prediction using microbiome data, deep representation learning might be useful for boosting model performance. It might also be worthwhile to use the learned representation for clustering analysis: data points in the latent space can be clustered, which may help capture shared characteristics within groups that are not clear in the original data space. DeepMicro has been used for microbiome data but can be extended to various omics data such as genome and proteome data.
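The representation-learning step described above can be illustrated with a minimal sketch: a linear shallow autoencoder trained by plain gradient descent on synthetic data. This is a conceptual stand-in, not DeepMicro's implementation, which supports deeper and nonlinear architectures:

```python
import numpy as np

def train_shallow_autoencoder(X, latent_dim=16, lr=0.02, epochs=300, seed=0):
    """Linear shallow autoencoder trained by full-batch gradient descent.
    Returns encoder/decoder weights and the reconstruction-loss history."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w_enc = rng.normal(0.0, 0.1, (d, latent_dim))
    w_dec = rng.normal(0.0, 0.1, (latent_dim, d))
    losses = []
    for _ in range(epochs):
        z = X @ w_enc                       # encode into the latent space
        x_hat = z @ w_dec                   # decode back to the input space
        err = x_hat - X
        losses.append(float((err ** 2).mean()))
        g_dec = (z.T @ err) / n             # descent directions (up to a constant)
        g_enc = (X.T @ (err @ w_dec.T)) / n
        w_dec -= lr * g_dec
        w_enc -= lr * g_enc
    return w_enc, w_dec, losses

# Stand-in for a high-dimensional profile matrix (samples x features).
X = np.random.default_rng(1).normal(size=(60, 200))
w_enc, w_dec, losses = train_shallow_autoencoder(X)
Z = X @ w_enc  # low-dimensional representation for a downstream SVM/RF/MLP
```

The latent matrix Z then replaces the raw high-dimensional profile as input to the downstream classifier, which is the core idea behind DeepMicro's pipeline.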
2. DeepBioGen provides a framework for effective data augmentation of sequencing profiles that can be used to expand the training data and improve the performance of prediction models on unseen data. It adversarially learns multiple generative models that capture visual signals from the source data. With multiple generators, DeepBioGen generates realistic augmented data beyond the boundary of the source domain. The augmented data can be used to amplify the training data and to train classifiers resilient to unknown domain shifts. Consequently, DeepBioGen can improve the transferability and reproducibility of prediction models without any knowledge of the unseen data. In future work, it is envisioned that the process of forming visual patterns from sequencing profiles can itself be learned with cutting-edge machine learning models, toward better formation of machine-understandable patterns.
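The ensemble-of-generators augmentation scheme can be sketched structurally as follows. Each conditional Wasserstein GAN is replaced here by a naive per-class Gaussian sampler — a toy stand-in, not DeepBioGen's actual generator — to show how samples drawn from multiple generators are pooled with the source data:

```python
import numpy as np

def fit_gaussian_generator(X_class, seed):
    """Toy stand-in for one conditional generator: an independent-Gaussian
    sampler fit to one class of the source data."""
    mu, sd = X_class.mean(axis=0), X_class.std(axis=0) + 1e-6
    rng = np.random.default_rng(seed)
    return lambda n: rng.normal(mu, sd, size=(n, X_class.shape[1]))

def augment(X, y, n_generators=3, n_per_generator=20):
    """Pool samples from an ensemble of per-class generators with the source."""
    y = np.asarray(y)
    xs, ys = [X], [y]
    for label in np.unique(y):
        X_class = X[y == label]
        for g in range(n_generators):       # one ensemble member per seed
            sample = fit_gaussian_generator(X_class, seed=g)
            xs.append(sample(n_per_generator))
            ys.append(np.full(n_per_generator, label))
    return np.vstack(xs), np.concatenate(ys)

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 10))
y = np.repeat([0, 1], 20)
X_aug, y_aug = augment(X, y)  # source plus class-conditional generated samples
```

The amplified training set (X_aug, y_aug) is then fed to the classifier; in DeepBioGen, the per-class samplers are adversarially trained conditional Wasserstein GANs rather than Gaussians.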
3. DeepGeni is a generalized, interpretable autoencoder that not only boosts ICI response prediction accuracy on an independent study but also provides interpretable links that identify informative taxa contributing to the modulation of ICI response. DeepGeni was applied specifically to examine how the microbiome modulates ICI response in this study, but it is readily extensible to other microbiome-driven human phenotypes and even to other types of biological and ecological data, such as genome and metagenome profiles. For these profiles as well, DeepGeni can provide a reasonable explanation of black-box models through interpretable links. In the future, it is envisioned that the interpretable links will be extended into the subsequent classification models and into individualized explanations for each predicted outcome.
Appendix A
Contents
• Figure A1. Performance evaluation scheme
• Figure A2. Disease prediction performance for abundance profile-based models
• Figure A3. Disease prediction performance for different autoencoders based on abundance
profile (assessed with AUC)
• Figure A4. Disease prediction performance of multi-layer perceptron without representation
learning based on marker profile
• Figure A5. Disease prediction performance of multi-layer perceptron without representation
learning based on abundance profile
• Figure A6. Impact of introducing negative samples into the training set on AUC
• Figure A7. Prediction performance changes over the increasing data points in the training set
• Figure A8. Disease prediction performance for marker profile-based models (fixed scale)
• Table A1. The best representation learning model structures for each dataset
• Table A2. Hyper-parameters used in grid search
• Table A3. Performance evaluation with area under precision-recall curve for IBD dataset
Figure A1. Performance evaluation scheme
Figure A2. Disease prediction performance for abundance profile-based models. Prediction performance of various methods built on abundance profile has been assessed with AUC. MetAML utilizes support vector machine (SVM) and random forest (RF), and the superior model is presented (green). Principal component analysis (PCA; blue) and Gaussian random projection (RP; yellow) have been applied to reduce the dimensions of the datasets before classification. DeepMicro (red) applies a shallow autoencoder (SAE), deep autoencoder (DAE), variational autoencoder (VAE), and convolutional autoencoder (CAE) for dimensionality reduction. Then SVM, RF, and multi-layer perceptron (MLP) classification algorithms have been used.
Figure A3. Disease prediction performance for different autoencoders based on abundance profile
(assessed with AUC). Classifiers used: support vector machine (SVM), random forest (RF), and multi-layer
perceptron (MLP); Autoencoders used: shallow autoencoder (SAE), deep autoencoder (DAE), variational
autoencoder (VAE), and convolutional autoencoder (CAE)
Figure A4. Disease prediction performance of multi-layer perceptron without representation learning
based on marker profile
Figure A5. Disease prediction performance of multi-layer perceptron without representation learning
based on abundance profile
Figure A6. Impact of introducing negative samples into the training set on AUC
Figure A7. Prediction performance changes over the increasing data points in the training set
Figure A8. Disease prediction performance for marker profile-based models (fixed scale).
Table A1. The best representation learning model structures for each dataset
Microbiome profile type | Dataset | Size of original dim# | Representation learning model | Encoder structure* | Size of latent dim | Classifier | Averaged AUC (Standard Error) | Averaged Accuracy (Standard Error)**
Strain-level marker profile
IBD 91,756
SAE 64 64 SVM 0.955 (0.013) 0.773 (0.000)
DAE 512-256-128 128 RF 0.911 (0.046) 0.855 (0.027)
VAE 128-4 4 MLP 0.899 (0.039) 0.818 (0.014)
CAE 8-4 1,936 RF 0.929 (0.010) 0.882 (0.011)
EW-T2D 83,456
SAE 256 256 RF 0.899 (0.046) 0.800 (0.047)
DAE 256-128-64 64 RF 0.840 (0.029) 0.730 (0.041)
VAE 256-16 16 SVM 0.853 (0.041) 0.600 (0.039)
CAE 8-4 1,764 SVM 0.796 (0.014) 0.670 (0.030)
C-T2D 119,792
SAE 512 512 SVM 0.762 (0.008) 0.664 (0.021)
DAE 256-128 128 RF 0.702 (0.029) 0.649 (0.019)
VAE 128-16 16 SVM 0.719 (0.019) 0.664 (0.022)
CAE 4-2 968 MLP 0.763 (0.014) 0.710 (0.008)
Obesity 99,568
SAE 512 512 MLP 0.658 (0.045) 0.624 (0.027)
DAE 256-128 128 RF 0.659 (0.034) 0.635 (0.012)
VAE 512-8 8 RF 0.599 (0.014) 0.639 (0.013)
CAE 64-32 16,928 RF 0.622 (0.012) 0.655 (0.008)
Cirrhosis 120,553
SAE 256 256 SVM 0.928 (0.006) 0.821 (0.020)
DAE 512-256-128 128 SVM 0.903 (0.011) 0.809 (0.012)
VAE 256-8 8 SVM 0.891 (0.016) 0.792 (0.029)
CAE 16-8 3,872 SVM 0.940 (0.006) 0.864 (0.008)
Colorectal 108,034
SAE 32 32 MLP 0.799 (0.058) 0.752 (0.039)
DAE 512-256-128 128 MLP 0.803 (0.072) 0.728 (0.046)
VAE 256-8 8 RF 0.737 (0.068) 0.696 (0.037)
CAE 4-2-1 441 MLP 0.789 (0.044) 0.744 (0.033)
Species-level relative abundance profile
IBD 443
SAE 512 512 MLP 0.817 (0.031) 0.782 (0.017)
DAE 512-256 256 MLP 0.779 (0.039) 0.791 (0.037)
VAE 32-8 8 RF 0.779 (0.032) 0.782 (0.017)
CAE 32-16-8 3,872 RF 0.873 (0.030) 0.809 (0.017)
EW-T2D 381
SAE 256 256 SVM 0.640 (0.033) 0.630 (0.037)
DAE 1024-512 512 SVM 0.612 (0.060) 0.580 (0.026)
VAE 64-8 8 RF 0.640 (0.051) 0.570 (0.047)
CAE 16-8 3,200 RF 0.829 (0.039) 0.740 (0.037)
C-T2D 572
SAE 64 64 SVM 0.715 (0.023) 0.635 (0.030)
DAE 128-64 64 SVM 0.711 (0.026) 0.649 (0.026)
VAE 512-16 16 SVM 0.715 (0.031) 0.652 (0.031)
CAE 4-2-1 576 RF 0.725 (0.025) 0.644 (0.025)
Obesity 465
SAE 128 128 MLP 0.645 (0.030) 0.659 (0.017)
DAE 1024-512 512 MLP 0.631 (0.051) 0.612 (0.020)
VAE 256-4 4 MLP 0.600 (0.030) 0.635 (0.012)
CAE 4-2 968 RF 0.674 (0.034) 0.655 (0.013)
Cirrhosis 542
SAE 32 32 SVM 0.801 (0.035) 0.723 (0.050)
DAE 1024-512 512 MLP 0.806 (0.017) 0.706 (0.030)
VAE 512-8 8 SVM 0.781 (0.021) 0.711 (0.035)
CAE 16-8-4 1,461 RF 0.888 (0.011) 0.830 (0.029)
Colorectal 503
SAE 256 256 SVM 0.712 (0.052) 0.672 (0.037)
DAE 256-128 128 SVM 0.728 (0.056) 0.648 (0.046)
VAE 512-8 8 SVM 0.739 (0.070) 0.632 (0.037)
CAE 8-4 2,116 RF 0.809 (0.046) 0.704 (0.020)
#Dim: Dimension; SAE: Shallow Autoencoder; DAE: Deep Autoencoder; VAE: Variational Autoencoder; CAE: Convolutional Autoencoder; SVM: Support Vector Machine; RF: Random Forest; MLP: Multi-layer Perceptron
*The number of units for SAE, DAE, and VAE; the number of filters for CAE; layers are separated by the delimiter "-"
**Note that the models are optimized for AUC, not accuracy; to compare accuracy directly with your models, re-train these models optimizing for accuracy.
Table A2. Hyper-parameters used in grid search
Purpose | Method | Hyper-parameter tuned with grid search | Used values
Learning representation | SAE | Size of latent layer | 32, 64, 128, 256, 512
Learning representation | DAE | Size of latent layer | 32, 64, 128, 256, 512
Learning representation | DAE | # of hidden layers in both encoder and decoder | 1, 2
Learning representation | VAE | Size of latent layer | 4, 8, 16
Learning representation | VAE | # of hidden units in the hidden layers | 32, 64, 128, 256, 512
Learning representation | CAE | # of convolutional layers | 2, 3
Learning representation | CAE | # of filters in the first convolutional layer | 4, 8, 16, 32, 64
Learning classifier | SVM | Penalty parameter C | 2^-5, 2^-3, 2^-1, 2^1, 2^3, 2^5
Learning classifier | SVM | RBF kernel coefficient | 2^-15, 2^-13, 2^-11, 2^-9, 2^-7, 2^-5, 2^-3, 2^-1, 2^1, 2^3
Learning classifier | RF | # of trees (estimators) | 100, 300, 500, 700, 900
Learning classifier | RF | Minimum # of samples in a leaf node | 1, 2, 3, 4, 5
Learning classifier | RF | Split criteria | Gini impurity, information gain
Learning classifier | MLP | # of hidden layers | 1, 2, 3
Learning classifier | MLP | # of hidden units in the first layer | 10, 30, 50, 100
Learning classifier | MLP | Dropout rate | 0.1, 0.3
Learning classifier | MLP | # of epochs | 30, 50, 100, 200, 300
# SAE: Shallow Autoencoder; DAE: Deep Autoencoder; VAE: Variational Autoencoder; CAE: Convolutional Autoencoder; SVM: Support Vector Machine; RF: Random Forest; MLP: Multi-layer Perceptron
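The classifier grids in Table A2 map directly onto scikit-learn's GridSearchCV. The sketch below runs the SVM row on a synthetic stand-in dataset (the thesis tunes on learned microbiome representations, not on this toy data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for a learned low-dimensional representation.
X, y = make_classification(n_samples=120, n_features=20, random_state=0)

# SVM grid from Table A2: C over odd powers of two in [-5, 5],
# RBF gamma over odd powers of two in [-15, 3].
param_grid = {
    "C": [2.0 ** k for k in (-5, -3, -1, 1, 3, 5)],
    "gamma": [2.0 ** k for k in (-15, -13, -11, -9, -7, -5, -3, -1, 1, 3)],
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="roc_auc")
search.fit(X, y)
best = search.best_params_  # the (C, gamma) pair with the highest CV AUC
```

Optimizing with scoring="roc_auc" matches the thesis's choice of AUC as the model-selection criterion.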
Table A3. Performance evaluation with area under precision-recall curve for IBD dataset
Microbiome profile type | Methods | Representation learning | Classifier | AUC* (Standard Error) | AUPRC** (Standard Error)
Strain-level marker profile | DeepMicro | SAE | SVM | 0.9553 (0.013) | 0.8653 (0.035)
Strain-level marker profile | MetAML | - | RF | 0.8918 (0.033) | 0.6770 (0.102)
Strain-level marker profile | PCA-based | PCA | MLP | 0.9223 (0.024) | 0.7965 (0.059)
Strain-level marker profile | RP-based | RP | RF | 0.7882 (0.044) | 0.5461 (0.079)
Species-level abundance profile | DeepMicro | CAE | RF | 0.8659 (0.033) | 0.7020 (0.064)
Species-level abundance profile | MetAML | - | RF | 0.9153 (0.037) | 0.7915 (0.076)
Species-level abundance profile | PCA-based | PCA | RF | 0.8247 (0.034) | 0.6220 (0.021)
Species-level abundance profile | RP-based | RP | RF | 0.7365 (0.052) | 0.4980 (0.075)
# SAE: Shallow Autoencoder; CAE: Convolutional Autoencoder; PCA: Principal Component Analysis; RP: Random Projection; SVM: Support Vector Machine; RF: Random Forest; MLP: Multi-layer Perceptron
*AUC: Area Under the receiver operating characteristic (ROC) Curve
**AUPRC: Area Under the Precision-Recall Curve
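Both metrics defined in the footnotes are available in scikit-learn; average_precision_score is the usual estimator of AUPRC. A sketch on synthetic labels and scores (not the thesis's evaluation code):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=300)                 # binary disease labels
y_score = y_true + rng.normal(0.0, 1.0, size=300)     # informative, noisy scores

auroc = roc_auc_score(y_true, y_score)                # area under the ROC curve
auprc = average_precision_score(y_true, y_score)      # area under the PR curve
```

AUPRC is the more informative of the two when classes are imbalanced, which is why Table A3 reports it alongside AUC for the IBD dataset.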
Appendix B
Contents
• Figure B1. Visualization of tumor gene expression profiles of melanoma patients.
• Figure B2. Visualization of microbiome marker profiles of diabetic patients and healthy controls.
• Figure B3. Prediction performance on unseen data (AUPRC).
• Figure B4. Prediction performance on unseen data with Random forest (RF) classifier.
• Figure B5. Anti-PD1 therapy response prediction performance (AUROC) on unseen data by
varying the number of visual clusters and that of GAN models.
• Figure B6. Anti-PD1 therapy response prediction performance (AUPRC) on unseen data by
varying the number of visual clusters and that of GAN models.
• Figure B7. Type 2 diabetes prediction performance (AUROC) on unseen data by varying the
number of visual clusters and that of GAN models.
• Figure B8. Type 2 diabetes prediction performance (AUPRC) on unseen data by varying the
number of visual clusters and that of GAN models.
• Figure B9. t-SNE visualization of augmented tumor expression profiles derived from Random
augmentation along with the source and test (unseen) data of melanoma patients treated with anti-
PD1 therapy.
• Figure B10. t-SNE visualization of augmented tumor expression profiles derived from GMM
along with the source and test (unseen) data of melanoma patients treated with anti-PD1 therapy.
• Figure B11. t-SNE visualization of augmented tumor expression profiles derived from SMOTE
along with the source and test (unseen) data of melanoma patients treated with anti-PD1 therapy.
• Figure B12. t-SNE visualization of augmented microbiome profiles derived from DeepBioGen
along with the source and test (unseen) data of diabetic patients and healthy controls.
• Figure B13. t-SNE visualization of augmented microbiome profiles derived from Random
augmentation along with the source and test (unseen) data of diabetic patients and healthy
controls.
• Figure B14. t-SNE visualization of augmented microbiome profiles derived from GMM along
with the source and test (unseen) data of diabetic patients and healthy controls.
• Figure B15. t-SNE visualization of augmented microbiome profiles derived from SMOTE along
with the source and test (unseen) data of diabetic patients and healthy controls.
• Figure B16. Feature-wise WSS by the number of clusters.
• Figure B17. Conditional Wasserstein GAN architecture in DeepBioGen.
• Figure B18. Sample-wise WSS by the number of GANs.
• Table B1. Summary of sequencing data sets.
• Table B2. Hyper-parameter grid for optimizing classifiers.
• Table B3. Modified inception scores of generated sequencing profiles varying the number of
conditional Wasserstein GANs.
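Several of the figures listed above are t-SNE projections of high-dimensional profiles into two dimensions. A minimal sketch with scikit-learn, using synthetic stand-ins for the source and unseen profiles (the actual figures use the real expression and microbiome data):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
source = rng.normal(0.0, 1.0, size=(60, 50))   # stand-in for source profiles
unseen = rng.normal(0.5, 1.0, size=(60, 50))   # stand-in for test (unseen) profiles

X = np.vstack([source, unseen])
# Perplexity must be smaller than the number of samples.
embedding = TSNE(n_components=2, perplexity=20.0, random_state=0).fit_transform(X)
```

Plotting the first 60 rows of the embedding in one color and the rest in another reproduces the source-versus-unseen comparison underlying Figures B9-B15.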
Figure B1. Visualization of tumor gene expression profiles of melanoma patients. a-b, The columns are
unordered genes before pre-processing, and each row indicates the profile of responder (a) or non-
responder (b). c-d, The columns are re-ordered genes with 4 clusters derived from feature-wise clustering.
e-f, The augmented profiles generated by DeepBioGen.
Figure B2. Visualization of microbiome marker profiles of diabetic patients and healthy controls. a-b, The columns are unordered marker features before pre-processing, and each row indicates the profile of a healthy control (a) or a type 2 diabetes patient (b). c-d, The columns are re-ordered marker features with 4 clusters derived from feature-wise clustering. e-f, The augmented profiles generated by DeepBioGen.
Figure B3. Prediction performance on unseen data (AUPRC). a-b, Results of anti-PD1 therapy response prediction on unseen data by the state-of-the-art and baseline classifiers (gray) and by classifiers generalized with DeepBioGen (red), SMOTE (green), GMM (yellow), and Random augmentation (blue); Classification algorithms: Support Vector Machine (SVM) and Neural Network (NN), a multi-layer perceptron; Evaluation metric: area under the precision-recall curve (AUPRC). c-d, Results of type 2 diabetes prediction on unseen data.
Figure B4. Prediction performance on unseen data with the Random Forest (RF) classifier. a-b, Results of anti-PD1 therapy response prediction on unseen data by the state-of-the-art and baseline classifiers (gray) and by the classifier generalized with DeepBioGen (red), SMOTE (green), GMM (yellow), and Random augmentation (blue); Evaluation metrics: area under the receiver operating characteristic curve (AUROC) and area under the precision-recall curve (AUPRC). c-d, Results of type 2 diabetes prediction on unseen data; Evaluation metrics: AUROC and AUPRC.
Figure B5. Anti-PD1 therapy response prediction performance (AUROC) on unseen data by varying the
number of visual clusters and that of GAN models. a-c, Varying the number of visual clusters while
fixing the number of GANs as five; Yellow star denotes the estimated number of visual clusters. d-f,
Varying the number of GANs while fixing the number of visual clusters as four; Yellow star denotes the estimated number of GANs.
Figure B6. Anti-PD1 therapy response prediction performance (AUPRC) on unseen data by varying the
number of visual clusters and that of GAN models. a-c, Varying the number of visual clusters while
fixing the number of GANs as five; Yellow star denotes the estimated number of visual clusters. d-f,
Varying the number of GANs while fixing the number of visual clusters as four; Yellow star denotes the estimated number of GANs.
Figure B7. Type 2 diabetes prediction performance (AUROC) on unseen data by varying the number of
visual clusters and that of GAN models. a-c, Varying the number of visual clusters while fixing the
number of GANs as eight; Yellow star denotes the estimated number of visual clusters. d-f, Varying the
number of GANs while fixing the number of visual clusters as six; Yellow star denotes the estimated number of GANs.
Figure B8. Type 2 diabetes prediction performance (AUPRC) on unseen data by varying the number of
visual clusters and that of GAN models. a-c, Varying the number of visual clusters while fixing the
number of GANs as eight; Yellow star denotes the estimated number of visual clusters. d-f, Varying the
number of GANs while fixing the number of visual clusters as six; Yellow star denotes the estimated number of GANs.
Figure B9. t-SNE visualization of augmented tumor expression profiles derived from Random
augmentation along with the source and test (unseen) data of melanoma patients treated with anti-PD1
therapy. a, The source (gray) and test data (red). b, The source, test, and augmented data (green). c,
Responders of the source and test data; An empirical boundary of responders of source data (red dotted
line). d, Responders of the source, test, and augmented data. e, Non-responders of the source and test
data; An empirical boundary of non-responders of source data (red dotted line). f, Non-responders of the
source, test, and augmented data.
Figure B10. t-SNE visualization of augmented tumor expression profiles derived from GMM along with
the source and test (unseen) data of melanoma patients treated with anti-PD1 therapy. a, The source
(gray) and test data (red). b, The source, test, and augmented data (green). c, Responders of the source
and test data; An empirical boundary of responders of source data (red dotted line). d, Responders of the
source, test, and augmented data. e, Non-responders of the source and test data; An empirical boundary of
non-responders of source data (red dotted line). f, Non-responders of the source, test, and augmented data.
Figure B11. t-SNE visualization of augmented tumor expression profiles derived from SMOTE along
with the source and test (unseen) data of melanoma patients treated with anti-PD1 therapy. a, The source
(gray) and test data (red). b, The source, test, and augmented data (green). c, Responders of the source
and test data; An empirical boundary of responders of source data (red dotted line). d, Responders of the
source, test, and augmented data. e, Non-responders of the source and test data; An empirical boundary of
non-responders of source data (red dotted line). f, Non-responders of the source, test, and augmented data.
Figure B12. t-SNE visualization of augmented microbiome profiles derived from DeepBioGen along
with the source and test (unseen) data of diabetic patients and healthy controls. a, The source (gray) and
test data (red). b, The source, test, and augmented data (green). c, Healthy controls of the source and test
data; An empirical boundary of healthy controls of source data (red dotted line). d, Healthy controls of the
source, test, and augmented data. e, Type 2 diabetes patients of the source and test data; An empirical
boundary of type 2 diabetes patients of source data (red dotted line). f, Type 2 diabetes patients of the
source, test, and augmented data.
Figure B13. t-SNE visualization of augmented microbiome profiles derived from Random augmentation
along with the source and test (unseen) data of diabetic patients and healthy controls. a, The source (gray)
and test data (red). b, The source, test, and augmented data (green). c, Healthy controls of the source and
test data; An empirical boundary of healthy controls of source data (red dotted line). d, Healthy controls
of the source, test, and augmented data. e, Type 2 diabetes patients of the source and test data; An
empirical boundary of type 2 diabetes patients of source data (red dotted line). f, Type 2 diabetes patients
of the source, test, and augmented data.
Figure B14. t-SNE visualization of augmented microbiome profiles derived from GMM along with the
source and test (unseen) data of diabetic patients and healthy controls. a, The source (gray) and test data
(red). b, The source, test, and augmented data (green). c, Healthy controls of the source and test data; An
empirical boundary of healthy controls of source data (red dotted line). d, Healthy controls of the source,
test, and augmented data. e, Type 2 diabetes patients of the source and test data; An empirical boundary
of type 2 diabetes patients of source data (red dotted line). f, Type 2 diabetes patients of the source, test,
and augmented data.
Figure B15. t-SNE visualization of augmented microbiome profiles derived from SMOTE along with the
source and test (unseen) data of diabetic patients and healthy controls. a, The source (gray) and test data
(red). b, The source, test, and augmented data (green). c, Healthy controls of the source and test data; An
empirical boundary of healthy controls of source data (red dotted line). d, Healthy controls of the source,
test, and augmented data. e, Type 2 diabetes patients of the source and test data; An empirical boundary
of type 2 diabetes patients of source data (red dotted line). f, Type 2 diabetes patients of the source, test,
and augmented data.
Figure B16. Feature-wise WSS by the number of clusters. a, RNA-seq tumor expression profiles
(optimum: 4). b, WGS human gut microbiome marker profile (optimum: 6).
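The WSS-versus-k curve in Figure B16 is the elbow method: k-means within-cluster sum of squares (inertia) is computed for increasing k, and the knee of the curve is taken as the cluster-number estimate. A sketch on synthetic vectors with four planted clusters (not the thesis's feature data):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Four well-separated synthetic clusters of 40 points each in 5 dimensions.
X = np.vstack([rng.normal(c, 0.3, size=(40, 5)) for c in (0.0, 2.0, 4.0, 6.0)])

# Within-cluster sum of squares (WSS) for k = 1..8.
wss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in range(1, 9)]
```

The curve drops steeply until k reaches the true cluster count (four here) and flattens afterward, which is the knee that Figures B16 and B18 locate.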
Figure B17. Conditional Wasserstein GAN architecture in DeepBioGen. a, A generator network generates realistic profiles from random noise. b, A critic network distinguishes generated profiles from real ones.
Figure B18. Sample-wise WSS by the number of GANs. a, RNA-seq tumor expression profiles (optimum: 5). b, WGS human gut microbiome marker profile (optimum: 8).
Table B1. Summary of sequencing data sets.
Data type | Role | Year | # of samples | # of class 0* | # of class 1** | Sequencing platform | Reference
RNA-seq tumor expression profile | Source | 2016 | 28 | 15 | 13 | Illumina HiSeq 2000 | Hugo et al. [81]
RNA-seq tumor expression profile | Source | 2017 | 98 | 54 | 44 | Illumina HiSeq 2000/2500 | Riaz et al. [82]
RNA-seq tumor expression profile | Test | 2019 | 50 | 30 | 20 | Illumina HiSeq 2500 | Gide et al. [78]
WGS human gut microbiome profile | Source | 2012 | 344 | 174 | 170 | Illumina Genome Analyzer II | Qin et al. [32]
WGS human gut microbiome profile | Test | 2014 | 96 | 43 | 53 | Illumina HiSeq 2000 | Karlsson et al. [31]
*Responders of anti-PD1 therapy for tumor expression profile, or healthy controls for microbiome profile
**Non-responders of anti-PD1 therapy for tumor expression profile, or type 2 diabetes for microbiome profile
Table B2. Hyper-parameter grid for optimizing classifiers.
Classification algorithm | Hyper-parameter | Parameter grid
SVM | Kernel | Linear and radial basis function (RBF)
SVM | Regularization penalty C | 2^-4, 2^-3, 2^-2, 2^-1, 2^0, 2^1, 2^2, and 2^4
SVM | Gamma | 'scale' (= 1 / (n_features * X.var())) and 'auto' (= 1 / n_features)
RF | # of estimators | 2^7, 2^8, 2^9, and 2^10
RF | Maximum # of features for the best split | Square root and log2 of n_features
RF | Split criterion | Gini impurity and information gain
NN | Hidden layers (hidden units) | 3 layers (128, 64, 32), 4 layers (128, 64, 32, 16), and 5 layers (128, 64, 32, 16, 8)
NN | Learning rate | Constant (0.001); invscaling (0.001 / pow(t, power_t), where t is the time step); adaptive (keep the learning rate while the training loss is decreasing, otherwise divide the current learning rate by 5)
NN | Alpha (L2 penalty) | 0.0001, 0.001, 0.01, and 0.1
Table B3. Modified inception scores of generated sequencing profiles varying the number of conditional
Wasserstein GANs.
# of GANs | RNA-seq tumor expression profile | WGS human gut microbiome profile
1 1.0764 1.1923
2 1.0745 1.2046
3 1.0745 1.2079
4 1.0746 1.2048
5 1.0779 1.2060
6 1.0780 1.2061
7 1.0771 1.2071
8 1.0779 1.2042
9 1.0778 1.2031
10 1.0761 1.2035
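For reference, the standard inception score is exp(E_x[KL(p(y|x) || p(y))]), computed from a classifier's posteriors over generated samples; the modified variant used in Table B3 may differ in its details, so the sketch below shows only the standard formula:

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Standard inception score exp(E_x[KL(p(y|x) || p(y))]) from an
    (n_samples, n_classes) matrix of classifier posteriors."""
    p_y = probs.mean(axis=0, keepdims=True)  # marginal label distribution
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))

uniform = np.full((100, 2), 0.5)          # uninformative posteriors -> score 1
confident = np.tile(np.eye(2), (50, 1))   # confident and diverse -> score = #classes
```

Higher scores indicate samples that are individually classified confidently while covering both classes, which is why the score is used to compare generator ensembles of different sizes.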
Appendix C
Contents
• Table C1. Hyper-parameter grid for optimizing classifiers
• Table C2. AUC of the classifiers trained with different approaches
Table C1. Hyper-parameter grid for optimizing classifiers
Classification algorithm | Hyper-parameter | Parameter grid
SVM | Kernel | Linear and radial basis function (RBF)
SVM | Regularization penalty C | 2^-4, 2^-3, 2^-2, 2^-1, 2^0, 2^1, 2^2, and 2^4
SVM | Gamma | 'scale' (= 1 / (n_features * X.var())) and 'auto' (= 1 / n_features)
RF | # of estimators | 2^7, 2^8, 2^9, and 2^10
RF | Maximum # of features for the best split | Square root and log2 of n_features
RF | Split criterion | Gini impurity and information gain
NN | Hidden layers (hidden units) | 3 layers (128, 64, 32), 4 layers (128, 64, 32, 16), and 5 layers (128, 64, 32, 16, 8)
NN | Learning rate | Constant (0.001); invscaling (0.001 / pow(t, power_t), where t is the time step); adaptive (keep the learning rate while the training loss is decreasing, otherwise divide the current learning rate by 5)
NN | Alpha (L2 penalty) | 0.0001, 0.001, 0.01, and 0.1
- SVM: support vector machine; RF: random forest; NN: feedforward neural network
Table C2. AUC of the classifiers trained with different approaches
Approach | SVM | RF | NN
Limeta et al. | - | 0.624 | -
No FS | 0.667 | 0.543 | 0.531
FS only | 0.673 | 0.574 | 0.679
FS + AE | 0.698 | 0.673 | 0.605
DeepGeni (FS + DBG + AE) | 0.744 | 0.673 | 0.772
- FS: feature selection; AE: autoencoder; DBG: DeepBioGen
References
1. Schork, N.J., Personalized medicine: time for one-person trials. Nature, 2015. 520(7549): p. 609-
611.
2. Collins, F.S. and H. Varmus, A new initiative on precision medicine. New England journal of
medicine, 2015. 372(9): p. 793-795.
3. Council, N.R., Toward precision medicine: building a knowledge network for biomedical research
and a new taxonomy of disease. 2011: National Academies Press.
4. Ashley, E.A., et al., Clinical assessment incorporating a personal genome. The Lancet, 2010.
375(9725): p. 1525-1535.
5. Worthey, E.A., et al., Making a definitive diagnosis: successful clinical application of whole exome
sequencing in a child with intractable inflammatory bowel disease. Genetics in Medicine, 2011.
13(3): p. 255-262.
6. Ashley, E.A., Towards precision medicine. Nature Reviews Genetics, 2016. 17(9): p. 507.
7. Lin, E. and H.-Y. Lane, Machine learning and systems genomics approaches for multi-omics data.
Biomarker research, 2017. 5(1): p. 2.
8. Xie, B., et al., MOBCdb: a comprehensive database integrating multi-omics data on breast cancer
for precision medicine. Breast cancer research and treatment, 2018. 169(3): p. 625-632.
9. LeCun, Y., Y. Bengio, and G. Hinton, Deep learning. nature, 2015. 521(7553): p. 436-444.
10. LeCun, Y. and M. Ranzato. Deep learning tutorial. in Tutorials in International Conference on
Machine Learning (ICML’13). 2013. Citeseer.
11. Min, S., B. Lee, and S. Yoon, Deep learning in bioinformatics. Briefings in bioinformatics, 2017.
18(5): p. 851-869.
12. Miotto, R., et al., Deep learning for healthcare: review, opportunities and challenges. Briefings in
bioinformatics, 2018. 19(6): p. 1236-1246.
13. Ravì, D., et al., Deep learning for health informatics. IEEE journal of biomedical and health
informatics, 2016. 21(1): p. 4-21.
14. Russakovsky, O., et al., Imagenet large scale visual recognition challenge. International journal of
computer vision, 2015. 115(3): p. 211-252.
15. Hannouf, M., et al., Cost-effectiveness of using a gene expression profiling test to aid in identifying
the primary tumour in patients with cancer of unknown primary. The pharmacogenomics journal,
2017. 17(3): p. 286-300.
16. Street, W., Cancer Facts & Figures 2019. Am. Cancer Soc, 2018. 76.
17. Cho, I. and M.J. Blaser, The human microbiome: at the interface of health and disease. Nature
Reviews Genetics, 2012. 13(4): p. 260.
18. Huttenhower, C., et al., Structure, function and diversity of the healthy human microbiome. nature,
2012. 486(7402): p. 207.
19. McQuade, J.L., et al., Modulating the microbiome to improve therapeutic response in cancer. The
Lancet Oncology, 2019. 20(2): p. e77-e91.
20. Eloe-Fadrosh, E.A. and D.A. Rasko, The human microbiome: from symbiosis to pathogenesis.
Annual review of medicine, 2013. 64: p. 145-163.
21. Hamady, M. and R. Knight, Microbial community profiling for human microbiome projects: tools,
techniques, and challenges. Genome research, 2009. 19(7): p. 1141-1152.
22. Scholz, M., et al., Strain-level microbial epidemiology and population genomics from shotgun
metagenomics. Nature methods, 2016. 13(5): p. 435.
23. Truong, D.T., et al., MetaPhlAn2 for enhanced metagenomic taxonomic profiling. Nature methods,
2015. 12(10): p. 902-903.
24. Kramer, M.A., Nonlinear principal component analysis using autoassociative neural networks.
AIChE journal, 1991. 37(2): p. 233-243.
25. Nguyen, T.H., et al., Deep learning for metagenomic data: using 2d embeddings and convolutional
neural networks. arXiv preprint arXiv:1712.00244, 2017.
26. Nguyen, T.H., et al., Disease classification in metagenomics with 2d embeddings and deep learning.
arXiv preprint arXiv:1806.09046, 2018.
27. Pasolli, E., et al., Machine learning meta-analysis of large metagenomic datasets: tools and
biological insights. PLoS computational biology, 2016. 12(7): p. e1004977.
28. Cawley, G.C. and N.L. Talbot, On over-fitting in model selection and subsequent selection bias in
performance evaluation. Journal of Machine Learning Research, 2010. 11(Jul): p. 2079-2107.
29. Varma, S. and R. Simon, Bias in error estimation when using cross-validation for model selection.
BMC bioinformatics, 2006. 7(1): p. 91.
30. Qin, J., et al., A human gut microbial gene catalogue established by metagenomic sequencing.
nature, 2010. 464(7285): p. 59.
31. Karlsson, F.H., et al., Gut metagenome in European women with normal, impaired and diabetic
glucose control. Nature, 2013. 498(7452): p. 99-103.
32. Qin, J., et al., A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature,
2012. 490(7418): p. 55-60.
33. Le Chatelier, E., et al., Richness of human gut microbiome correlates with metabolic markers.
Nature, 2013. 500(7464): p. 541.
34. Qin, N., et al., Alterations of the human gut microbiome in liver cirrhosis. Nature, 2014. 513(7516):
p. 59.
35. Zeller, G., et al., Potential of fecal microbiota for early‐stage detection of colorectal cancer.
Molecular systems biology, 2014. 10(11): p. 766.
36. Glorot, X. and Y. Bengio. Understanding the difficulty of training deep feedforward neural
networks. in Proceedings of the thirteenth international conference on artificial intelligence and
statistics. 2010.
37. Kingma, D.P. and M. Welling, Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114,
2013.
38. Li, F., H. Qiao, and B. Zhang, Discriminatively boosted image clustering with fully convolutional
auto-encoders. Pattern Recognition, 2018. 83: p. 161-173.
39. Krizhevsky, A., I. Sutskever, and G.E. Hinton. Imagenet classification with deep convolutional
neural networks. in Advances in neural information processing systems. 2012.
40. Kingma, D.P. and J. Ba, Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980, 2014.
41. Cortes, C. and V. Vapnik, Support-vector networks. Machine learning, 1995. 20(3): p. 273-297.
42. Pearson, K., LIII. On lines and planes of closest fit to systems of points in space. The London,
Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 1901. 2(11): p. 559-572.
43. Bingham, E. and H. Mannila. Random projection in dimensionality reduction: applications to
image and text data. in Proceedings of the seventh ACM SIGKDD international conference on
Knowledge discovery and data mining. 2001. ACM.
44. Dasgupta, S., Experiments with random projection. arXiv preprint arXiv:1301.3849, 2013.
45. Dasgupta, S. and A. Gupta, An elementary proof of the Johnson-Lindenstrauss lemma. International
Computer Science Institute, Technical Report, 1999. 22(1): p. 1-5.
46. Saito, T. and M. Rehmsmeier, The precision-recall plot is more informative than the ROC plot
when evaluating binary classifiers on imbalanced datasets. PLoS one, 2015. 10(3): p. e0118432.
47. Mazurowski, M.A., et al., Training neural network classifiers for medical decision making: The
effects of imbalanced datasets on classification performance. Neural networks, 2008. 21(2-3): p.
427-436.
48. Baker, M., 1,500 scientists lift the lid on reproducibility. Nature, 2016. 533(7604).
49. Bernau, C., et al., Cross-study validation for the assessment of prediction algorithms.
Bioinformatics, 2014. 30(12): p. i105-i112.
50. Castaldi, P.J., I.J. Dahabreh, and J.P. Ioannidis, An empirical assessment of validation practices for
molecular classifiers. Briefings in bioinformatics, 2011. 12(3): p. 189-202.
51. Collins, F.S. and L.A. Tabak, Policy: NIH plans to enhance reproducibility. Nature, 2014.
505(7485): p. 612-613.
52. Mattsson-Carlgren, N., et al., Increasing the reproducibility of fluid biomarker studies in
neurodegenerative studies. Nature communications, 2020. 11(1): p. 1-11.
53. Leek, J.T., et al., Tackling the widespread and critical impact of batch effects in high-throughput
data. Nature Reviews Genetics, 2010. 11(10): p. 733-739.
54. Ganin, Y., et al., Domain-adversarial training of neural networks. The Journal of Machine
Learning Research, 2016. 17(1): p. 2096-2030.
55. Hoffman, J., et al. Cycada: Cycle-consistent adversarial domain adaptation. in Proceedings of
the International Conference on Machine Learning. 2018. ICML.
56. Saenko, K., et al. Adapting visual category models to new domains. in Proceedings of the European
Conference on Computer Vision. 2010. ECCV.
57. Li, H., et al. Domain generalization with adversarial feature learning. in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition. 2018. CVPR.
58. Li, Y., et al. Deep domain generalization via conditional invariant adversarial networks. in
Proceedings of the European Conference on Computer Vision. 2018. ECCV.
59. Matsuura, T. and T. Harada. Domain Generalization Using a Mixture of Multiple Latent Domains.
in Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence. 2020. AAAI.
60. Carlucci, F.M., et al. Domain generalization by solving jigsaw puzzles. in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition. 2019. CVPR.
61. Li, D., et al. Learning to generalize: Meta-learning for domain generalization. in Proceedings of
the Thirty-Second AAAI Conference on Artificial Intelligence. 2018. AAAI.
62. Shankar, S., et al. Generalizing Across Domains via Cross-Gradient Training. in Proceedings of
the International Conference on Learning Representations. 2018. ICLR.
63. Volpi, R., et al. Generalizing to unseen domains via adversarial data augmentation. in Proceedings
of the 32nd International Conference on Neural Information Processing Systems. 2018.
64. Antoniou, A., A. Storkey, and H. Edwards, Data augmentation generative adversarial networks.
arXiv preprint arXiv:1711.04340, 2017.
65. Wong, S.C., et al. Understanding data augmentation for classification: when to warp? in
Proceedings of the International Conference on Digital Image Computing: techniques and
applications. 2016. IEEE DICTA.
66. Zhang, X., et al. Dada: Deep adversarial data augmentation for extremely low data regime
classification. in Proceedings of the International Conference on Acoustics, Speech and Signal
Processing. 2019. IEEE ICASSP.
67. Goodfellow, I., et al., Generative adversarial nets. Advances in neural information processing
systems, 2014. 27: p. 2672-2680.
68. Calimeri, F., et al. Biomedical data augmentation using generative adversarial neural networks. in
International conference on artificial neural networks. 2017. Springer.
69. Sandfort, V., et al., Data augmentation using generative adversarial networks (CycleGAN) to
improve generalizability in CT segmentation tasks. Scientific reports, 2019. 9(1): p. 1-9.
70. Madani, A., et al. Chest x-ray generation and data augmentation for cardiovascular abnormality
classification. in Proceedings of the International Society for Optics and Photonics. 2018.
71. Marouf, M., et al., Realistic in silico generation and augmentation of single-cell RNA-seq data
using generative adversarial networks. Nature communications, 2020. 11(1): p. 1-12.
72. Emilsson, V., et al., Genetics of gene expression and its effect on disease. Nature, 2008. 452(7186):
p. 423-428.
73. Jiang, P., et al., Signatures of T cell dysfunction and exclusion predict cancer immunotherapy
response. Nature medicine, 2018. 24(10): p. 1550-1558.
74. Auslander, N., et al., Robust prediction of response to immune checkpoint blockade therapy in
metastatic melanoma. Nature medicine, 2018. 24(10): p. 1545-1549.
75. Oh, M. and L. Zhang, DeepMicro: deep representation learning for disease prediction based on
microbiome data. Scientific reports, 2020. 10(1): p. 1-9.
76. Reynolds, D.A., T.F. Quatieri, and R.B. Dunn, Speaker verification using adapted Gaussian
mixture models. Digital signal processing, 2000. 10(1-3): p. 19-41.
77. Chawla, N.V., et al., SMOTE: synthetic minority over-sampling technique. Journal of artificial
intelligence research, 2002. 16: p. 321-357.
78. Gide, T.N., et al., Distinct immune cell populations define response to anti-PD-1 monotherapy and
anti-PD-1/anti-CTLA-4 combined therapy. Cancer cell, 2019. 35(2): p. 238-255.e6.
79. Thorndike, R.L., Who belongs in the family? Psychometrika, 1953. 18(4): p. 267-276.
80. Maaten, L.v.d. and G. Hinton, Visualizing data using t-SNE. Journal of Machine Learning
Research, 2008. 9(Nov): p. 2579-2605.
81. Hugo, W., et al., Genomic and transcriptomic features of response to anti-PD-1 therapy in
metastatic melanoma. Cell, 2016. 165(1): p. 35-44.
82. Riaz, N., et al., Tumor and microenvironment evolution during immunotherapy with nivolumab.
Cell, 2017. 171(4): p. 934-949.e16.
83. Geurts, P., D. Ernst, and L. Wehenkel, Extremely randomized trees. Machine learning, 2006. 63(1):
p. 3-42.
84. Arjovsky, M., S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. in
International conference on machine learning. 2017. PMLR.
85. Gulrajani, I., et al., Improved training of wasserstein gans. arXiv preprint arXiv:1704.00028, 2017.
86. Ben-David, S., et al., Analysis of representations for domain adaptation. Advances in neural
information processing systems, 2007. 19: p. 137.
87. Kifer, D., S. Ben-David, and J. Gehrke. Detecting change in data streams. in VLDB. 2004. Toronto,
Canada.
88. Salimans, T., et al. Improved techniques for training GANs. in Proceedings of the 30th
International Conference on Neural Information Processing Systems. 2016.
89. Gurumurthy, S., R. Kiran Sarvadevabhatla, and R. Venkatesh Babu. Deligan: Generative
adversarial networks for diverse and limited data. in Proceedings of the IEEE conference on
computer vision and pattern recognition. 2017.
90. Barratt, S. and R. Sharma, A note on the inception score. arXiv preprint arXiv:1801.01973, 2018.
91. Gopalakrishnan, V., et al., Gut microbiome modulates response to anti–PD-1 immunotherapy in
melanoma patients. Science, 2018. 359(6371): p. 97-103.
92. Matson, V., et al., The commensal microbiome is associated with anti–PD-1 efficacy in metastatic
melanoma patients. Science, 2018. 359(6371): p. 104-108.
93. Routy, B., et al., Gut microbiome influences efficacy of PD-1–based immunotherapy against
epithelial tumors. Science, 2018. 359(6371): p. 91-97.
94. Marcus, L., et al., FDA approval summary: pembrolizumab for the treatment of microsatellite
instability-high solid tumors. Clinical Cancer Research, 2019. 25(13): p. 3753-3758.
95. Baruch, E.N., et al., Fecal microbiota transplant promotes response in immunotherapy-refractory
melanoma patients. Science, 2021. 371(6529): p. 602-609.
96. Davar, D., et al., Fecal microbiota transplant overcomes resistance to anti–PD-1 therapy in
melanoma patients. Science, 2021. 371(6529): p. 595-602.
97. Shaikh, F.Y., J.J. Gills, and C.L. Sears, Impact of the microbiome on checkpoint inhibitor treatment
in patients with non-small cell lung cancer and melanoma. EBioMedicine, 2019. 48: p. 642-647.
98. Chaput, N., et al., Baseline gut microbiota predicts clinical response and colitis in metastatic
melanoma patients treated with ipilimumab. Annals of Oncology, 2017. 28(6): p. 1368-1379.
99. Frankel, A.E., et al., Metagenomic shotgun sequencing and unbiased metabolomic profiling
identify specific human gut microbiota and metabolites associated with immune checkpoint therapy
efficacy in melanoma patients. Neoplasia, 2017. 19(10): p. 848-855.
100. Vétizou, M., et al., Anticancer immunotherapy by CTLA-4 blockade relies on the gut microbiota.
Science, 2015. 350(6264): p. 1079-1084.
101. Limeta, A., et al., Meta-analysis of the gut microbiota in predicting response to cancer
immunotherapy in metastatic melanoma. JCI insight, 2020. 5(23).
102. Wang, J., et al., Generalizing to Unseen Domains: A Survey on Domain Generalization. arXiv
preprint arXiv:2103.03097, 2021.
103. Cammarota, G., et al., Gut microbiome, big data and machine learning to promote precision
medicine for cancer. Nature Reviews Gastroenterology & Hepatology, 2020. 17(10): p. 635-648.
104. Wilkinson, J., et al., Time to reality check the promises of machine learning-powered precision
medicine. The Lancet Digital Health, 2020.
105. Wang, F., R. Kaushal, and D. Khullar, Should health care demand interpretable artificial
intelligence or accept “black box” medicine? 2020, American College of Physicians.
106. Svensson, V., et al., Interpretable factor models of single-cell RNA-seq via variational
autoencoders. Bioinformatics, 2020. 36(11): p. 3418-3421.
107. Peters, B.A., et al., Relating the gut metagenome and metatranscriptome to immunotherapy
responses in melanoma patients. Genome medicine, 2019. 11(1): p. 1-14.
108. Eisenhauer, E.A., et al., New response evaluation criteria in solid tumours: revised RECIST
guideline (version 1.1). European journal of cancer, 2009. 45(2): p. 228-247.
109. Chen, S., et al., fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics, 2018. 34(17):
p. i884-i890.
110. Milanese, A., et al., Microbial abundance, activity and population genomic profiling with mOTUs2.
Nature communications, 2019. 10(1): p. 1-11.