the university of chicago visual representation …

30
THE UNIVERSITY OF CHICAGO VISUAL REPRESENTATION LEARNING IN GENOMICS A DISSERTATION SUBMITTED TO THE FACULTY OF THE DIVISION OF THE THESIS DIVISION IN CANDIDACY FOR THE DEGREE OF TYPE OF DEGREE DEPARTMENT OF THESIS DEPARTMENT BY THESIS AUTHOR CHICAGO, ILLINOIS GRADUATION DATE

Upload: others

Post on 31-May-2022

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: THE UNIVERSITY OF CHICAGO VISUAL REPRESENTATION …

THE UNIVERSITY OF CHICAGO

VISUAL REPRESENTATION LEARNING IN GENOMICS

A DISSERTATION SUBMITTED TO

THE FACULTY OF THE DIVISION OF THE THESIS DIVISION

IN CANDIDACY FOR THE DEGREE OF

TYPE OF DEGREE

DEPARTMENT OF THESIS DEPARTMENT

BY

THESIS AUTHOR

CHICAGO, ILLINOIS

GRADUATION DATE

Page 2: THE UNIVERSITY OF CHICAGO VISUAL REPRESENTATION …

Copyright c© 2020 by Thesis Author

All Rights Reserved

Page 3: THE UNIVERSITY OF CHICAGO VISUAL REPRESENTATION …

TABLE OF CONTENTS

LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv

LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v

ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi

ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii

1 PROBLEM STATEMENT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

2 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22.1 Genomic Islands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2.1.1 Background and motivation . . . . . . . . . . . . . . . . . . . . . . . 22.1.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22.1.3 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2.2 Operons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62.2.1 Background and Related Work . . . . . . . . . . . . . . . . . . . . . 62.2.2 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

3 METHODS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93.1 From Genes to Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93.2 Why Images? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113.3 Deep Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

4 SUMMARY OF WORK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

iii

Page 4: THE UNIVERSITY OF CHICAGO VISUAL REPRESENTATION …

LIST OF FIGURES

3.1 Examples of images generated from the compare region viewer. Each directedarrow represents a gene color coded to match its functionality. The first rowis the genome neighborhood of the focus gene (red), and the subsequent rowsrepresent anchored regions from similar genomes sorted by their phylogeneticdistances to the query genome. . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

3.2 Example of an image generated by our service to be fed as input to the neuralnetwork. Each arrow represents a single gene. Each row captures the area ofinterest in a genome. The query genome is the top row. The rest of the rows aregenomes selected by evolutionary distance. The query gene pair are color-codedwith blue, and the inter-genetic distance between them is colored red. The restof the genes share the same color if they belong to the same family. . . . . . . . 12

3.3 Architecture of the ResNet18 Neural Network used to predict operons. . . . . . 14

iv

Page 5: THE UNIVERSITY OF CHICAGO VISUAL REPRESENTATION …

LIST OF TABLES

v

Page 6: THE UNIVERSITY OF CHICAGO VISUAL REPRESENTATION …

ACKNOWLEDGMENTS

vi

Page 7: THE UNIVERSITY OF CHICAGO VISUAL REPRESENTATION …

ABSTRACT

The advent of available data on the rise, and the recent breakthroughs in hardware and soft-

ware technologies over the past years, helped Machine Learning(ML) establish itself as the

golden standard across many domains, with its applications’ performance beating previous

state of the art methods. One of the areas where this is directly noticeable and that receives

significant attention justified by great improvements in the field, is computer vision. In this

thesis, We demonstrate that even problems that are not intuitively identified as part of the

computer vision domain, can leverage the computer vision technologies by being mapped

into the domain space. In addition, and through a technique called transfer learning, deep

neural networks that are powering the advances in computer vision applications can be used

in areas that suffer from having limited datasets and lack of ground-truth data, which is

the cornerstone for any supervised learning task. Transfer learning works by training these

deep models first on large - even if irrelevant- datasets, and then retraining(usually the last

layers) on the limited datasets corresponding to the application of interest. For example, a

deep neural network can first be trained on Imagenet, a well-known dataset that contains

more than a million images, and then re-trained on any other image dataset over which

classification is to be performed. This approach has been shown to be superior over training

a model starting with random weights. It also alleviates the risk of overfitting and makes

it possible to use deep models when the datasets are extremely limited. Even with genomic

data being constantly on the rise, having meaningful and labeled data is a constant chal-

lenge, which makes supervised learning applications on some bioinformatics problems more

infeasible. There are numerous problems that suffer from limited datasets in bioinformatics,

we demonstrate the effectiveness of transfer learning on two of these problems, after mapping

them to computer vision tasks. We further show that the results we achieve are better than

those reported by state of the art methods, that rely mostly on tabular data. The two prob-

lems we present are identifying genomic islands, and operons, in bacterial genomes. These

are two well-known genome annotation problems, known for being experimentally costly and

vii

Page 8: THE UNIVERSITY OF CHICAGO VISUAL REPRESENTATION …

to suffer from the lack of extensive verified data-sets. We conclude this research with a con-

jecture of why these problems perform better when represented as image-related problems,

than their tabular counterparts, even in cases where the underlying data is supposedly the

same.

viii

Page 9: THE UNIVERSITY OF CHICAGO VISUAL REPRESENTATION …

CHAPTER 1

PROBLEM STATEMENT

In this research, we demonstrate the effectiveness and superiority of using computer vision

technologies on visual representations of tabular data that may not be intuitively repre-

sented as images. Such an advantage may owe its favor in part to the superior models that

become accessible once the underlying problem is translated into the domain of computer

vision. These deep models can be utilized through a technique called transfer learning. Such

an approach becomes very important in areas where verified datasets are limited, even if

the problems may not intuitively be those of computer vision. Biology offers a myriad of

such problems. In this thesis, we focus on two problems: identifying genomic islands and

identifying operons in prokaryotic genomes. We present our methods (Shutter Island for

Genomic Islands, Operon Hunter for operons), that use a deep learning model widely used

in computer vision, to detect genomic islands and operons respectively. The intrinsic value

of deep learning methods lies in their ability to generalize. Via transfer learning, the model

is pre-trained on a large generic dataset and then re-trained on images that we generate to

represent genomic fragments. Using manually verified reference datasets, we demonstrate

that this image-based approach generalizes better and is superior to the existing tools. We

argue that this approach could also be applied to other problems in the domain where data

is lacking, and end with a discussion of why mapping tabular data into images could be

producing better results.

1

Page 10: THE UNIVERSITY OF CHICAGO VISUAL REPRESENTATION …

CHAPTER 2

INTRODUCTION

2.1 Genomic Islands

2.1.1 Background and motivation

Horizontal gene transfer is the main source of adaptability for bacteria, through which genes

are obtained from different sources including bacteria, archaea, viruses, and eukaryotes. This

process promotes the rapid spread of genetic information across lineages, typically in the form

of clusters of genes referred to as genomic islands (GIs). Different types of GIs exist, often

classified by the content of their cargo genes or their means of integration and mobility. A

genomic island (GI) then is a cluster of genes that is typically between 10 kb and 200 kb in

length and has been transferred horizontally [11].

Horizontal gene transfer (HGT) may contribute to anywhere between 1.6% and 32.6% of

the genomes [12, 13, 14, 15, 16, 17, 18, 19, 20]. This percentage implies that a major factor

in the variability across bacterial species and clades can be attributed to GIs [21]. Thus,

they impose an additional challenge to our ability to reconstruct the evolutionary tree of

life. The identification of GIs is also important for the advancement of medicine, by helping

develop new vaccines and antibiotics[22, 23] or cancer therapies[24]. For example, knowing

that PAIs can carry many pathogenicity genes and virulence genes [25, 26, 27], researchers

found that potential vaccine candidates resided within PAIs [28]. Various computational

methods have been devised to detect different types of GIs, but no single method currently

is capable of detecting all GIs.

2.1.2 Related Work

While early methods focused on manual inspection of disrupted genomic regions that may

resemble GI attachment sites [29] or show unusual nucleotide content [30, 31], the most

2

Page 11: THE UNIVERSITY OF CHICAGO VISUAL REPRESENTATION …

recent computational methods fall into two broad categories: methods that involve sequence

composition and methods that involve comparative genomics [1]. They both focus on one or

more of the features that make GIs distinct, such as compositional bias, mobility elements,

and transfer RNA (tRNA) hotspots [5, 7, 11, 25, 26, 32, 33, 34]. We discuss some of these

features in more detail; they are listed by decreasing order of importance in [1, 35].

• One of the most important features of GIs is that they are sporadically distributed;

that is, they are found only in certain isolates from a given strain or species.

• Since GIs are transferred horizontally across lineages and since different bacterial lin-

eages have different sequence compositions, measures such as GC content or, more gen-

erally, oligonucleotides of various lengths (usually 2–9 nucleotides) are being used [36,

37, 38, 39, 40, 41, 42]. Codon usage is a well-known metric, which is the special case

of oligonucleotides of length 3.

• Since the probability of having outlying measurements decreases as the size of the

region increases, tools usually use cut-off values for the minimum size of a region (or

gene cluster) to be identified as a GI.

• Another type of evidence comes not from the attachment sites but from what is in

between, since some genes (e.g., integrases, transposases, phage genes) are known to

be associated with GIs [25].

• In addition to the size of the cluster, evidence from mycobacterial phages [43] suggests

that the size of the genes themselves is shorter in GIs than in the rest of the bacterial

genome. Different theories suggest that this may confer mobility or packaging or

replication advantages [8].

• Some GIs integrate specifically into genomic sites such as tRNA genes, introducing

flanking direct repeats. Thus, the presence of such sites and repeats may be used as

evidence for the presence of GIs [45, 46, 47].

3

Page 12: THE UNIVERSITY OF CHICAGO VISUAL REPRESENTATION …

Other research suggests that the directionality of the transcriptional strand and the protein

length are key features in GI prediction [8]. The available tools focus on one or more of the

mentioned features.

Islander works by first identifying tRNA and transfer-messenger RNA genes and their

fragments as endpoints to candidate regions, then disqualifying candidates through a set of

filters such as sequence length and the absence of an integrase gene [3]. IslandPick identifies

GIs by comparing the query genome with a set of related genomes selected by an evolutionary

distance function [44]. It uses Blast and Mauve for the genome alignment. The outcome

heavily depends on the choice of reference genomes selected. Phaster uses BLAST against

a phage-specific sequence database (the NCBI phage database and the database developed

by Srividhya et al. [48]), followed by DBSCAN [49] to cluster the hits into prophage regions.

IslandPath-DIMOB considers a genomic fragment to be an island if it contains at least one

mobility gene, in addition to 8 or more consecutive open reading frames with di-nucleotide

bias [50]. SIGI-HMM uses the Viterbi algorithm to analyze each gene’s most probable

codon usage states, comparing it against codon tables representing microbial donors or highly

expressed genes, and classifying it as native or non-native accordingly [51]. PAI-IDA uses the

sequence composition features, namely, GC content, codon usage, and dinucleotide frequency,

to detect GIs [52]. Alien Hunter (or IVOM) uses k-mers of variable length to perform its

analysis, assigning more weight to longer k-mers [38]. Phispy uses random forests to classify

windows based on features that include transcription strand directionality, customized AT

and GC skew, protein length, and abundance of phage words [8]. Phage Finder classifies 10

kb windows with more than 3 bacteriophage-related proteins as GIs [53]. IslandViewer is an

ensemble method that combines the results of three other tools—SIGI-HMM, IslandPath-

DIMOB, and IslandPick—into one web resource [54].

4

Page 13: THE UNIVERSITY OF CHICAGO VISUAL REPRESENTATION …

2.1.3 Challenges

No single tool is able to detect all GIs in all bacterial genomes [31]. Methods that narrow

their search to GIs that integrate under certain conditions, such as into tRNAs, miss out

on the other GIs. Similarly, not all GI regions exhibit atypical nucleotide content [22, 55].

Evolutionary events such as gene loss and genomic rearrangement [5] present more challenges.

Also, highly expressed genes (e.g., genes in ribosomal protein operons), or those having an

island host and donor that belong to the same or closely related species, or the fact that

amelioration would pressure even genes from distantly related genomes to adapt to the host

over time would lead to the host and the island exhibiting similar nucleotide composition [56]

and subsequently to false negatives [23]. Tools that use windows face difficulty in adjusting

their size: small sizes lead to large statistical fluctuation, whereas larger sizes result in low

resolution [57].

For comparative genomics methods, the outcomes depend strongly on the choice of

genomes used in the alignment process. Very distant genomes may lead to false positives,

and very close genomes may lead to false negatives. In general, the number of reported GIs

may differ across tools, because one large GI is often reported as a few smaller ones or vice

versa, also making it harder to detect end points and boundaries accurately, even with the

use of hidden Markov models by tools such as AlienHunter and SIGI-HMM.

Furthermore, no reliable GI dataset exists that can validate the predictions of all these

computational methods [38]. Although several databases exist, they usually cover only

specific types of GIs [Islander, PAIDB, ICEberg], which would flag as false positives any

extra predictions made by those tools. Moreover, as Nelson et al. state, “The reliability of

the databases has not been verified by any convincing biological evidence” [6].

5

Page 14: THE UNIVERSITY OF CHICAGO VISUAL REPRESENTATION …

2.2 Operons

2.2.1 Background and Related Work

In prokaryotes, contiguous genes are often arranged into operons. These operons usually

include metabolically or functionally related genes that are often grouped together on the

same strand and transcribed in the same polycistronic messenger RNA [58, 59, 60, 61, 62, 63,

64, 65, 66, 67, 68]. Genes in the same operon share a common promoter and terminator[69].

We reserve the definition of Operons as those including at least two genes, distinguishing

them from Transcription Units, which may include one or more genes[1]. It is estimated

that more than 50% of bacterial genes form operons[70]. Since operons are often formed

by genes that are related functionally or metabolically, predicting operons could help us

predict higher levels of gene organization and regulatory networks[58, 59, 65, 66, 67, 68, 71].

Such predictions could help annotate gene functions[72], aid in drug development[73], and

antibiotic resistance[74].

Different methods use different features of operons to make their predictions. Some meth-

ods focus on promoters and terminators and use Hidden Markov Models(HMMs) to make

their predictions[75, 76, 77], others rely on gene conservation information[78], while others

leverage functional relatedness between the genes instead[58, 79]. Perhaps the most promi-

nent features used in operon prediction are transcription direction and inter-genetic distances

[58, 66, 70, 81, 82, 84, 83, 85, 86, 87, 88, 89, 91]. The idea that adjacent genes that are con-

served across multiple genomes are likely to be co-transcribed makes gene conservation an-

other important feature[67, 92]. In addition to HMMs, different machine learning(ML) tech-

nologies are deployed for the predition of operons, such as neural networks[81, 82], support

vector machines [83], and decision tree-based classification[84]. Other tools utilize Bayesian

probabilities [85, 86, 87], genetic algorithms[88], and graph-theoretic techniques[86, 79].

Many tools perform their predictions on gene pairs, and then assemble their predictions

6

Page 15: THE UNIVERSITY OF CHICAGO VISUAL REPRESENTATION …

that are made on an entire genome into operons by combining contiguous pairs that are

labeled as operonic.[58, 60]. These tools are often benchmarked using two genomes whose

operons have been studied extensively, E. coli and B. subtilis[60].

Among the tools that predict operons, we focus on two methods. The first is reported

to have the highest accuracy among operon prediction tools[64]. It is based on an arti-

ficial neural network that uses inter-genetic distance and protein functional relationships.

The method uses gene neighborhood, fusion, co-occurrence and co-expression in addition to

protein-protein interaction and literature mining to generate a score that can be used as

input to their network[62, 64]. The predictions made by this method were compiled into

what is called the Prokaryotic Operon Database(ProOpDB)[65]. The second tool is called

the Database of Prokaryotic Operons (DOOR) and was ranked as the best operon predictor

among 14 other predictors[59], and was also reported by another study to have the second

highest accuracy after ProOpDB[64]. Their algorithm uses a combination of a non-linear

(decision-tree based) classifier and a linear (logistic function-based classifier) depending on

the number of experimentally validated operons that could be used in the training[59].

2.2.2 Challenges

Before delving into our method, it’s worth pointing out some of the challenges that un-

dermine operon prediction. Predictors that rely on features such as gene conservation or

functional assignment are limited in the sense that they could not be applied to genomes or

genomic fragments that lack these features. So while such predictors might perform well on

gene pairs that include the necessary features, their performance might drop considerably

when considering the whole genome[58]. Moreover, even though most methods validate their

results by comparing their predictions to experimentally verified operons, the fact that the

experimentally verified datasets are not available for all prokaryotes or that the datasets

7

Page 16: THE UNIVERSITY OF CHICAGO VISUAL REPRESENTATION …

used vary between studies poses extra challenges when trying to compare prediction meth-

ods. Brouwer et al tried to compare several methods on a uniform dataset and noticed

significant gap in the metrics reported. The drop was even higher when considering full

operons rather than gene pairs[64, 93]. Finally, Some methods include the testing dataset in

the training dataset which leads to reported accuracies that are significantly higher, whereas

the information flow would not be easily and readily transferable to other genomes[66].

8

Page 17: THE UNIVERSITY OF CHICAGO VISUAL REPRESENTATION …

CHAPTER 3

METHODS

3.1 From Genes to Images

PATRIC (the Pathosystems Resource Integration Center) is a bacterial bioinformatics re-

source center that we are part of (https://www.patricbrc.org)[94]. It provides researchers

with the tools necessary to analyze their private data and to compare it with public data.

PATRIC recently surpassed the 200,000 publicly sequenced genomes mark, ensuring that

enough genomes are available for effective comparative genomics studies. PATRIC provides

a compare region viewer service, where a query genome is aligned against a set of other

related genomes anchored at a specific focus gene. The service finds other genes that are of

the same family as the query gene and then aligns their flanking regions accordingly. Such

graphical representations are appealing because they help users visualize the genomic areas

of interest. In the resulting plots, genomic islands should appear as gaps in alignment as

opposed to conserved regions. We replicated the service by implementing an offline version

forked from the production user interface, which is necessary for computational efficiency

and for consistency in the face of any future interface changes.

Figure 2.1 shows sample visualizations of different genomic fragments belonging to the

two classes. Each row represents a region in a genome, with the query genome being the top

row. Each arrow represents a single gene, capturing its size and strand directionality. Colors

represent functionality. The red arrow represents the query gene, at which the alignment

with the rest of the genomes is anchored. The remaining genes share the same color if they

belong to the same family; they are colored black if they are not found in the query genome’s

focus region. Some colors are reserved for key genes: green for mobility genes, yellow for

tRNA genes, and blue for phage related genes.

Figures 2.1a and 2.1b are examples of a query genome with a non-conserved neighbor-

hood. The focus peg lacks alignments in general or is being aligned with genes from other

9

Page 18: THE UNIVERSITY OF CHICAGO VISUAL REPRESENTATION …

genomes with different neighborhoods, containing genes with different functionalities from

those in the query genome (functionality is color coded). In contrast, Figures 2.1c and 2.1d

show more conserved regions, which are what we expect to see in the absence of GIs.

Figure 3.1: Examples of images generated from the compare region viewer. Each directedarrow represents a gene color coded to match its functionality. The first row is the genomeneighborhood of the focus gene (red), and the subsequent rows represent anchored regionsfrom similar genomes sorted by their phylogenetic distances to the query genome.

10

Page 19: THE UNIVERSITY OF CHICAGO VISUAL REPRESENTATION …

3.2 Why Images?

Representing genomic fragments as images makes it easier to leverage the powerful machine

learning (ML) technologies that have become the state of art in solving computer vision

problems. Algorithms based on deep neural networks have proven to be superior in competi-

tions such as the ImageNet Large Scale Visual Recognition Challenge (ILSVRS) [95]. Deep

learning is the process of training neural networks with many hidden layers. The depth of

these networks allows them to learn more complex patterns and higher-order relationships,

at the cost of being more computationally expensive and requiring more data to work effec-

tively. Improvements in such algorithms have been translated to improvements in a variety

of domains reliant on computer vision tasks [96].

The images generated by PATRIC capture many of the most important GI features

mentioned earlier—the sporadic distribution of islands, the protein length, functionality,

and strand directionality—using color-coded arrows of various sizes.

For operons, the images generated are almost the same as the ones we generated for

genomic slands. The notable differences are the region size, that we reduced to 5(kilo base

pairs)KBp since the immediate vicinity is more relevant when dealing with operons. The

arrows and the distance between them are scaled on each row to reflect the actual length and

distance between the genes. Colors still represent functionality. The blue arrows are reserved

to represent the query gene pair. The rest of the genes are still color-coded by their family

membership. The families used in the coloring process are PATRIC’s Global Pattyfams that

are generated by mapping signature k-mers to protein functionality, using non-redundant

protein databases built per genus before being transferred across genera[97]. In addition, we

highlight the inter-genetic distance between the query gene pair by filling the gap as a red

rectangle, and removed the arrows from all rows except the first row representing the query

genome, since that is the only directionality we care about. This way, the generated images

capture most of the prominent operon features mentioned earlier, such as gene conservation,

functionality, strand direction, size, and inter-genetic distance. Figure 3.1 shows an example

11

Page 20: THE UNIVERSITY OF CHICAGO VISUAL REPRESENTATION …

of how the generated images look like.

Figure 3.2: Example of an image generated by our service to be fed as input to the neuralnetwork. Each arrow represents a single gene. Each row captures the area of interest in agenome. The query genome is the top row. The rest of the rows are genomes selected byevolutionary distance. The query gene pair are color-coded with blue, and the inter-geneticdistance between them is colored red. The rest of the genes share the same color if theybelong to the same family.

3.3 Deep Transfer Learning

Training deep models over a limited dataset puts the model at the risk of overfitting. One

way around this problem is using a technique referred to as transfer learning[98]. In transfer

learning, a model does not have to be trained from scratch. Instead, the idea is to retrain

a model that has been previously trained on a related task. The newly retrained model

should then be able to transfer its existing knowledge and apply it to the new task. This

approach affords the ability to reuse models that have been trained on huge amounts of data,

while adding the necessary adjustments to make them available to work with more limited

datasets, adding a further advantage to our approach of representing the data visually.

In our approach (Shutter Island), we use Google’s Inception V3 architecture that has

12

Page 21: THE UNIVERSITY OF CHICAGO VISUAL REPRESENTATION …

been previously trained on ImageNet. The Inception V3 architecture is a 48-layer-deep

convolutional neural network [96]. Training such a deep network on a limited dataset such

as the one available for GIs is unlikely to produce good results. ImageNet is a database that

contains more than a million images belonging to more than a thousands categories [95].

The ImageNet project runs the ILSVRC dataset annually. The Inception V3 model reaches

a 3.5% top-5 error rate on the 2012 ILSVRC dataset, where the winning model that year

had a 15.3% error rate. Thus, a model that was previously trained on ImageNet is already

good at feature extraction and visual recognition. To make the model compatible with the

new task, the top layer of the network is retrained on our GI dataset, while the rest of the

network is left intact, a strategy that is more powerful than starting with a deep network

with random weights.

For Operon Hunter, in addition to the mentioned changes in the generated images,

some changes were also introduced on the training part. Namely, we switched to using

the FastAI[99] framework and tested the performance of the different available pre-trained

models on our dataset. We observed the best performance when using the ResNet18 model,

whose architecture is displayed in Figure 3.2. We also found better results when we changed

the mode of transfer learning. Specifically, instead of only re-training the last layer of the

network while freezing the weights in the previous layers, we noticed better results when the

previous layers’ weights were unfrozen.

In addition to using powerful technologies and extensive data, this approach may add an

extra advantage over whole-genome alignment methods because performing the alignment

over each gene may provide a higher local resolution and aid in resisting evolutionary effects

such as recombination and others that may have happened after the integration and that

usually affect GI detection efforts.

13

Page 22: THE UNIVERSITY OF CHICAGO VISUAL REPRESENTATION …

Figure 3.3: Architecture of the ResNet18 Neural Network used to predict operons.

14

Page 23: THE UNIVERSITY OF CHICAGO VISUAL REPRESENTATION …

CHAPTER 4

SUMMARY OF WORK

So far, we successfully mapped two problems from biology to qualify as computer vision

problems. To do that, we implemented tools that represent genomic fragments as images,

with genes as arrows and different rows capturing genome alignments. We successfully re-

trained a well known computer vision deep learning neural network on the resulting image

datasets related to these problems, and demonstrated the superiority of this approach over

the existing state of the art methods that rely on tabular data. To do that, we successfully

curated the appropriate datasets and devised extensive validation methods to perform a uni-

form comparison across other tools, which was needed especially given the existing seemingly

random nature of validation in the literature. We are currently exploring the benefits of this

image-based approach when compared to tabular-based ones, especially in situations where

the former is counter-intuitive and hope to come up with a conjecture of what may be the

reason for the superior performance.

15

Page 24: THE UNIVERSITY OF CHICAGO VISUAL REPRESENTATION …

REFERENCES

[1] Langille, M., Hsiao, W., Brinkman, F. (2010) Detecting genomic islands using bioinfor-matics approaches. Nature Reviews Microbiology, 8(5), 373–382.

[2] Hacker, J., et al. (1990) Deletions of chromosomal regions coding for fimbriae andhemolysins occur in vitro and in vivo in various extraintestinal Escherichia coli isolates.Microb. Pathog, 8, 213—225.

[3] Hudson, C., Lau, B., Williams, K. (2014) Islander: a database of precisely mappedgenomic islands in tRNA and tmRNA genes. Nucleic Acids Research, 43(D1), D48–D53.

[4] Barondess, J.J., Beckwith, J. (1990) A bacterial virulence determinant encoded by lyso-genic coliphage lambda. Nature, 346, 871—874.

[5] Dobrindt, U., Hochhut, B., Hentschel, U., Hacker, J. (2004) Genomic islands inpathogenic and environmental microorganisms. Nat. Rev. Microbiol., 2, 414—424.

[6] Lu, B., Leong, H. (2016) Computational methods for predicting genomic islands in mi-crobial genomes. Computational And Structural Biotechnology Journal, 14, 200–206.

[7] Juhas, M., van der Meer, J.R., Gaillard, M., Hood, D.W., et al. (2009) Genomic islands:tools of bacterial horizontal gene transfer and evolution. FEMS Microbiol. Rev., 33,376—3793.

[8] Akhter, S., Aziz, R., Edwards, R. (2012) PhiSpy: a novel algorithm for finding prophagesin bacterial genomes that combines similarity- and composition-based strategies. NucleicAcids Research, 40(16), e126–e126.

[9] Fogg, P., Colloms, S., Rosser, S., Stark, M., Smith, M. (2014) New applications for phageintegrases. Journal of Molecular Biology, 426(15), 2703–2716.

[10] Hambly, E., Suttle, C.A.. (2005) The viriosphere, diversity, and genetic exchange withinphage communities. Curr. Opin. Microbiol., 8, 444-–50.

[11] Hacker, J., Kaper, J.B. (2000) Pathogenicity islands and the evolution of microbes.Annu. Rev. Microbiol., 54, 641—679.

[12] Liu, H., Zhu, J. (2010) Analysis of the 3’ ends of tRNA as the cause of insertion sites offoreign DNA in Prochlorococcus. Journal Of Zhejiang University SCIENCE B, 11(9),708–718.

[13] Nelson, K.E., et al. (1999) Evidence for lateral gene transfer between archaea and bac-teria from genome sequence of Thermotoga maritime. Nature, 399(6734), 323-329.

[14] Garcia-Vallve, V.S., Romeu, A., Palau, J. (2000) Horizontal gene transfer in bacterialand archaeal complete genomes. Genome Res., 10(11), 1719–1725.

16

Page 25: THE UNIVERSITY OF CHICAGO VISUAL REPRESENTATION …

[15] Koonin, E.V., Makarova, K.S., Aravind, L. (2001) Horizontal gene transfer in prokary-otes: quantification and classification. Annu. Rev. Microbiol., 55(1), 709-742.

[16] Nakamura, Y., Itoh, T., Matsuda, H., Gojobori, T., (2004) Biased biological functionsof horizontally transferred genes in prokaryotic genomes. Nat. Genet., 36(7), 760–766.

[17] Choi, I.G., Kim, S.H., (2007) Global extent of horizontal gene transfer. PNAS, 104(11),4489–4494.

[18] Casjens, S. (2003) Prophages and bacterial genomics: what have we learned so far?Mol. Microbiol., 49, 277—300.

[19] Casjens, S., Palmer, N., van Vugt, R., Huang, W.M., Stevenson,B., Rosa, P., Lathigra,R., Sutton, G., Peterson, J., Dodson, R.J. et al. (2000) A bacterial genome in Fux: thetwelve linear and nine circular extrachromosomal DNAs in an infectious isolate of theLyme disease spirochaete Borrelia burgdorferi. Mol. Microbiol., 35, 490-–516.

[20] Canchaya, C., Proux, C., Fournous, G., Bruttin, A., Brussow, H. (2003) Prophagegenomics. Microbiol. Mol. Biol. Rev., 67, 238-–276.

[21] Arndt, D., Grant, J., Marcu, A., Sajed, T., Pon, A., Liang, Y., & Wishart, D. (2016)PHASTER: a better, faster version of the PHAST phage search tool. Nucleic AcidsResearch, 44(W1), W16–W21.

[22] Zhou, Y., Liang, Y., Lynch, K., Dennis, J., & Wishart, D. (2011) PHAST: a fast phagesearch tool. Nucleic Acids Research, 39, W347–W352.

[23] Coates, A.R., Hu, Y. (2007) Novel approaches to developing new antibiotics for bacterialinfections. Br. J. Pharmacol., 152, 1147—1154.

[24] Bar, H., Yacoby, I., Benhar,I. (2008) Killing cancer cells by targeted drug-carrying phagenanomedicines. BMC Biotechnol., 8, 37.

[25] Hacker, J., Blum-Oehler, G., Muhldorfer, I., Tschape, H. (1997) Pathogenicity islands ofvirulent bacteria: structure, function and impact on microbial evolution. Mol. Microbiol.,23, 1089—1097.

[26] Schmidt H, Hensel M. (2004) Pathogenicity Islands in bacterial pathogenesis. Clin.Mcrobiolog. Rev., 17, 14—56.

[27] Ho Sui, S.J,, Fedynak, A., Hsiao, W.W.L., Langille, M.G.I., Brinkman, F.S.L. (2009)The association of virulence factors with genomic islands. PLoS One, 4, e8094.

[28] Moriel, D.G., Bertoldi, I., Spagnuolo, A., Marchi, S., Rosini, R., et al. (2010) Identifi-cation of protective and broadly conserved vaccine antigens from the genome of extrain-testinal pathogenic Escherichia coli. Proc. Natl. Acad. Sci. U S A, 107, 9072-–9077.

[29] Fouts, D.E. (2004) Bacteriophage bioinformatics. In Fraser, C.M., Read, T.D., Nelson,K.E. (eds), Microbial Genomes. Humana Press Inc., Totowa, NJ, pp. 71—91.

17

Page 26: THE UNIVERSITY OF CHICAGO VISUAL REPRESENTATION …

[30] Nicolas, P., Bize, L., Muri, F., Hoebeke, M., Rodolphe,F ., Ehrlich, S.D., Prum, B.,Bessieres,P. (2002) Mining Bacillus subtilis chromosome heterogeneities using hiddenMarkov models. Nucleic Acids Research, 30, 1418—1426.

[31] Srividhya, K., Alaguraj, V., Poornima, G., Kumar, D., Singh, G.P., Raghavenderan, L.,Katta, M., Mehta, P., Krishnaswamy, S. (2007) Identification of prophages in bacterialgenomes by dinucleotide relative abundance difference. PLoS One, 2, e1193.

[32] Gogarten, J.P., Townsend, J.P.. (2005) Horizontal gene transfer, genome innovation andevolution. Nat. Rev. Microbiol., 3, 679-–687.

[33] Soucy, S.M., Huang, J., Gogarten, J.P. (2015) Horizontal gene transfer: building theweb of life. Nat. Rev. Genet., 16, 472–482.

[34] Hacker, J., Bender, L., Ott, M., Wingender, J., Lund, B., et al. (1990) Deletions ofchromosomal regions coding for fimbriae and hemolysins occur in vitro and in vivo invarious extra intestinal Escherichia coli isolates. Microb. Pathog., 8, 213-–225.

[35] Vernikos, G. S., Parkhill, J. (2008) Resolving the structural features of genomic islands:a machine learning approach. Genome Res., 18, 331—342.

[36] Greub, G., Collyn, F., Guy, L., Roten, C.A. (2004) A genomic island present along thebacterial chromosome of the Parachlamydiaceae UWE25, an obligate amoebal endosym-biont, encodes a potentially functional F-like conjugative DNA transfer system. BMCMicrobiology, 4(1), 48

[37] Lawrence, J.G., Ochman, H. (1998) Molecular archaeology of the Escherichia coligenome. Proceedings of the National Academy of Sciences, 95(16), 9413—9417.

[38] Vernikos, G. S., Parkhill, J. (2006) Interpolated variable order motifs for identificationof horizontally acquired DNA: revisiting the Salmonella pathogenicity islands. Bioinfor-matics, 22, 2196—2203.

[39] Karlin, S., Mrazek, J. & Campbell, A. M. (1998) Codon usages in different gene classesof the Escherichia coli genome. Mol. Microbiol., 29, 1341—1355.

[40] Karlin, S. (2001) Detecting anomalous gene clusters and pathogenicity islands in diversebacterial genomes. Trends Microbiol., 9, 335—343.

[41] Sandberg, R., et al. (2001) Capturing whole-genome characteristics in short sequencesusing a naive Bayesian classifier. Genome Res., 11, 1404—1409.

[42] Tsirigos, A., Rigoutsos, I. (2005) A new computational method for the detection ofhorizontal gene transfer events. Nucleic Acids Research, 33, 922—933.

[43] Hatfull, G.F., Jacobs-Sera, D., Lawrence, J.G., Pope, W.H., Russell, D.A., Ko, C.C.,Weber, R.J., Patel, M.C., Germane, K.L., Edgar, R.H., et al. (2010) Comparative ge-nomic analysis of 60 mycobacteriophage genomes: genome clustering, gene acquisition,and gene size. J. Mol. Biol., 397, 119—143.

18

Page 27: THE UNIVERSITY OF CHICAGO VISUAL REPRESENTATION …

[44] Langille, M.G., Hsiao, W.W., Brinkman, FS. (2008) Evaluation .of genomic island pre-dictors using a comparative genomics approach. BMC Bioinformatics, 9, 329.

[45] Williams, K.P. (2002) Integration sites for genetic elements in prokaryotic tRNA andtmRNA genes: sublocation preference of integrase subfamilies. Nucleic Acids Res., 30,866-–875.

[46] Reiter, W.D., Palm, P., Yeats, S. (1989) Transfer RNA genes frequently serve as inte-gration sites for prokaryotic genetic elements. Nucleic Acids Research, 17, 1907-–1914.

[47] Bellanger, X., Payot, S., Leblond-Bourget, N., Guedon, G. (2014) Conjugative andmobilizable genomic islands in bacteria: evolution and diversity. FEMS Microbiol. Rev.,38, 720—760.

[48] Srividhya, K.V., Rao, G.V., Raghavenderan, L., Mehta, P., Prilusky,J., Manicka,S.,Sussman, J.L., Krishnaswamy, S. (2006) Database and comparative identification ofprophages. In: Huang,D-S, Li,K and Irwin,GW (eds). Intelligent Control and Automa-tion, Lecture Notes in Control and Information Sciences. Springer, Berlin, Vol. 344, pp.863—868.

[49] Ester, M., Kriegel, H., Sander, J., Xu, X. (1996) A density-based algorithm for discov-ering clusters in large spatial databases with noise. In: KDD-1996 Proceedings AAAIPress, Menlo Park, pp. 226—231.

[50] Hsiao, W., Wan, I., Jones, S.J., et al. (2003) IslandPath: aiding detection of genomicislands in prokaryotes. Bioinformatics, 19(3), b418—420.

[51] Waack, S., Keller, O. Asper, R., et al. (2006) Score-based prediction of genomic islandsin prokaryotic genomes using hidden Markov models. BMC Bioinformatics, 7, 142.

[52] Tu, Q., Ding, D. (2003) Detecting pathogenicity islands and anomalous gene clustersby iterative discriminant analysis. FEMS Microbiol. Lett., 221, 269—275.

[53] Fouts, D. (2006) Phage Finder: automated identification and classification of prophageregions in complete bacterial genome sequences. Nucleic Acids Res., 34, 5839-–5851.

[54] Langille, M.G., Brinkman, F. IslandViewer: an integrated interface for computationalidentification and visualization of genomic islands. Bioinformatics, 25, 664-–665.

[55] Nelson, K.E., Weinel, C., Paulsen, I.T., Dodson, R.J., Hilbert, H., Martins dos San-tos, V.A., Fouts, D.E., Gill, S.R., Pop, M., Holmes, M. et al. (2002) Complete genomesequence and comparative analysis of the metabolically versatile Pseudomonas putidaKT2440. Environ. Microbiol., 4, 799—808.

[56] Lawrence, J. G., Ochman, H. (1997) Amelioration of bacterial genomes: rates of changeand exchange. J. Mol. Evol., 44, 383—397.

[57] Zhang, R., Zhang, C.T. (2004) A systematic method to identify genomic islands andits applications in analyzing the genomes of Corynebacterium glutamicum and Vibriovulnificus CMCP6 chromosome I. Bioinformatics, 20(5), 612—622.

19

Page 28: THE UNIVERSITY OF CHICAGO VISUAL REPRESENTATION …

[58] Romero,P.R. and Karp,P.D. (2004) Using functional and organizational informationto improve genome-wide computational prediction of transcription units on pathway-genome databases. Bioinformatics, 20, 709–717.

[59] Mao F, Dam P, Chou J, et al (2009) . DOOR: a database for prokaryotic operons.Nucleic Acids Res 2009;37:459–463.

[60] Haller R, Kennedy M, Arnold N, Rutherford R (2010). The transcriptome of Mycobac-terium tuberculosis. Appl Microbiol Biotechnol. 2010; 86:1–9.

[61] Pelly, S., Winglee, K., Xia, F., Stevens, R. L., Bishai, W. R., & Lamichhane, G. (2016).REMap: operon map of M. tuberculosis based on RNA sequence data. Tuberculosis, 99,70-80.

[62] Taboada, B., Estrada, K., Ciria, R., & Merino, E. (2018). Operon-mapper: a web serverfor precise operon identification in bacterial and archaeal genomes. Bioinformatics, 34(23), 4118-4120.

[63] Price, M. N., Huang, K. H., Alm, E. J., & Arkin, A. P. (2005). A novel method foraccurate operon predictions in all sequenced prokaryotes. Nucleic acids research, 33 (3),880-892.

[64] Zaidi, S. S. A., & Zhang, X. (2016). Computational operon prediction in whole-genomesand metagenomes. Briefings in functional genomics, 16 (4), 181-193.

[65] Taboada, B., Ciria, R., Martinez-Guerrero, C. E., & Merino, E. (2011). ProOpDB:Prokaryotic Operon Data Base. Nucleic acids research, 40 (D1), D627-D631.

[66] Taboada, B., Verde, C., & Merino, E. (2010). High accuracy operon prediction methodbased on STRING database scores. Nucleic acids research, 38 (12), e130-e130.

[67] Bergman, N. H., Passalacqua, K. D., Hanna, P. C., & Qin, Z. S. (2007). Operon predic-tion for sequenced bacterial genomes without experimental information. Appl. Environ.Microbiol., 73 (3), 846-854.

[68] Fortino, V., Smolander, O. P., Auvinen, P., Tagliaferri, R., & Greco, D. (2014). Tran-scriptome dynamics-based operon prediction in prokaryotes. BMC bioinformatics, 15 (1),145.

[69] Fran B, Perrin D, Monod J, et al (1960). The operon : a group of genes whose expressionis coordinated by an operator. J Bacteriol;29:1727–9.

[70] Okuda S, Kawashima S, Kobayashi K, et al (2007). Characterization of relationshipsbetween transcriptional units and operon structures in Bacillus subtilis and Escherichiacoli. BMC Genomics;8:48.

[71] Hodgman TC (2000). A historical perspective on gene/protein functional assignment.Bioinformatics;16:10–5.

20

Page 29: THE UNIVERSITY OF CHICAGO VISUAL REPRESENTATION …

[72] Joon M, Bhatia S, Pasricha R, et al (2010). Functional analysis of an intergenic non-coding sequence within mce1 operon of M. tu- berculosis. BMC Microbiol;10:128.

[73] Wang S, Wang Y, Liang Y, et al (2007). A multi-approaches-guided genetic algorithmwith application to operon prediction. Artif Intell Med;41:151–9.

[74] Pantosti A, Sanchini A, Monaco M (2007). Mechanisms of antibiotic resistance inStaphylococcus aureus. Future Microbiol;2: 323–34.

[75] Yada,T., Nakao,M., Totoki,Y. and Nakai,K. (1999) Modeling and predicting transcrip-tional units of Escherichia coli genes using hidden Markov models. Bioinformatics, 15:987–993.

[76] Craven,M., Page,D., Shavlik,J., Bockhorst,J. and Glasner,J. (2000) A probabilisticlearning approach to whole-genome operon pre- diction. Proc. Conf. Intell. Syst. Mol.Biol., 8: 116–127.

[77] Tjaden,B., Haynor,D.R., Stolyar,S., Rosenow,C. and Kolker,E. (2002) Identifying oper-ons and untranslated regions of transcripts using Escherichia coli RNA expression anal-ysis. Bioinformatics, 18 (Suppl. 1): S337–S344.

[78] Ermolaeva,M.D., White,O. and Salzberg,S.L. (2001) Predic- tion of operons in microbialgenomes. Nucleic Acids Res., 29: 1216–1221.

[79] Zheng,Y., Szustakowski,J.D., Fortnow,L., Roberts,R.J. and Kasif,S. (2002) Computa-tional identification of operons in microbial genomes. Genome Res., 12: 1221–1230.

[80] Ogata,H., Fujibuchi,W., Goto,S. and Kanehisa,M. (2000) A heur- istic graph comparisonalgorithm and its application to detect functionally related enzyme clusters. Nucleic AcidsRes., 28: 4021–4028.

[81] Chen,X., Su,Z., Xu,Y. and Jiang,T. (2004) Computational prediction of operons inSynechococcus sp. WH8102. Genome Inform., 15, 211–222.

[82] Tran,T.T., Dam,P., Su,Z., Poole,F.L., Adams,M.W., Zhou,G.T. and Xu,Y. (2007)Operon prediction in Pyrococcus furiosus. Nucleic Acids Res., 35, 11–20.

[83] Zhang,G.Q., Cao,Z.W., Luo,Q.M., Cai,Y.D. and Li,Y.X. (2006) Operon predictionbased on SVM. Comput. Biol. Chem., 30, 233–240.

[84] Dam,P., Olman,V., Harris,K., Su,Z. and Xu,Y. (2007) Operon prediction using bothgenome-specific and general genomic information. Nucleic Acids Res., 35, 288–298.

[85] Bockhorst,J., Craven,M., Page,D., Shavlik,J. and Glasner,J. (2003) A Bayesian networkapproach to operon prediction. Bioinformatics, 19, 1227–1235.

[86] Edwards,M.T., Rison,S.C., Stoker,N.G. and Wernisch,L. (2005) A universally applica-ble method of operon map prediction on minimally annotated genomes using conservedgenomic context. Nucleic Acids Res., 33, 3253–3262.

21

Page 30: THE UNIVERSITY OF CHICAGO VISUAL REPRESENTATION …

[87] Westover,B.P., Buhler,J.D., Sonnenburg,J.L. and Gordon,J.I. (2005) Operon predictionwithout a training set. Bioinformatics, 21, 880–888.

[88] Jacob,E., Sasikumar,R. and Nair,K.N. (2005) A fuzzy guided genetic algorithm foroperon prediction. Bioinformatics, 21, 1403–1407.

[89] Salgado,H., Moreno-Hagelsieb,G., Smith,T.F. and Collado- Vides,J. (2000) Operonsin Escherichia coli: genomic analyses and predictions. Proc. Natl Acad. Sci. USA, 97,6652–6657.

[90] Okuda,S., Kawashima,S., Kobayashi,K., Ogasawara,N., Kanehisa,M. and Goto,S.(2007) Characterization of relationships between transcriptional units and operon struc-tures in Bacillus subtilis and Escherichia coli. BMC Genomics, 8, 48.

[91] Yan,Y. and Moult,J. (2006) Detection of operons. Proteins, 64, 615–628.

[92] Overbeek, R., M. Fonstein, M. D’Souza, G. D. Pusch, and N. Maltsev. (1999). The useof gene clusters to infer functional coupling. Proc. Natl. Acad. Sci. USA 96:2896–2901.

[93] Brouwer RWW, Kuipers OP, Van Hijum SAFT (2008). The relative value of operonpredictions. Brief Bioinform:367–75.

[94] Wattam, A.R, ZDavis, J.J., Assaf, R., Boisvert, S., Bun, T., Conrad, N., Dietrich, E.M.,Disz, T., Gabbard, J.L., Gerdes, S., Henry, C.S., Kenyon, R.W., Machi, D., Mao, C.,Nordberg, E.K., Olsen, G.J., Murphy-Olson, D.E., Olson, R., Overbeek, R., Parrello,B., Pusch, G.D., Shukla, M., Vonstein, V., Warren, A., Xia, F., Yoo, H., Stevens, R.L.(2017) Improvements to PATRIC, the all-bacterial Bioinformatics Database and AnalysisResource Center. Nucleic Acids Research 45(D1), D535–D542.

[95] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy,A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.. (2015) ImageNet large scale visualrecognition challenge. IJCV.

[96] Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z. (2016) Rethinking theInception Architecture for Computer Vision. The IEEE Conference on Computer Visionand Pattern Recognition (CVPR), 2818–2826.

[97] Davis JJ, Gerdes S, Olsen GJ, Olson R, Pusch GD, Shukla M, Vonstein V, WattamAR and Yoo H (2016) PATtyFams: Protein Families for the Microbial Genomes in thePATRIC Database. Front. Microbiol. 7:118. doi: 10.3389/fmicb.2016.00118

[98] How to Retrain an Image Classifier for New Categories - TensorFlow Hub — TensorFlow.(2018) Retrieved from https://www.tensorflow.org/hub/tutorials/image retraining

[99] FastAI — FastAI. (2018). Retrieved from https://docs.fast.ai/index.html

22