
IDENTIFICATION AND CHARACTERIZATION OF NON-HUMAN ENTITIES IN CANCER SAMPLES

Student: José Alejandro Romero Herrera

MASTER IN BIOINFORMATICS AND COMPUTATIONAL BIOLOGY

ESCUELA NACIONAL DE SALUD - INSTITUTO DE SALUD CARLOS III

2014-2015

Center for Biological Sequence Analysis (CBS), Lyngby, Denmark

Supervisor: Assoc. Prof. José MG Izarzugaza, PhD

DATE: 02/2016

Index

Abstract
Objectives
Introduction
Cancer and Pathogen project and the importance of viruses in cancer onset
Virus contamination in cancer samples
Detection of P. acnes in patient samples
Sample preparation and pre-processing
Computerome: High Performance Computing at DTU
Bioinformatics pipeline
Materials and methods
Samples and reads
Datasets of reference genomes
Alignment and mapping
Circos plots
Script modification
Results
Alignment and mapping with BWA
Viruses
Propionibacterium acnes
Circos plots
Viral contamination
Discussion
Viral cancer drivers
Virus contamination
Propionibacterium acnes
Utility of Circos plots
Differences between BLASTn and BWA alignment results in the virus genomes
Lack of negative controls
Conclusion
Acknowledgements
Annexes. Supplemental Material
References

Abstract

It has been estimated that almost one fifth of cancer cases are caused by viral and microbial entities, but only a few viruses are known to trigger cancer onset. However, many lymphoproliferative diseases in mammals are caused by retroviruses, so it is very likely that many viruses related to cancer remain unknown. High-throughput sequencing techniques allow researchers to detect these organisms even when they constitute a minor fraction of the nucleic acid content of the infected cell. Unfortunately, the high sensitivity of current methodologies means that they detect not only the main entities present in the sample but also minority species, which are sometimes spurious contaminants of the sample or of the laboratory reagents used. Therefore, caution should be exercised when establishing causative associations with disease. For example, Propionibacterium acnes (P. acnes) is a bacterium that has been associated with several medical conditions, but since it is a known contaminant of surgical wounds and tools, its role in these diseases is highly controversial. Identifying the organisms that contaminate biological samples would therefore ease this type of research. This Master Thesis focuses on the taxonomic characterization of the species present in 900 cancer specimens collected in different Danish hospitals. Our results show the presence of several oncoviruses as well as other viral contaminants in these cancer samples. We also confirm the presence of P. acnes among these samples and discuss its possible association with either cancer or contamination.

Objectives

- Improve the understanding of common bioinformatics tools, in particular those used in the analysis of biological sequences and next-generation sequencing (NGS) data.

- Identify possible oncoviruses in NGS data from cancer patient samples.

- Identify recurrent contaminants of viral origin in those samples.

- Study the implication of P. acnes in cancer onset and evaluate its suggested role as a contaminant.

Introduction

Cancer and Pathogen project and the importance of viruses in cancer onset

It has been estimated that viral and microbial infections play an essential role in cancer development in almost one fifth of cancer cases (18.6%) [1]. Important viral entities among these infections include human papillomaviruses (HPV) [2], Epstein-Barr virus (EBV), and hepatitis B and C viruses (HBV and HCV, respectively). Retroviruses are also involved in cancer onset, although their relation to this disease is often still debated [3]. Nevertheless, several animal lymphoproliferative diseases, such as leukemia or lymphoma, are caused by retroviruses [4]. Thus, it is very likely that there are unknown viruses related to human cancer and to lymphoproliferative diseases. Since only a few prophylactic vaccines are available so far, the detection of such pathogens is both challenging and necessary. The Cancer and Pathogen project was designed to identify unknown viruses and microorganisms in cancer samples and to recognize those associated with cancer onset. This project is part of the GenomeDenmark platform and uses DNA and RNA purification techniques, as well as sequencing and several bioinformatic approaches. The present Master Thesis was developed within the Cancer and Pathogen project and focuses on the identification and characterization of entities in cancer specimens, paying special attention to the role of viral species. In addition, I also explore the possible association of Propionibacterium acnes (P. acnes) with cancer development. In this way, new diagnostic methods and anti-cancer vaccines could be developed from the selected microorganisms, demonstrating that genomics is a technology that might achieve this goal relatively fast.

There are, however, several obstacles in viral discovery studies like ours. First, the proportion of host-derived genetic material (human in our case) in a cancer sample is usually many times larger than that of viral nucleic acids. Viral genomes are relatively small and hence constitute a minor fraction of the genetic material of the infected host cell. In addition, the infected cells may be only a fraction of the whole sample, and they may contain a low number of viral genome copies.

Second, extensive variation in virus sequences makes viral classification a challenge. For example, human immunodeficiency virus 1 (HIV-1) inter-subtype sequences vary by more than 35% in the env gene [5] and by more than 10% in the gag and pol genes [6]. This problem increases when additional species-specific genes contribute to sequence variation between virus species. This is especially relevant when retrieving genomic material from unknown species, since standard levels of PCR stringency would prevent sufficient amplification of variant viral nucleic acids.

Fortunately, highly specific methods are currently available to overcome these obstacles. High-throughput sequence-independent shotgun sequencing can be used to improve the sensitivity of detection of unknown viral sequences, since it allows comprehensive sampling of the whole set of sequences in the pool of organisms present in a given complex sample. This method enables the evaluation of genome diversity and provides a way to study microorganisms that are otherwise difficult or impossible to analyze in vitro [7]. In addition, lower PCR stringency levels can be applied when the amplification of variant sequences poses a challenge. The quantitative disproportion between host and viral genomic material can be addressed by mechanical and enzymatic procedures [8]. These procedures are based on DNase treatment followed by restriction enzyme digestion and sequence-independent single primer amplification (SISPA [9]) of the fragments. However, the DNase-SISPA method is not viable for studying integrated or episomal viral nucleic acids. Instead, target capture (target enrichment by hybridization [10]) can be used to enrich high-throughput sequencing libraries [11], and is a more adequate approach for our study.

Virus contamination in cancer samples

Ultimately, these methods provide the ability to recover almost every sequence contained in the sample. Distinguishing the entities that are actually related to cancer onset from those that are not is a challenging step. In fact, these methods are so sensitive that it is possible to detect in biological samples viral entities that are neither oncoviruses nor otherwise related to human cancer development. Several of these viruses should not be found in humans at all (e.g. Avian leukosis virus [12]). Contamination of samples is the most reasonable explanation for these findings. Laboratory reagents and materials, as well as human error, might be the cause of sample contamination. We agreed with our collaborators at the Center for GeoGenetics (CGG) on a list of 159 reference genomes (GI entries) that are known contaminants of samples, lab kits and reagents. In this study, the abundance of these genomes is analyzed and discussed.

Detection of P. acnes in patient samples

Propionibacterium acnes is a Gram-positive bacterium that is part of the healthy flora of the skin, oral cavity, large intestine, conjunctiva and external ear canal [13]. A role for P. acnes in the pathophysiological mechanisms of acne has already been proposed [14], and its genome is well characterized. One study identified several genes encoding skin-degrading enzymes, as well as proteins that may be immunogenic [15]. Furthermore, rapid growth of P. acnes can cause cellular damage and generate metabolic products, triggering inflammation and other medical symptoms [16].

In recent years, an increasing number of reports have implicated the bacterium as an opportunistic pathogen responsible for a wide range of conditions [13], such as cerebrospinal shunt infections, ocular infections, sarcoidosis and prostate cancer. Conversely, some studies postulate that P. acnes is a contaminant of surgical wounds, blood products and tissue cultures [13,17]. Therefore, the speculated role of P. acnes in prostate cancer [18], among other diseases, is still highly controversial.

Since this bacterium is suspected of causing cancer, we also investigated the presence of 12 P. acnes strains among the cancer samples.

Sample preparation and pre-processing

For the purpose of this study, we analyzed around 900 samples collected in different Danish hospitals. Specimens covered a wide range of cancer types: acute myeloid leukaemia, basal cell carcinoma, breast cancer, colon cancer, vulva cancer, etc. Samples and sequencing libraries were prepared by our partners at the Center for GeoGenetics (CGG). This preparation was performed using different techniques, such as virion or microbial enrichment and shotgun sequencing or target capture with specific probes (articles submitted). Thus, each sample could be analyzed with more than one technique, giving rise to more than 1300 sample-technique combinations, each of which generated an independent library.

Library sequencing was performed by our partners at the Copenhagen outstation of the Beijing Genomics Institute (BGI). Each of those libraries was sequenced independently, yielding a total of more than 4 × 10^9 reads. This amount of data could not be processed on an ordinary computer, so a supercomputer was used for the analysis.

Computerome: High Performance Computing at DTU

Computerome is the new Danish supercomputer for life sciences, which connects several universities and institutes around the world. It is composed of 16048 cores, with a total of 96 TB of RAM (DDR4). Jobs, consisting of commands or self-made scripts, are submitted to this Unix-based supercomputer. Each job may require a different configuration of RAM, processing units (nodes, CPUs) and software modules. A task manager supervises the system by controlling the execution of jobs and the allocation of shared resources, in order to prevent overuse and collapse of the supercomputer. Furthermore, the availability of resources allows analyses to be parallelized, which saves a substantial amount of time in a demanding project such as ours.

Bioinformatics pipeline

The bioinformatic analysis performed on the raw paired-end sequencing data is similar to the one performed by Lysholm et al. (2012) [19]. First, reads were pre-processed and filtered. Second, reads were depleted of human sequences. This step is necessary because we are only interested in the non-human fraction of the samples. Third, human-depleted reads were queried against the NCBI nucleotide database with a permissive BLASTn search to provide their taxonomic characterization. In this way, as many reads as possible were taxonomically assigned to an organism. The BLASTn hits obtained were analyzed, and different Genome Identifiers (GIs) were selected from them. Finally, reads were mapped against these GIs in order to study their abundance in the cancer samples.

The raw results obtained in this last step are huge and consist only of large text files classified by reference genome. Understanding this type of information directly would be complex and arduous, and manual inspection of the files would be very time-consuming. Parsing the data files into coverage information for the reference genomes is an easy way to inspect the presence of a given reference genome among the samples. Nevertheless, the coverage percentage of a reference genome does not show which of its regions are actually present in the samples. This type of information can be very important because it could reveal patterns among the samples. For example, if a region of a reference genome is persistently found in all samples, further analysis of that region could be worthwhile and might ultimately lead to new discoveries.

Thus, although the text data can be processed and summarized in the form of coverage percentages, this would still be a poor representation of all the retrieved information. Therefore, another computational approach is needed. Visualization tools, such as plots, are a more adequate way to display our data. Plots can show far more information and are easier to read. These plots can represent the coverage count (how many times a position of the genome is read) for each reference and for each sample.

The visualization tool chosen in this study is Circos [20]. Circos is a flexible software package that displays data in an attractive circular layout, making it ideal for exploring connections between objects or positions of a genome (Figure 1). In addition, Circos can be automated and readily incorporated into our pipeline, creating plots in PNG format. Since our data can be easily parsed and Circos offers many different options, additional data can be displayed. For example, it is interesting to see which positions of the reference genome correspond to coding DNA sequence (CDS) regions. This can help to identify patterns or reveal new information beyond the coverage of the reference sequence. For this purpose, we can use the feature tables provided for each reference by the NCBI database.

This way, the third step of this pipeline is the processing and visualization of the

mapped human-depleted fraction of the reads in the form of Circos plots. Finally, these

plots can be analyzed and conclusions can be made. A flow chart describing this

pipeline can be found in Figure 2.

Figure 1. Graphical uses of Circos plots.

Materials and methods

Samples and reads

Human cancer specimens were obtained from different Danish hospitals. Sequencing libraries were prepared by CGG and were each assigned a number, internally termed "snumber". A table containing all the snumbers and their respective samples can be found in the Supplemental Material. Several snumbers might represent the same sample if it gave rise to different sequencing libraries, i.e. if a different treatment was applied in the wet lab. Sequencing was performed at BGI on the Illumina HiSeq 2000 platform as 100 bp paired-end reads.

Figure 2. Flow chart that represents all steps of the study. (A) Sample preparation and read pre-processing performed at CGG and CBS, respectively. (B) Bioinformatics pipeline developed in this thesis.

Pre-processing of raw reads was performed at CBS. First, paired-end reads were pre-processed with AdapterRemoval to eliminate remaining sequencing adaptors and to filter out low-quality sequences. In addition, overlapping pairs were collapsed into a single read. Second, sequencing libraries were depleted of human sequences. Reads were mapped to version hg38 of the human genome using the BWA MEM alignment algorithm [21, 22], which is suited to both short reads and longer sequences of up to a few megabases. Reads showing similarity to the human reference were discarded, and the remaining reads progressed to further analysis. If only one read of a pair belonged to the non-human fraction, it proceeded as a singleton. Finally, low-complexity regions were filtered out using the DustMasker algorithm. All read files were in FASTQ format.
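The rule for pairs and singletons described above can be sketched in Python. This is only an illustration and not the code used in the thesis: the function name `split_non_human` and the input layout (pairs of read identifiers plus a set of identifiers that aligned to the human reference) are assumptions made for the example.

```python
def split_non_human(pairs, human_hits):
    """Split read pairs into non-human pairs and singletons.

    pairs: iterable of (mate1_id, mate2_id) tuples (hypothetical layout).
    human_hits: set of read ids that aligned to the human reference.
    """
    kept_pairs, singletons = [], []
    for mate1, mate2 in pairs:
        mate1_human = mate1 in human_hits
        mate2_human = mate2 in human_hits
        if not mate1_human and not mate2_human:
            # Both mates are non-human: the pair is kept intact.
            kept_pairs.append((mate1, mate2))
        elif mate1_human and mate2_human:
            # Both mates look human: the whole pair is discarded.
            continue
        else:
            # Exactly one mate is non-human: it proceeds as a singleton.
            singletons.append(mate2 if mate1_human else mate1)
    return kept_pairs, singletons
```

In practice the set of human reads would be derived from the alignment output rather than built by hand.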

Third, the human-depleted reads were queried against the NCBI non-redundant nucleotide database. BLASTn hits targeting the same organism were considered together, and the percentage of the organism's sequence covered by at least one BLASTn hit was computed. For a BLASTn hit to be considered, its similarity had to be greater than or equal to 20% and its e-value smaller than 0.001. Reads that could not be characterized were assigned to a "Dark Matter" group.
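The two thresholds above can be expressed as a small filter. The sketch below assumes, purely for illustration, that the BLASTn results are available in the standard 12-column tabular layout (query, subject, percent identity, ..., e-value, bit score) and that "similarity" corresponds to the percent identity column; the thesis does not state the output format that was actually used, and `filter_blast_hits` is a hypothetical name.

```python
def filter_blast_hits(lines, min_identity=20.0, max_evalue=1e-3):
    """Yield (query, subject, identity, evalue) for hits passing both thresholds.

    Assumes tab-separated rows with percent identity in column 3 and
    e-value in column 11 (an assumption, see the text above).
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        identity = float(fields[2])
        evalue = float(fields[10])
        # Identity must be >= 20% and e-value strictly below 0.001.
        if identity >= min_identity and evalue < max_evalue:
            yield query, subject, identity, evalue
```

Reads with no surviving hit would then fall into the "Dark Matter" group.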

Datasets of reference genomes.

The analysis of the BLASTn results suggested three different types of organisms, characterized by their Genome Identifiers (GIs): 159 GIs of known contaminants, 90 GIs of possible viral cancer drivers, and 12 GIs of Propionibacterium acnes complete genomes. The first two lists are composed of different types of viral sequences: full genomes, partial genomes, or just gene regions of a virus. Examples of viruses in the contamination list are Avian leukosis virus, Avian myeloblastosis-associated virus, Parvo-like hybrid virus and Acanthamoeba mimivirus. Variants of Avian leukosis virus are the most abundant entries in this list (56 GIs), followed by Parvo-like viruses and Propionibacterium phages (11 GIs each), and Avian myeloblastosis virus (7 GIs).

The cancer driver list, in turn, includes Human adenoviruses, Human herpesviruses and Human papillomaviruses, among others. All lists are provided in the Supplemental Material.

All reference genomes were named after their GI, so the terms "reference genome" and "GI" are used interchangeably. FASTA sequences and feature tables were downloaded from the NCBI database using the NCBI Batch Entrez website (http://www.ncbi.nlm.nih.gov/sites/batchentrez).

Alignment and mapping

For the virus genomes, not every read was aligned to every GI. In order to reduce the use of computing resources, a minimum cutoff of 200 base pairs was set. Only those read-GI combinations that met this requirement in the BLASTn taxonomic assignment of the reads were aligned.

Alignment of the reads to the reference genomes was performed using the Burrows-Wheeler Aligner (BWA) MEM algorithm. First, the FASTA sequences of the reference genomes were treated as databases and indexed using the BWA index command, where (1) is the reference FASTA file:

bwa index -a is (1)

The algorithm used for indexing was IS, the default algorithm, which works well with our small databases of GI sequences. All output files were classified first by their GI and then by their snumber. SAM files were generated as output and were converted into BAM files using SAMtools [23] "view" and "sort". The command used for this purpose was:

bwa mem -M -t 16 (1) (2) (3) | samtools view -Sbh -F 4 - | samtools sort - (4)

In this way, BAM files were created automatically. (1) is the reference genome database, (2) is the read FASTQ file, (3) is the mate FASTQ file (omitted for singletons), and (4) is the output name of the BAM file. In order to save disk space, only mapped reads were kept (the -F 4 option of samtools view discards unmapped reads).
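As an illustration of how a script might assemble the two commands above for each GI-snumber combination, the following Python sketch builds the command strings from the placeholders (1) to (4). The function names and the example file names are hypothetical; the flags are copied from the commands in the text. Note that the `samtools sort - (4)` form, in which (4) is an output prefix, corresponds to older SAMtools releases, and newer releases expect the output file to be given explicitly.

```python
import shlex


def bwa_index_cmd(reference_fasta):
    """Command that indexes a reference FASTA with the IS algorithm."""
    return "bwa index -a is " + shlex.quote(reference_fasta)


def bwa_mem_cmd(reference_fasta, reads_fq, out_prefix, mates_fq=None, threads=16):
    """Command that aligns reads and writes a sorted BAM of mapped reads only."""
    inputs = [reference_fasta, reads_fq]
    if mates_fq is not None:
        # The mate file is only passed for paired reads, not for singletons.
        inputs.append(mates_fq)
    quoted = " ".join(shlex.quote(path) for path in inputs)
    return (
        "bwa mem -M -t {threads} {inputs} "
        "| samtools view -Sbh -F 4 - "
        "| samtools sort - {out}"
    ).format(threads=threads, inputs=quoted, out=shlex.quote(out_prefix))
```

The resulting strings could be written to a job script or passed to a shell; how the jobs were actually submitted is not described here.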

BAM files were merged and piled up against the references with SAMtools "mpileup". The command used for this purpose was:

samtools mpileup -f (1) -B -Q0 -d 10000000 (n) > (2).pileup

As a result, a single text file in pileup format was created for each GI-snumber combination. (1) is the reference genome database, (n) stands for all read (BAM) files of a given snumber, and (2) is the output name of the pileup file.
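A pileup file can be reduced to per-position read depth and to a coverage percentage with a few lines of Python. The sketch below is a minimal example under the assumption that each pileup line holds the sequence name, the 1-based position and the reference base, followed by one (depth, bases, qualities) triplet per input BAM file; the function names are hypothetical and this is not the parser used in the thesis.

```python
def depth_from_pileup(lines):
    """Return {position: depth}, summing the depth columns of all input files."""
    depth = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip malformed or empty lines
        position = int(fields[1])
        # Depth columns sit at indices 3, 6, 9, ... (one per input BAM).
        depth[position] = sum(int(fields[i]) for i in range(3, len(fields), 3))
    return depth


def coverage_percent(depth, genome_length):
    """Percentage of reference positions covered by at least one read."""
    covered = sum(1 for d in depth.values() if d > 0)
    return 100.0 * covered / genome_length
```

Positions that are absent from the pileup are treated as uncovered, which is why the genome length is passed separately.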

Circos plots

Circos software uses configuration files in order to draw a plot. These configuration

files can contain different variables that will describe the plot configuration: plot size,

text size, colors, etc. The configuration files also point out paths to the parsed data files,

whose information will be displayed in the plot.
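To make the role of the configuration file more concrete, the sketch below generates a minimal Circos configuration as text. It is an illustrative layout only: the radii, track heights, colors and file names are assumptions, and the actual configuration written by the thesis script may differ. The function name `circos_conf` is hypothetical.

```python
def circos_conf(karyotype, highlights, accumulated, sample_files, pixels=3000):
    """Return the text of a minimal Circos configuration (illustrative layout)."""

    def histogram(path, color, outer):
        # One histogram track drawn between the radii outer - 0.07 and outer.
        return [
            "<plot>",
            "type = histogram",
            "file = " + path,
            "r1 = {:.2f}r".format(outer),
            "r0 = {:.2f}r".format(outer - 0.07),
            "fill_color = " + color,
            "</plot>",
        ]

    lines = [
        "karyotype = " + karyotype,
        "<ideogram>",
        "<spacing>",
        "default = 0.005r",
        "</spacing>",
        "radius = 0.90r",
        "thickness = 40p",
        "fill = yes",
        "</ideogram>",
        "<highlights>",
        "<highlight>",
        "file = " + highlights,
        "ideogram = yes",
        "</highlight>",
        "</highlights>",
        "<plots>",
    ]
    # Accumulated depth first (outermost track), then one track per snumber.
    lines += histogram(accumulated, "green", 0.85)
    for i, path in enumerate(sample_files, start=1):
        lines += histogram(path, "orange", 0.85 - 0.08 * i)
    lines += [
        "</plots>",
        "<image>",
        "<<include etc/image.conf>>",
        "radius* = {}p".format(pixels // 2),  # image side = 2 x radius
        "</image>",
        "<<include etc/colors_fonts_patterns.conf>>",
        "<<include etc/housekeeping.conf>>",
    ]
    return "\n".join(lines) + "\n"
```

With the track spacing chosen here, nine snumber tracks plus the accumulated track still fit inside the ideogram.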

A Python script was written to generate the Circos configuration files needed to draw the plots. The program saved all configuration and data files of a given GI in a new folder, which would also contain the images created afterwards. This script also parsed the feature tables and GI FASTA files into a form usable by Circos. All GI sequences and their features were drawn as a highlighted ideogram.
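The parsing of feature tables can be sketched as follows. The example assumes the tab-delimited NCBI feature table layout, in which an interval line carries start, end and a feature key, and qualifier lines start with three empty columns followed by the qualifier name and value. The function names are hypothetical, the example is not the thesis script, and only the CDS span and its product name are extracted.

```python
def parse_cds(feature_table_lines):
    """Return [(start, end, product)] for the CDS features of a feature table."""
    cds = []
    current = None  # CDS currently being read, stored as [start, end, product]
    for raw in feature_table_lines:
        line = raw.rstrip("\r\n")
        if not line.strip() or line.startswith(">"):
            continue  # skip blank lines and the ">Feature ..." header
        cols = line.split("\t")
        if cols[0]:
            # Interval line: start, end and, on the first interval, a feature key.
            start = int(cols[0].lstrip("<>"))
            end = int(cols[1].lstrip("<>"))
            low, high = min(start, end), max(start, end)  # minus strand: start > end
            if len(cols) > 2 and cols[2]:
                current = [low, high, ""] if cols[2] == "CDS" else None
                if current is not None:
                    cds.append(current)
            elif current is not None:
                # Additional interval of the same CDS: extend its span.
                current[0] = min(current[0], low)
                current[1] = max(current[1], high)
        elif current is not None and len(cols) >= 5 and cols[3] == "product":
            current[2] = cols[4]
    return [tuple(c) for c in cds]


def highlight_lines(gi, cds):
    """Format CDS spans as 'chromosome start end' lines for a highlight file."""
    return ["{} {} {}".format(gi, start, end) for start, end, _ in cds]
```

The product names returned by `parse_cds` could be written to a separate text track to label each CDS on the outer radius.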

For each GI, the script calculated the accumulated read depth (coverage) of all sample

numbers (snumber). It iterated through all the positions of the GI, adding all read count

values of all snumber reads in each position. Accumulated read depth from all samples

was displayed in every plot as a green histogram.
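The accumulation step amounts to a per-position sum over all snumbers. A minimal sketch, assuming the per-sample depths are held in dictionaries keyed by position (a hypothetical data layout chosen for the example):

```python
from collections import Counter


def accumulate_depth(per_sample_depth):
    """Sum read depth per position over all snumbers.

    per_sample_depth: {snumber: {position: depth}}
    """
    total = Counter()
    for depth in per_sample_depth.values():
        total.update(depth)  # Counter.update adds the counts position by position
    return dict(total)
```

The result is the data behind the green histogram, which is therefore identical in every plot of a given GI.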

Read depths at every position of the GI were shown as orange histograms, one per snumber, labeled with the snumber. A maximum of 9 snumbers were displayed per plot. In short, each plot contained a highlighted ideogram of a given GI, a green histogram of the accumulated depth from all snumbers mapped against that GI, and up to 9 orange snumber histograms. If reads from more than 9 snumbers mapped to a GI, the script produced additional plots, so more than one plot could be made for a single GI.
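Splitting the snumbers of a GI into plots of at most 9 histograms is a simple batching operation. A minimal sketch, with a hypothetical function name:

```python
def plot_batches(snumbers, per_plot=9):
    """Group snumbers into consecutive batches, one batch per Circos plot."""
    snumbers = list(snumbers)
    return [snumbers[i:i + per_plot] for i in range(0, len(snumbers), per_plot)]
```

For example, a GI with 20 mapped snumbers would yield three plots holding 9, 9 and 2 snumber histograms.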

Another script was written in order to automatically produce PDF files of the plots. This

script had three functions: (1) use the Circos software to draw the plots in PNG format,

(2) convert PNG files into PDF files, (3) merge all PDF files of a GI into a single PDF

file.

All merged PDF files were named after their GI, with a "_total.pdf" suffix, and can be found in the Supplemental Material.

The Python scripts are annotated and can also be found in the Supplemental Material.

Script modification

The script used for writing the configuration files was modified in order to increase the size of the histogram bars plotted by Circos. The script now iterates through all positions of the reference genome and collapses consecutive positions that have the same read depth value. In this way, each histogram bar spans several positions and is therefore wider, instead of one bar being created per position. In addition, the script now forces Circos to increase the size of the PNG file from 1000x1000 to 3000x3000 pixels.

Ultimately, these changes increased the number of positions represented in the plots

when the size of the genome studied is large.
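The collapsing of positions with equal read depth is essentially a run-length encoding of the depth profile. The sketch below shows one way to do it, assuming that the collapsed positions are consecutive and that the output uses the "chromosome start end value" layout of Circos data files; the function names are hypothetical and this is not the modified thesis script itself.

```python
def collapse_runs(depth_by_position, genome_length):
    """Merge consecutive positions with equal depth into (start, end, depth) runs.

    Positions missing from depth_by_position are treated as depth 0.
    Assumes genome_length >= 1 and 1-based positions.
    """
    runs = []
    run_start, run_depth = 1, depth_by_position.get(1, 0)
    for pos in range(2, genome_length + 1):
        depth = depth_by_position.get(pos, 0)
        if depth != run_depth:
            # The depth changed: close the current run and start a new one.
            runs.append((run_start, pos - 1, run_depth))
            run_start, run_depth = pos, depth
    runs.append((run_start, genome_length, run_depth))
    return runs


def circos_histogram_lines(gi, runs):
    """Format runs as 'chromosome start end value' lines."""
    return ["{} {} {} {}".format(gi, s, e, d) for s, e, d in runs]
```

Because each run becomes a single bar, a long genome with stretches of constant depth produces far fewer and wider bars than one bar per position.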

Figure 3. A Circos plot example. This plot shows the proviral sequence of the Avian leukosis virus (GI 14324120). In this case, the reference genome (outer ideogram) is widely covered, except for the envelope protein gene, by the accumulated read count (green histogram) and by the reads of the different snumbers (inner histograms). CDS regions are labeled in the outer radius of the ideogram and highlighted inside it.

Results

Alignment and mapping with BWA

Viruses

All virus GIs were downloaded from the NCBI database and prepared for alignment

with the reads. Only those read-GI combinations that met the 200 base pair requirement

were aligned. BAM files for each GI were created and used for mapping with SAMtools

"mpileup", producing their respective pileup text file. Some pileups were empty,

meaning that none of the reads matched against the reference genome. These files were

marked and discarded.

Reads mapped successfully against 42 of the 90 (~47%) GIs from the list of possible cancer drivers. Some GIs, like those corresponding to Human parvoviruses, had a large number of reads mapped from several sequencing libraries, but in most cases the GIs had few reads mapping to them.

In the contamination list, 115 of 159 (~72%) GIs had at least one read mapped to them. In contrast to the previous virus list, the majority of these GIs, such as the Avian leukosis virus or Parvo-like virus GIs, had a very large number of reads mapped from different samples.

It is notable that some of the GIs could not be mapped, even though all of them had previously been found in the BLASTn search. This conflict led us to repeat the alignment of the reads against the reference genomes. A random sample of the virus GIs that could not be mapped was re-aligned with the BWA MEM algorithm without the 200 base pair threshold. Again, none of the reads could be mapped. We believe that the discrepancy originates from the different tolerance to sequence divergence of the two methodologies. BLASTn allows the exploration of distantly related sequences, whereas BWA is optimized for the rapid identification of sequences highly similar to a reference genome.

Propionibacterium acnes

The selected human-depleted snumber reads were aligned and mapped against the 12 P. acnes complete genomes. Pileup files were created and checked as was done for the virus genomes. Again, some pileup files were empty and were thus marked and discarded. All genomes had reads successfully mapped, so all the GIs could be plotted.

Circos plots

Circos plots were drawn using their respective configuration files, and the final PDF files were created. Thus, the PDF for a given GI contains one or more plots. Each plot has a minimum resolution of 1000x1000 pixels and a maximum of 3000x3000 pixels, depending on the size of the GI.

Each plot consists of a circular ideogram that represents the sequence of the GI (Figure 3). The ideogram is filled with the CDS regions parsed from the GI feature table. These CDS regions are highlighted with different colors to facilitate visualization, and each CDS product name is labeled in the outer radius of the circular plot. In addition, ticks were added to display a base pair scale, making it easier to recognize regions of the sequence. Several histograms are represented in the inner area of the plot. The green histogram corresponds to the accumulated read count at each position of the sequence. This green histogram is unique and is the same for all the plots of a given GI. Orange histograms represent the depth of coverage for each snumber at each position (binned for representation). The maximum height of the histograms is the highest read count value found in the accumulated histogram. It is important to note that even if some snumber histograms are just thin lines, they can still show full read coverage of the reference genome, which would mean that this genomic sequence is present in that snumber.

Viruses suspected of being cancer drivers

48 virus GIs were successfully plotted. As expected from the mapping of the reads, most reference genomes had a low number of mapped reads. However, some viruses were more abundant among the samples. For example, John Cunningham virus (JC virus) was present in some bladder cancer urine samples (Figure 4.a), and Human parvovirus B19 (GI 291045153) was present in melanoma and basal cell carcinoma samples (Figure 4.b). Human papillomaviruses were observed either in melanoma (GI 238623442) or in oral cavity samples (GI 238623442). On the other hand, only two of the five complete genomes of Human adenovirus strains (GI 317140354 and GI 602243850) were present in cancer samples, whereas the hexon gene of different Human adenovirus strains was consistently found in many cancer samples. Notably, the fiber protein genes of the complete Human adenovirus genomes were missing (Figure 4.c). One of four Human herpesvirus types (Human herpesvirus 6, GI 587652027) was found in mycosis fungoides samples (Figure 4.d); mycosis fungoides is the most common form of cutaneous T-cell lymphoma. Merkel cell polyomavirus (GI 560940486) was observed in several melanoma samples.

Figure 4. (A) John Cunningham virus complete genome (GI 2246606). The read coverage value is high for almost every position. (B) Human parvovirus complete genome (GI 291045153). More sporadic reads can be observed. (C) Complete genome of Human adenovirus C strain DD28 (GI 602243850). The fiber protein gene is not present (marked zone). (D) Complete genome of Human herpesvirus 6A isolate GS. This strain seems to be present in some snumber samples, although the reads are spread along the sequence.

A Bufavirus-2 strain (GI 395318841) seemed to be present in two melanoma snumber samples, since several reads mapped at positions distributed throughout the genome (Figure 5).

For the rest of the viruses, only one or two plots were made (fewer than 18 snumbers). In addition, reads showed very low coverage in most cases, which would mean that those sequences were not present in the sample. This reduced the list of possible candidates to a few viruses.

Viral contamination

A total of 115 virus GIs were successfully plotted. In general, a high number of snumbers per reference genome could be observed. In many cases the snumber reads almost fully covered their reference genome, which would mean that the studied GI is actually present in that snumber sample. This was observed for most Avian leukosis and Parvo-like viruses.
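As an illustration of how "almost full coverage" of a reference genome can be quantified, the following minimal Python sketch computes the breadth of coverage, that is, the fraction of reference positions covered by at least one read. It is not the script used in this study: it assumes a three-column, tab-separated per-position depth table (reference, position, depth) such as the one written by `samtools depth`, and the function names and the toy reference name `gi|0` are hypothetical.

```python
def parse_depth_lines(lines):
    """Parse tab-separated 'reference<TAB>position<TAB>depth' lines.

    Returns a dict mapping 1-based position -> read depth.
    """
    depths = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        _ref, pos, depth = line.split("\t")
        depths[int(pos)] = int(depth)
    return depths


def breadth_of_coverage(depths, genome_length, min_depth=1):
    """Fraction of reference positions covered by at least min_depth reads.

    Positions absent from `depths` are treated as depth 0.
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    covered = sum(1 for d in depths.values() if d >= min_depth)
    return covered / genome_length


# Toy example: a 5 bp reference in which positions 1, 2 and 5 are covered.
example = ["gi|0\t1\t3", "gi|0\t2\t3", "gi|0\t3\t0", "gi|0\t5\t7"]
depths = parse_depth_lines(example)
print(breadth_of_coverage(depths, genome_length=5))  # 3 of 5 positions
```

A value close to 1 would correspond to the "fully covered" plots described above, whereas sporadic reads would yield values close to 0.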

In addition, some interesting patterns could be observed. For example, the genome of Avian leukosis virus was consistently found in many samples (Figure 3). Surprisingly, its envelope protein gene was missing in most cases. This pattern was also found in Rous sarcoma virus: the virus could be found in many samples, but its tyrosine kinase gene was missing (Figure 6).
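The "missing gene" pattern was identified here by visual inspection of the plots. A possible way to flag it automatically is sketched below, under the assumption that per-position read depths and CDS coordinates (for example, from a feature table) are available. The function names, the toy feature names `gag` and `env`, and the 10% threshold are hypothetical choices made for illustration only.

```python
def feature_coverage(depths, features, min_depth=1):
    """Return {feature_name: fraction of its positions with depth >= min_depth}.

    depths:   dict of 1-based position -> read depth (missing means 0).
    features: list of (name, start, end) with 1-based inclusive coordinates.
    """
    result = {}
    for name, start, end in features:
        length = end - start + 1
        covered = sum(
            1 for pos in range(start, end + 1) if depths.get(pos, 0) >= min_depth
        )
        result[name] = covered / length
    return result


def missing_features(depths, features, max_fraction=0.1):
    """Names of features whose covered fraction is at most max_fraction."""
    coverage = feature_coverage(depths, features)
    return [name for name, frac in coverage.items() if frac <= max_fraction]


# Toy genome: positions 1-10 are covered, positions 11-20 are not.
toy_depths = {pos: 5 for pos in range(1, 11)}
toy_features = [("gag", 1, 10), ("env", 11, 20)]
print(missing_features(toy_depths, toy_features))  # ['env']
```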

Figure 5. Complete CDS regions of Bufavirus-2 strain BF.39 NS1, putative VP1, a hypothetical protein, and VP2 genes. Although the sequence is not fully covered, the distribution of the reads could mean that it is actually present in the samples.

Figure 6. Rous sarcoma virus (GI 3003000). Almost all of its genome is covered in most of the snumbers. However, its tyrosine kinase gene (in light blue) is missing.

Some particular observations were made, for example, for the GI sequences 145314587 and 208928. GI 145314587 corresponds to a Semliki Forest virus expression vector (Figure 7.a), which was used in a neurobiological study24. This vector includes a unique CDS annotated region (a non-structural protein) that was not covered by the reads, whereas the rest of the vector was present in basal cell carcinoma and in B-lymphoma cell lines. On the other hand, GI 208928 corresponds to a synthetic construct used to study the promoter activity of the regulatory signals contained in the v-myb oncogene of Avian myeloblastosis virus25. These regulatory signals were inserted upstream of the herpes simplex virus type 1 thymidine kinase gene. This GI can be observed in more than 150 snumber samples, but it is not fully covered. No reads mapped to the thymidine kinase gene in any of the samples, although a small region at the start of this gene seemed to be more or less covered, depending on the snumber observed (Figure 7.b).

Torque Teno Virus (TTV) and SEN virus were found in many Chronic Myelogenous

Leukemia (CML) samples. These viruses are known for being spread worldwide, and

have been detected in different types of biological samples, such as blood, saliva and

feces26, which would explain their presence in CML samples.

Nevertheless, there were some exceptions. For example, Acanthamoeba viruses appeared to be sporadically covered, with very few reads mapped along the sequence. This also happened in virus families like Megaviridae and Mimiviridae (Figure 8). All these viruses are known large DNA viruses, and their complexity might be the reason why they could not be found in these samples.

Figure 7. (A) Semliki Forest virus expression vector (GI 145314587). Only the actual vector is present in the samples. Red CDS region corresponds to a non-structural protein. (B) Avian myeloblastosis virus v-myb oncogene promoter region fused to HSV-1 thymidine kinase gene (GI 208928). Red CDS region corresponds to the kinase gene.


All 12 complete genomes of P. acnes had at least one snumber read correctly mapped and were therefore successfully plotted. In these plots, a scattered, dispersed distribution was observed along the snumber histograms, even in those that seemed to fully cover the reference genome. To check whether this scattered pattern was due to a Circos error, we compared the positions actually represented in the plots with the positions that were supposed to be represented. This test showed that many positions were missing, so the script that parsed the configuration files was modified, and the plots were redrawn (more details in Methods). The numbers of sequence positions represented by histograms in the old and new plots were compared using the Circos debugging feature. The new plots displayed an average of ~121% more positions than the previous ones, and almost 219% more positions in some cases (Figure 9).

Figure 8. Large DNA viruses were not present in the samples. (A) Complete genome

of Acanthamoeba castellanii mamavirus strain Hal-V (GI 351737110). (B) Complete

genome of Megavirus chiliensis (GI 350610932). A few sporadic reads can be

observed. These plots also show the high complexity of these genomes.

Most importantly, two or three regions with disproportionately high read depth were observed in the plots, depending on the strain studied (Figure 10). The ideograms were white at these exact positions, which meant that there were no CDS features. A check of the feature tables of the P. acnes strains revealed that these positions correspond to the 5S, 16S and 23S ribosomal RNA genes.

The plots showed that reads covered P. acnes genomes in skin-related samples like breast and oral cavity cancer, myelomas, melanomas and basal cell carcinomas, and also in optic neuritis samples. In contrast, there were few or no traces of P. acnes in many other sample types, like blood samples and pancreatic and colon cancer.

Figure 9. Improvement of the plot quality for Propionibacterium acnes plots. (A)

Before collapsing consecutive positions with the same read depth value. (B) After

collapsing consecutive positions with the same read depth value. The quality of the plot

and the number of histograms displayed increased greatly.

Figure 10. Some accumulated regions were observed in all P. acnes genomes.

Discussion

Viral cancer drivers

We observed the presence of some viruses known to be related to cancer development, such as Human papillomavirus and Merkel cell polyomavirus. These viruses acted as a sort of positive control, since they are more likely to be found in cancer samples. Human papillomavirus was found in oral and melanoma samples, which makes sense since it is known to infect keratinocytes of the skin and mucous membranes. Merkel cell polyomavirus was also found in melanoma samples. This virus is suspected to cause 80% of Merkel cell carcinomas (a rare and aggressive form of skin cancer)27, so it is reasonable to find this virus in skin-related samples. Human herpesvirus 6 was found in cancer samples, specifically in mycosis fungoides samples. This type of herpesvirus has already been associated with Hodgkin's lymphomas28, so its presence is not a surprise.

The presence of Bufavirus in melanoma samples was not expected, since it has mostly been found in feces of patients with severe diarrhea29. This virus could be a new candidate for further studies to check its true relation to the melanoma samples.

We also observed the presence of the JC virus in bladder cancer urine samples. JC virus

is a type of human polyomavirus that can infect kidney epithelial cells30, which would

explain its presence in the urine samples. As far as we know, this virus has not been

related to bladder cancer, and it could be another interesting candidate for further

studies.

Nevertheless, no reads could be mapped by the BWA aligner to half of the GIs studied. In addition, many of the remaining GIs were poorly covered, and were therefore probably not present in the samples. However, this result was expected, since it has been estimated that only 20% of cancer cases are caused by viruses and other microorganisms.

Virus contamination

From a total of 159 GI sequences, 115 had at least one snumber read correctly mapped. Different GIs representing the Avian leukosis virus constituted the most abundant sequences in the list of contaminant viruses, although this overrepresentation could bias the finding. Other viruses, such as Rous sarcoma virus and some Parvoviridae viruses, were also present in many different cancer samples. Curiously, these viruses had in common that some of their genes were not covered by reads, although no explanation was found. On the other hand, SEN virus and Torque Teno virus could also be good candidates for contaminant entities.

We could also observe that some GI sequences were only partially covered because they were synthetic constructs rather than real viruses. In these cases, only the non-synthetic part of the sequence was actually present in the samples. We suggest that the list of GIs should be carefully checked before analyzing all the data. This would save human and computing resources, since only the sequences of interest would be studied. Moreover, some of these sequences corresponded to partial genomes or just single genes. The presence of partial genomes or single genes in the samples would not tell much about the real abundance of the actual organism, and thus they were not as relevant as the complete genome sequences. Furthermore, not all of these sequences could be found among the samples, since the reads did not fully (or almost fully) cover the reference genome. It should be noted that we might not have found many of the GIs because their sequences can be very variable and complex, and therefore the reads did not map against them. Since the BWA alignment algorithm is very strict, this idea is conceivable, and could be the case for large DNA virus families like Mimiviridae and Megaviridae.

Despite these obstacles, the studied sequences comprised a wide variety of viruses. This allowed us to draw different conclusions and point out some contaminant candidates. The most noticeable case is the Avian leukosis virus, whose strains were consistently found in more than 250 snumber samples. With all this information, a database containing all viruses suspected of being contaminants could be built. This would be useful, since distinguishing between contaminant entities and real disease-causing agents is a hard task for researchers. It would not be the first time that a huge amount of time and resources is spent studying a possible disease-causing agent that is, unfortunately, a contaminant. Nevertheless, these findings have not been published yet and the data are still confidential, so this idea cannot be carried out for now.


Missing information

When the P. acnes plots were checked for the first time, snumber histograms appeared as scattered dots, although several snumber reads seemed to fully cover the P. acnes genomes. We realized that many histograms were not drawn by Circos, even when the read depth data was correctly parsed. We then learned that we were not using Circos the way it was designed to be used. The problem was that we did not parse the read depth information in the best way, which led Circos to create one histogram per position of the genome sequence. This is not a problem when the sequence is small and there is enough space inside the plot to display every histogram. However, if the genome is large (the P. acnes genome is approximately 2.5 Mb), not all histograms can be displayed. We drew a logical conclusion from this fact: if a histogram is smaller than a pixel of the plot, it cannot be drawn. Thus, the solution was to create bigger histograms, that is, to include more sequence positions in each histogram. Consequently, we made several modifications to the script that writes the configuration files. The script now collapses consecutive positions that have the same read depth value, and therefore bigger histograms can be drawn. The results of these modifications can be seen in Figure 9: the number of histograms drawn in the plot increased, which greatly enhanced visualization.
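The collapsing step described above can be sketched as follows. This is a minimal illustration and not the actual script of this study (SM4): it assumes that the read depth is available as a list with one value per position, and that the Circos data track uses the `chr start end value` line format. The function names and the chromosome label `gi0` are hypothetical.

```python
from itertools import groupby


def collapse_runs(depths):
    """Collapse consecutive positions with equal depth into (start, end, depth).

    depths: sequence where depths[i] is the read depth at 0-based position i.
    start and end are inclusive 0-based coordinates.
    """
    runs = []
    position = 0
    for depth, group in groupby(depths):
        length = sum(1 for _ in group)
        runs.append((position, position + length - 1, depth))
        position += length
    return runs


def to_circos_lines(chrom, runs, skip_zero=True):
    """Format runs as 'chr start end value' lines for a Circos data track."""
    return [
        f"{chrom} {start} {end} {depth}"
        for start, end, depth in runs
        if not (skip_zero and depth == 0)
    ]


per_position = [4, 4, 4, 0, 0, 9, 9, 2]
runs = collapse_runs(per_position)
print(runs)  # [(0, 2, 4), (3, 4, 0), (5, 6, 9), (7, 7, 2)]
print(to_circos_lines("gi0", runs))
```

Note that this approach only widens the histograms where the read depth is constant over neighbouring positions; regions in which the depth changes at every position still produce one histogram per position.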

Presence of P. acnes in cancer samples

The Circos plots showed evenly distributed reads, leading to almost full coverage of P. acnes in many samples, which indicates the presence of this bacterium in these samples. On the other hand, more sporadic coverage was observed in the rest of the samples, showing lower levels of P. acnes.

P. acnes was present in many skin-related cancer samples. Antiseptic sterilization is routine in surgical procedures such as skin sample collection. Despite this antiseptic treatment, not all microorganisms are killed, and bacteria present in the deep layers of the skin, like P. acnes, may not be affected at all13. For example, basal cell carcinoma is the most frequent type of skin cancer, and it arises in the deepest layer of the epidermis. If bacteria are not successfully removed before sampling, it is possible that P. acnes will contaminate the sample. This hypothesis can be applied to breast and vulva cancer, as well as melanomas.

The presence of P. acnes was also observed in optic neuritis plasma samples. Propionibacterium acnes has been associated with optic neuropathies before31, and it could be responsible for the inflammation of the optic nerves in our samples. However, we do not have access to the medical records of the patients sampled, so we cannot confirm this statement.

In summary, P. acnes is a widespread bacterium in the human skin microbiota, and it has been associated with several medical conditions13. Nevertheless, its real role is still controversial, since it is a well-known contaminant of surgical wounds and collected samples17. Thus, it is very difficult for researchers to differentiate between contamination and a real disease-causing agent. In this particular study, we show the abundant presence of this bacterium in several types of cancer samples, but we cannot confirm that this microorganism participates in the development of cancer, since the presence of P. acnes in these samples can easily be due to contamination. Nonetheless, these findings point to P. acnes as a highly suspicious entity, which should be studied carefully using tissue-specific approaches to reveal its true role in the described diseases.

Utility of Circos plots

Circos plots have proven to be a useful tool to analyze possible patterns of presence of the GIs studied, although several points could be improved. Firstly, the ideograms and histograms are huge compared to the text displayed. This makes the manual inspection of hundreds of plots a harder task than expected, since zooming in and out of the plot consumes extra time, which adds up to a considerable amount of effort. This problem could be partially solved by creating a list describing the order of the snumbers that appear in the plots. Unfortunately, all configuration and pileup files were erased before this function was applied, and thus it would be necessary to repeat the whole bioinformatics pipeline, which was not done.

Secondly, CDS labels and highlights start to overlap when the GI represented has many CDS regions crowded in the same area. This is an obstacle when trying to identify which CDS each label belongs to. This problem increases when there is not enough space in the plot to display all the labels in a region of the ideogram. Circos is not designed for this purpose, so there is no satisfactory solution for now.

Thirdly, we can still miss information when the GI represented exceeds 1 Mb, as happened with the sequences of P. acnes. As discussed before, histograms are not displayed if they are smaller than a pixel. One solution could be to convert the read depth value into a binary value: presence or absence of reads at the position. In this case, however, we would lose the coverage information, which is the very purpose of the histograms. A less drastic solution would be to divide the sequence into intervals and represent in the histogram the mean read depth of each interval. This way, each histogram would be composed of a higher number of sequence positions, and therefore its size would be bigger. This option was not implemented, since our collaborators were satisfied with the Circos plots.
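Although the interval-based option was not implemented in this study, it could look roughly like the following sketch, which averages the read depth over fixed-width bins of consecutive positions. The function names are hypothetical, and the limit of 3000 bins per plot is an arbitrary assumption standing in for the number of histograms that the plot can actually display.

```python
def bin_mean_depth(depths, bin_size):
    """Average read depth over fixed-width bins of consecutive positions.

    Returns (start, end, mean_depth) tuples with inclusive 0-based coordinates;
    the last bin may be shorter than bin_size.
    """
    if bin_size < 1:
        raise ValueError("bin_size must be at least 1")
    bins = []
    for start in range(0, len(depths), bin_size):
        window = depths[start:start + bin_size]
        bins.append((start, start + len(window) - 1, sum(window) / len(window)))
    return bins


def bin_size_for_plot(genome_length, max_bins):
    """Smallest bin size that keeps the number of bins at or below max_bins."""
    return max(1, -(-genome_length // max_bins))  # ceiling division


per_position = [2, 4, 6, 8, 0, 0, 10]
for start, end, mean_depth in bin_mean_depth(per_position, 3):
    print(start, end, round(mean_depth, 2))

# For a genome of about 2.5 Mb and at most 3000 histograms per plot:
print(bin_size_for_plot(2_500_000, 3000))  # 834
```

Unlike the binary presence/absence representation, this keeps an approximate measure of coverage for every region, at the cost of smoothing out narrow peaks.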

Differences between BLASTn and BWA alignment results in the

virus genomes

Although we were able to find many viruses in the cancer samples, some of the virus GIs had no reads mapped against them when using BWA. This would mean that these viruses are not present in the cancer samples. However, all virus GIs could be found in the BLASTn taxonomic assignment. This conflict has two possible explanations: (1) the GI-snumber read combinations did not meet the 200 base pair requirement, and thus no reads were aligned for those GIs; (2) the algorithmic differences between BLASTn and BWA MEM produce different results. BLASTn is a local aligner, while BWA MEM attempts a semi-local alignment. In addition, the BLASTn search was performed under relaxed conditions, which allowed a read to be taxonomically assigned to a GI even if its similarity to the GI was low. On the other hand, BWA MEM is very restrictive, and only reads with high similarity (nearly 100%) are counted as matched.

A sample of the viruses that could not be mapped was re-aligned without the 200 base pair threshold, without success. Thus, the algorithmic differences between BLASTn and BWA are probably responsible for the contradictory results. A quick check of the taxonomic characterization of a random sample of the unmapped reads confirmed that these reads had low similarity values. These values are not high enough for the BWA algorithm, and therefore those reads could not be mapped.

Lack of negative controls

Healthy human tissues were not available from hospitals when the cancer samples were collected, and consequently there are no negative controls in this study. Therefore, although we have found several GIs (P. acnes, Parvoviruses, Papillomaviruses, etc.) within the samples, we cannot prove that they are associated with either contamination or cancer, since we had no healthy samples to compare with. Nonetheless, other arguments can provide some insight into which entities could be contaminants. If a virus or bacterium is a contaminant, it should be found in a large number of different samples from different origins. In addition, the organism should not be associated with a specific group of cancer samples, although it could be related to a specific technique or method. This can be tested by associating the presence of an entity with either cancer samples or snumbers. Furthermore, a possible contaminant has probably not been described before as a disease-causing agent. More arguments could be made, but ultimately, if the entity studied meets these conditions, it is a good candidate to be a contaminant.
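The first two criteria above (presence in many samples, and no association with a specific sample type) could be screened with a simple heuristic such as the one sketched below. This is only an illustration of the reasoning and not an analysis performed in this study: the sample identifiers, the function names and both thresholds are hypothetical, and a proper assessment would require a statistical test and, ideally, healthy controls.

```python
from collections import defaultdict


def prevalence_by_sample_type(presence, sample_types):
    """Fraction of samples of each type in which the entity was detected.

    presence:     dict of sample id -> bool (entity detected or not).
    sample_types: dict of sample id -> sample type label.
    """
    totals = defaultdict(int)
    positives = defaultdict(int)
    for sample, sample_type in sample_types.items():
        totals[sample_type] += 1
        if presence.get(sample, False):
            positives[sample_type] += 1
    return {t: positives[t] / totals[t] for t in totals}


def looks_like_contaminant(prevalence, min_types=3, min_prevalence=0.5):
    """Heuristic flag: frequently detected across several sample types."""
    widespread = [t for t, frac in prevalence.items() if frac >= min_prevalence]
    return len(widespread) >= min_types


# Hypothetical toy data: six samples from three sample types.
sample_types = {
    "s1": "melanoma", "s2": "melanoma",
    "s3": "bladder", "s4": "bladder",
    "s5": "colon", "s6": "colon",
}
presence = {"s1": True, "s2": True, "s3": True, "s4": False, "s5": True, "s6": True}

prevalence = prevalence_by_sample_type(presence, sample_types)
print(prevalence)
print(looks_like_contaminant(prevalence))  # True
```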

Conclusion

In this study, we were able to observe the abundance of several taxonomic entities in different cancer specimens. We proposed some viral contaminants and suggested several other possible viral drivers of cancer. We also analyzed the consistent presence of Propionibacterium acnes in these cancer samples. We concluded that this bacterium is a highly suspicious organism and should be the subject of further studies. However, since this study was performed without healthy controls, we cannot confirm our hypothesis. Despite this fact, we believe that the arguments and results shown in this study are valid enough to warrant further investigation in this direction, as they could eventually lead to the discovery of new diagnostic methods or vaccines.

Acknowledgements

I would like to thank my colleagues at the Center for Biological Sequence Analysis, in particular Jens Friis-Nielsen, Marisa Matey and John Damm Sørensen, for helping me get started, and Professor Søren Brunak for the opportunity to join the Integrative Systems Biology group. I also thank our collaborators at the Center for Geogenetics: Dr. Anders Johannes Hansen, Dr. Kristin Rós Kjartansdóttir, Dr. Sarah Mollerup, Maria Asplund and Dr. Tobias Mourier for their support and expertise.

I would like to especially thank my mentor José MG Izarzugaza for his guidance and patience throughout this Master Thesis. This has been a unique experience that will allow me to continue developing myself as a bioinformatician in Denmark.

Annexes. Supplemental Material

All Supplemental Material can be found in this link:

https://www.dropbox.com/sh/lz8gywt27jkym6h/AAD_n0ZOXlcBfSR4pJk4m4QUa?dl=0

SM1. Table containing all the snumbers and their respective samples.

SM2. GI lists of virus contaminants, viral cancer drivers and P. acnes. Each list

describes whether the GI is plotted and a short description of the sequence.

SM3. A compressed RAR file with all the drawn plots, divided according to the list they

belong.

SM4. Python script that created and parsed the files necessary to draw the Circos plots.

SM5. Python script with three functions. (1) Use Circos software to produce plots in

PNG format. (2) Convert the PNG files into PDF files. (3) Merge all PDF files into a

single PDF file per GI.

References

1. zur Hausen, H. The search for infectious causes of human cancers: Where and why.

Virology 392, 1–10 (2009).

2. Akagi, K. et al. Genome-wide analysis of HPV integration in human cancers reveals

recurrent, focal genomic instability. Genome Res. 24, 185–199 (2014).

3. Araujo, A. & Hall, W. W. Human T-lymphotropic virus type II and neurological

disease. Ann. Neurol. 56, 10–19 (2004).

4. Coffin, J. M. Retroviruses. (CSHL Press, 1997).

5. Lynch, R. M., Shen, T., Gnanakaran, S. & Derdeyn, C. a. Appreciating HIV type 1

diversity: subtype differences in Env. AIDS Res. Hum. Retroviruses 25, 237–248

(2009).

6. Robertson, D. L. et al. HIV-1 nomenclature proposal. Science 288, 492–505 (2000).

7. Segata, N. et al. Computational meta'omics for microbial community studies. Mol.

Syst. Biol. 9, 1-15 (2013).

8. Allander, T., Emerson, S. U., Engle, R. E., Purcell, R. H. & Bukh, J. A virus

discovery method incorporating DNase treatment and its application to the

identification of two bovine parvovirus species. Proc. Natl. Acad. Sci. U. S. A. 98,

11609–11614 (2001).

9. Reyes, G. R. & Kim, J. P. Sequence-independent, single-primer amplification

(SISPA) of complex DNA populations. Mol. Cell. Probes 5, 473–481 (1991).

10. Depledge, D. P. et al. Specific Capture and Whole-Genome Sequencing of Viruses

from Clinical Samples. PLoS One 6, e27805 (2011).

11. Vinner, L. et al. Investigation of Human Cancers for Retrovirus by Low-Stringency

Target Enrichment and High-Throughput Sequencing. Sci. Rep. 5, 13201 (2015).

12. Zavala, G. & Cheng, S. Detection and Characterization of Avian Leukosis Virus in

Marek's Disease Vaccines. Am. Assoc. Avian Pathol. 50, 209–215 (2006).

13. Perry, A. & Lambert, P. Propionibacterium acnes: infection beyond the skin. Expert

Rev. Anti. Infect. Ther. 9, 1149–1156 (2011).

14. Purdy, S. & de Berker, D. Acne vulgaris. Lancet 379, 361–372 (2012).

15. Liu, J. Draft Genome Sequences of Propionibacterium acnes Type Strain

ATCC6919 and Antibiotic-Resistant Strain HL411PA1. Genome Announc. 2, 6–7

(2014).

16. Holland, C. et al. Proteomic identification of secreted proteins of Propionibacterium

acnes. BMC Microbiol. 10, 230 (2010).

17. Rood, I. G. H., de Korte, D., Savelkoul, P. H. M. & Pettersson, A. Molecular

relatedness of Propionibacterium species isolated from blood products and on the

skin of blood donors. Transfusion 51, 2118–24 (2011).

18. Fassi Fehri, L. et al. Prevalence of Propionibacterium acnes in diseased prostates

and its inflammatory and transforming activity on prostate epithelial cells. Int. J.

Med. Microbiol. 301, 69–78 (2011).

19. Lysholm, F. et al. Characterization of the Viral Microbiome in Patients with Severe

Lower Respiratory Tract Infections, Using Metagenomic Sequencing. PLoS One 7

(2012).

20. Krzywinski, M. et al. Circos: An information aesthetic for comparative genomics.

Genome Res. 19, 1639–1645 (2009).

21. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler

transform. Bioinformatics 25, 1754–60 (2009).

22. Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM.

Oxford Univ. Press 00, 1–3 (2013).

23. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics

25, 2078–2079 (2009).

24. Ehrengruber, M. U. & Goldin, A. L. Semliki Forest virus vectors with mutations in

the nonstructural protein 2 gene permit extended superinfection of neuronal and

non-neuronal cells. J. Neurovirol. 13, 353–63 (2007).

25. Janine, C., Johann, S. & Bernard, P. A Cryptic Transcription Promoter in the myb

Oncogene of Avian Myeloblastosis Virus. Virology 150, 252–259 (1986).

26. Gallian, P. et al. TT virus: a study of molecular epidemiology and transmission of

genotypes 1, 2 and 3. 17, 43–49 (2000).

27. Kean, J. M., Rao, S., Wang, M. & Garcea, R. L. Seroepidemiology of Human

Polyomaviruses. J. Clin. Virol. 5, (2009).

28. Torelli, G. et al. Human herpesvirus-6 in human lymphomas: identification of

specific sequences in Hodgkin’s lymphomas by polymerase chain reaction. Blood

77, 2251–2258 (1991).

29. Väisänen, E. et al. Bufavirus in feces of patients with gastroenteritis, Finland.

Emerg. Infect. Dis. 20, 1077–80 (2014).

30. Kijpittayarit, S. & Razonable, R. R. JC Virus Infection After Transplantation:

Beyond the Classic Progressive Multifocal Leukoencephalopathy? Gastroenterol.

Hepatol. 3, 74–76 (2007).

31. Adesina, O. et al. Optic Neuropathy Caused by Propionibacterium acnes

Pachymeningitis. J. Neuroophthalmol. 1–4 (2014).

32. Mollerup S. et al. Propionibacterium acnes – disease causing agent or common

contaminant? Detection in diverse patient samples by next generation sequencing.

Submitted to J. Clin. Microbiol. (2016).
