IDENTIFICATION AND
CHARACTERIZATION OF NON-HUMAN
ENTITIES IN CANCER SAMPLES
Student: José Alejandro Romero Herrera
MASTER'S DEGREE IN BIOINFORMATICS AND COMPUTATIONAL BIOLOGY
ESCUELA NACIONAL DE SALUD - INSTITUTO DE SALUD CARLOS III
2014-2015
Center for Biological Sequence Analysis (CBS), Lyngby, Denmark
Supervisor: Assoc. Prof. José MG Izarzugaza, PhD
DATE: 02/2016
Index
Abstract
Objectives
Introduction
Cancer and Pathogen project and the importance of virus in cancer onset
Virus contamination in cancer samples
Detection of P. acnes in patient samples
Sample preparation and pre-processing
Computerome: High Performance Computing at DTU
Bioinformatics pipeline
Materials and methods
Samples and reads
Datasets of reference genomes
Alignment and mapping
Circos plots
Script modification
Results
Alignment and mapping with BWA
Viruses
Propionibacterium acnes
Circos plots
Viral contamination
Discussion
Viral cancer drivers
Virus contamination
Propionibacterium acnes
Utility of Circos plots
Differences between BLASTn and BWA alignment results in the virus genomes
Lack of negative controls
Conclusion
Acknowledgements
Annexes. Supplemental Material
References
Abstract
It has been estimated that almost one fifth of cancer cases are caused by viral and
microbial entities, but only a few viruses are known to trigger cancer onset. However,
many lymphoproliferative diseases in mammals are caused by retroviruses, and it is
therefore very likely that many viruses related to cancer remain unknown.
High-throughput sequencing techniques allow researchers to detect the presence of
these organisms, even if they constitute a minor fraction of the nucleotide content of
the infected cell. Unfortunately, current methodologies are sensitive enough to detect
not only the main entities present in the sample but also minority species, which are
sometimes spurious contaminants of the sample or of the laboratory reagents used.
Therefore, caution should be exercised when establishing causative associations with
disease. For example, Propionibacterium acnes (P. acnes) is a bacterium that has been
associated with several medical conditions, but since it is a known contaminant of
surgical wounds and tools, its role in these diseases is highly controversial. Thus,
identifying organisms that are contaminants of biological samples would ease this type
of research. This Master Thesis focuses on the taxonomic characterization of the
species present in 900 cancer specimens collected in different Danish hospitals. Our
results show the presence of several oncoviruses as well as other viral contaminants in
these cancer samples. We also confirm the presence of P. acnes among these samples,
and discuss its possible association with either cancer or contamination.
Objectives
- Improve the understanding of common bioinformatics tools, in particular those used in
the analysis of biological sequences and next generation sequencing (NGS) data.
- Identify possible oncoviruses in NGS data from cancer patient samples.
- Identify recurrent contaminants of viral origin in those samples.
- Study the implication of P. acnes in cancer onset and evaluate its suggested role as a
contaminant.
Introduction
Cancer and Pathogen project and the importance of virus in
cancer onset.
It has been estimated that viral and microbial infections play an essential role in cancer
development in almost one fifth of cancer cases (18.6%)1. Important viral entities
among these infections include human papillomaviruses (HPV)2, Epstein-Barr virus
(EBV) and hepatitis B and C viruses (HBV and HCV, respectively). Retroviruses are
also involved in cancer onset, although their relation to this disease is often still
debated3. Nevertheless, several animal lymphoproliferative diseases, such as leukemia
or lymphoma, are caused by retroviruses4. Thus, it is very likely that there are unknown
viruses related to human cancer and to lymphoproliferative diseases. Since only a few
prophylactic vaccines are available so far, the detection of such pathogens is both
challenging and necessary. The Cancer and Pathogen project was designed to identify
unknown viruses and microorganisms in cancer samples, and to recognize those
associated with cancer onset. This project is part of the GenomeDenmark platform, and
uses DNA and RNA purification techniques, as well as sequencing and several
bioinformatic approaches. The present Master Thesis was developed within the Cancer
and Pathogen project and focuses on the identification and characterization of entities
in cancer specimens, paying special attention to the role of viral species. In addition, I
explore the possible association of Propionibacterium acnes (P. acnes) with cancer
development. New diagnostic methods and anti-cancer vaccines could eventually be
derived from the microorganisms identified in this way, and genomics is a technology
that might achieve this goal relatively fast.
There are, however, several obstacles in viral discovery studies like ours. First, the
proportion of host-derived genetic material (human in our case) in a cancer sample is
usually many times larger than that of viral nucleic acids. Viral genomes are relatively
small, and hence constitute a minor fraction of the genetic material of the infected host
cell. In addition, the infected cells may be only a fraction of the whole sample and may
contain a low number of viral genome copies.
Second, extensive variation in virus sequences makes viral classification a challenge.
For example, human immunodeficiency virus 1 (HIV-1) inter-subtype sequences vary by
more than 35% in the env gene5 and by more than 10% in the gag and pol genes6. This
problem increases when additional species-specific genes contribute to sequence
variation between virus species. This is especially relevant when retrieving genomic
material from unknown species, since standard levels of PCR stringency would prevent
sufficient amplification of variant viral nucleic acids.
Fortunately, there are currently highly specific methods to overcome these obstacles.
High-throughput sequence-independent shotgun sequencing can be used to improve
the sensitivity of detection of unknown viral sequences, since it allows comprehensive
sampling of the whole set of sequences in the pool of organisms present in a given
complex sample. This method enables the evaluation of genome diversity and provides
a way to study microorganisms that are otherwise difficult or impossible to analyze in
vitro7. In addition, lower PCR stringency levels can be applied when the amplification
of variant sequences poses a challenge. The quantitative disproportion between host
and viral genomic material can be addressed by mechanical and enzymatic
procedures8. These procedures are based on DNase treatment followed by restriction
enzyme digestion and sequence-independent single primer amplification (SISPA9) of
the fragments. However, the DNase-SISPA method is not viable for studying integrated
or episomal viral nucleic acids. Instead, target capture (target enrichment by
hybridization10) can be used for enrichment of high-throughput sequencing libraries11,
and would be a more adequate approach for our study.
Virus contamination in cancer samples
Ultimately, these methods provide the ability to recover almost every possible sequence
contained in the sample. Distinguishing the entities that are actually related to cancer
onset from those that are not is a challenging step. In fact, these methods are so
sensitive that it is possible to detect in biological samples viral entities that are neither
oncoviruses nor otherwise related to human cancer development. Several of these
viruses should not be found in humans (e.g. Avian leukosis virus12). Contamination of
samples is the most reasonable explanation for these findings. Laboratory reagents and
materials, as well as human error, might be the cause of sample contamination. We
agreed with our collaborators at the Center for GeoGenetics (CGG) on a list of 159
reference genomes (GI entries) that are known contaminants of samples, lab kits and
reagents. In this study, the abundance of these genomes is analyzed and discussed.
Detection of P. acnes in patient samples
Propionibacterium acnes is a Gram-positive bacterium that is part of the healthy flora of
the skin, oral cavity, large intestine, conjunctiva and external ear canal13. A role for
P. acnes in the pathophysiological mechanisms of acne has already been proposed14
and its genome is well characterized. One study identified several genes encoding
enzymes that degrade skin and proteins that may be immunogenic15. Furthermore,
rapid growth of P. acnes can produce cellular damage and metabolic products,
triggering inflammation and other medical symptoms16.
In recent years, an increasing number of reports have implicated the bacterium as an
opportunistic pathogen responsible for a wide range of disease conditions13, such as
cerebrospinal shunt infections, ocular infections, sarcoidosis and prostate cancer.
Conversely, some studies postulate that P. acnes is a contaminant of surgical wounds,
blood products and tissue cultures13,17. Therefore, the speculated role of P. acnes in
prostate cancer18, among other diseases, is still highly controversial.
Since this bacterium is suspected of causing cancer, we also investigated the presence
of 12 P. acnes strains among the cancer samples.
Sample preparation and pre-processing
For the purpose of this study, we analyzed around 900 samples collected in different
Danish hospitals. Specimens included a wide range of cancer types: acute myeloid
leukaemia, basal cell carcinoma, breast cancer, colon cancer, vulva cancer, etc.
Samples and sequencing libraries were prepared by our partners at the Center for
GeoGenetics (CGG). This preparation was performed using different techniques, such
as virion or microbial enrichment and shotgun sequencing or target capture with specific
probes (articles submitted). Thus, each sample could be analyzed with more than one
technique, giving rise to more than 1300 sample-technique combinations, each of which
generated an independent library.
Library sequencing was performed by our partners at the Copenhagen outstation of the
Beijing Genomics Institute (BGI). Each library was sequenced independently, yielding a
total of more than 4 x 10^9 reads. This amount of data could not be processed on a
standard computer, so a supercomputer was used for the analysis.
Computerome: High Performance Computing at DTU
Computerome is the new Danish supercomputer for life sciences that connects several
universities and institutes around the world. It is composed of 16048 cores, with a total
of 96 TB of global RAM (DDR4). Jobs can be submitted to the Unix supercomputer to
run commands or self-made scripts. Each job may require a different configuration of
RAM, processing units (nodes, CPUs) and software modules. A task manager
supervises the system by controlling the execution of jobs and the allocation of shared
resources, in order to prevent overuse and collapse of the supercomputer. Furthermore,
the availability of resources allows analyses to be parallelized, which saves a
substantial amount of time in a demanding project such as ours.
Bioinformatics pipeline
The bioinformatic analysis performed on the raw paired-end sequencing data is similar
to the one performed by Lysholm et al. (2012)19. First, reads were pre-processed and
filtered. Second, human sequences were removed from the reads. This step is
necessary because we are only interested in the non-human fraction of the samples.
Third, human-depleted reads were queried against the NCBI nucleotide database with
a low-stringency BLASTn search to provide their taxonomic characterization, so that as
many reads as possible would be taxonomically assigned to an organism. The resulting
BLASTn hits were analyzed and different Genome Identifiers (GIs) were singled out. To
study the abundance of these GIs in the cancer samples, reads were then mapped
against them.
The raw results obtained in this last step are huge and consist only of large text files
organized by reference genome. This type of information is complex to interpret, and
manual inspection of the files would be a very time-consuming task. Parsing the data
files into coverage information for the reference genomes is an easy way to check for
the presence of a given reference genome among the samples. Nevertheless, the
coverage percentage of a reference genome does not show which of its regions are
actually present in the samples. This type of information can be very important because
it may reveal patterns among the samples. For example, if a region of a reference
genome is persistently found in all samples, further analysis of that region could
ultimately lead to potential discoveries.
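The difference between a coverage percentage and the covered regions can be illustrated with a small Python sketch. It assumes that read depth is already available as a list with one integer per reference position; the function names are hypothetical and are not taken from the thesis scripts.

```python
def coverage_percentage(depths):
    """Percentage of reference positions covered by at least one read."""
    if not depths:
        return 0.0
    covered = sum(1 for depth in depths if depth > 0)
    return 100.0 * covered / len(depths)


def covered_regions(depths):
    """1-based inclusive (start, end) intervals where read depth is above zero."""
    regions = []
    start = None
    for position, depth in enumerate(depths, start=1):
        if depth > 0 and start is None:
            start = position                      # a covered stretch begins
        elif depth == 0 and start is not None:
            regions.append((start, position - 1)) # the stretch ended one base ago
            start = None
    if start is not None:
        regions.append((start, len(depths)))      # stretch runs to the last base
    return regions


# Toy depth profile for a 10 bp reference (illustrative values only)
example = [0, 3, 3, 0, 0, 1, 2, 0, 0, 0]
print(coverage_percentage(example))  # 40.0
print(covered_regions(example))      # [(2, 3), (6, 7)]
```

Two samples can share the same percentage while covering entirely different regions, which is the information the percentage alone discards.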
Thus, although the text data can be processed and summarized in the form of
coverage percentages, this would still be a poor representation of all the retrieved
information, and another computational approach is needed. Visualization tools, such
as plots, are a more adequate way to display our data: a plot can show much more
information and is easy to read. These plots can represent the coverage count (how
many times a position of the genome is read) for each reference and for each sample.
The visualization tool chosen in this study is Circos20. Circos is a flexible software
package that displays data in an attractive circular layout, making it ideal to explore
connections between objects or positions of a genome (Figure 1). In addition, Circos
can be automated and incorporated into our pipeline, creating plots in PNG format.
Since our data can be easily parsed and Circos offers many different options, additional
data can be displayed. For example, it is useful to see which positions of the reference
genome correspond to coding DNA sequence (CDS) regions. This can help to identify
patterns or reveal new information beyond the coverage of the reference sequence. For
this purpose, we used the feature tables provided by the NCBI database for each
reference.
Thus, the last step of this pipeline is the processing and visualization of the mapped
human-depleted fraction of the reads in the form of Circos plots, which can then be
analyzed to draw conclusions. A flow chart describing this pipeline can be found in
Figure 2.
Figure 1. Graphical uses of
Circos plots.
Materials and methods
Samples and reads
Human cancer specimens were obtained from different Danish hospitals. Sequencing
libraries were prepared by CGG and were assigned a number, internally termed
"snumber". A table containing all the snumbers and their respective samples can be
found in the Supplemental Material. Several snumbers might represent the same sample
if they gave rise to different sequencing libraries, i.e. if a different treatment was applied
in the wet lab. Sequencing was performed at BGI on the Illumina HiSeq 2000 platform as
100 bp paired-end reads.
Pre-processing of raw reads was performed at CBS. First, paired-end reads were
pre-processed to remove remaining sequencing adaptors and filtered for low
sequencing quality using AdapterRemoval. In addition, overlapping pairs were collapsed
into a single read. Second, human sequences were removed from the sequencing
libraries. Reads were mapped to version hg38 of the human genome using the BWA
MEM alignment algorithm21, 22, which is suited to query sequences ranging from short
reads to a few megabases. Reads showing similarity to the human reference were
discarded and the remaining reads proceeded to further analysis. If only one read of a
pair belonged to the non-human fraction, it proceeded as a singleton. Finally, low
complexity regions were filtered out using the DustMasker algorithm. All read files were
in FASTQ format.
Figure 2. Flow chart that represents all steps of the study. (A) Sample preparation
and read pre-processing made at CGG and CBS, respectively. (B) Bioinformatics
pipeline developed in this thesis.
Third, the human-depleted reads were queried against the NCBI non-redundant
nucleotide database. BLASTn hits targeting the same organism were considered
together, and the percentage of the organism's genome covered by at least one
BLASTn hit was computed. A BLASTn hit was only considered if its similarity was
greater than or equal to 20% and its e-value smaller than 0.001. Reads that could not
be characterized were assigned to a "Dark Matter" group.
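The hit filter described above can be sketched in Python as follows. This is a minimal illustration, not the script used in the thesis: it assumes BLASTn results in the standard 12-column tabular output, it treats the percent identity column as the "similarity" value, and the function names are hypothetical.

```python
MIN_SIMILARITY = 20.0  # percent, threshold stated in the text
MAX_EVALUE = 0.001     # hits must have an e-value strictly below this


def parse_hit(line):
    """Parse one line of 12-column tabular BLAST output (assumed format)."""
    fields = line.rstrip("\n").split("\t")
    return {
        "read": fields[0],              # query (read) identifier
        "subject": fields[1],           # subject (database sequence) identifier
        "similarity": float(fields[2]), # percent identity, used as similarity here
        "evalue": float(fields[10]),
    }


def classify_reads(read_ids, blast_lines):
    """Return (assigned, dark_matter): accepted hits per read, and reads without any."""
    assigned = {}
    for line in blast_lines:
        if not line.strip():
            continue
        hit = parse_hit(line)
        if hit["similarity"] >= MIN_SIMILARITY and hit["evalue"] < MAX_EVALUE:
            assigned.setdefault(hit["read"], set()).add(hit["subject"])
    dark_matter = [read for read in read_ids if read not in assigned]
    return assigned, dark_matter


# Made-up hits for illustration only
hits = [
    "r1\tgi|1\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-40\t180",
    "r2\tgi|2\t15.0\t100\t85\t0\t1\t100\t1\t100\t1e-10\t50",  # similarity too low
    "r3\tgi|3\t90.0\t30\t3\t0\t1\t30\t1\t30\t0.5\t20",        # e-value too high
]
assigned, dark_matter = classify_reads(["r1", "r2", "r3", "r4"], hits)
print(assigned)     # {'r1': {'gi|1'}}
print(dark_matter)  # ['r2', 'r3', 'r4']
```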
Datasets of reference genomes.
The analysis of the BLASTn results suggested three different groups of organisms,
identified by their Genome Identifiers (GI): 159 GIs of known contaminants, 90 GIs of
possible viral cancer drivers, and 12 GIs of Propionibacterium acnes complete
genomes. The first two lists were composed of different types of viral sequences: full
genomes, partial genomes or just gene regions of a virus. Examples from the
contamination list include Avian leukosis virus, Avian myeloblastosis-associated virus,
Parvo-like hybrid virus and Acanthamoeba mimivirus. Different variants of the Avian
leukosis virus constitute the most abundant organism in this list (56 GIs), followed by
Parvo-like viruses and Propionibacterium phages (11 GIs each), and Avian
myeloblastosis virus (7 GIs).
The cancer driver list includes, among others, Human adenoviruses, Human
herpesviruses and Human papillomaviruses. All lists are annexed in the Supplemental
Material.
All reference genomes were named after their GI, so a reference genome and its GI are
referred to interchangeably. FASTA sequences and feature tables were downloaded
from the NCBI database, using the NCBI Batch Entrez website
(http://www.ncbi.nlm.nih.gov/sites/batchentrez).
Alignment and mapping
For the virus genomes, not every read was aligned to every GI. In order to reduce the
use of computing resources, a minimum cutoff of 200 base pairs was set. Only those
read-GI combinations that met this requirement in the BLASTn taxonomic assignment
of the reads were aligned.
Alignment of the reads to the reference genomes was performed using the
Burrows-Wheeler Aligner (BWA) MEM algorithm. First, FASTA sequences from the
reference genomes were treated as databases and indexed using the BWA index
command:
bwa index -a is (1)
The algorithm used for indexing was IS, which is the default algorithm and works well
with our small databases of GI sequences. All output files were classified first by their
GI and then by their snumber. SAM files were generated as output, and were
transformed into BAM files using SAMtools23 "view" and "sort". The command used for
this purpose was:
bwa mem -M -t 16 (1) (2) (3) | samtools view -Sbh -F 4 - | samtools sort - (4)
This way, BAM files were automatically created. (1) is the reference genome database,
(2) is the read FASTQ file, (3) is the mate FASTQ file if the read is not a singleton, and
(4) is the output name of the BAM file. In order to save disk space, only mapped reads
were kept (samtools view -F 4 option, which discards unmapped reads).
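For batch submission over many GI-snumber combinations, the two command lines above can be assembled programmatically. The following Python sketch only builds the command strings, using exactly the flags quoted in the text; the function names and file names are hypothetical. Note that `samtools sort - (4)` is the older output-prefix syntax shown in this thesis, and that more recent SAMtools releases expect the output file to be given with `-o` instead.

```python
import shlex


def bwa_index_command(reference_fasta):
    """Command line that indexes one reference FASTA with the IS algorithm."""
    return "bwa index -a is {}".format(shlex.quote(reference_fasta))


def bwa_mem_command(reference_fasta, reads_fq, output_prefix, mate_fq=None, threads=16):
    """Alignment pipeline for one read file (plus its mate, unless it is a singleton)."""
    inputs = [reference_fasta, reads_fq]
    if mate_fq is not None:
        inputs.append(mate_fq)  # paired-end case: add the mate file
    quoted_inputs = " ".join(shlex.quote(path) for path in inputs)
    return (
        "bwa mem -M -t {threads} {inputs} "
        "| samtools view -Sbh -F 4 - "
        "| samtools sort - {prefix}"
    ).format(threads=threads, inputs=quoted_inputs, prefix=shlex.quote(output_prefix))


# Hypothetical file names, for illustration only
print(bwa_index_command("14324120.fasta"))
print(bwa_mem_command("14324120.fasta", "s001_1.fq", "14324120_s001", mate_fq="s001_2.fq"))
```

Building the strings separately from executing them makes it easy to write them into job scripts for a scheduler.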
The BAM files of each snumber were then combined and piled up against the
reference with SAMtools "mpileup". The command used for this purpose was:
samtools mpileup -f (1) -B -Q0 -d 10000000 (n) > (2).pileup
As a result, a single text file in Pileup format was created for each GI-snumber
combination. (1) is the reference genome database, (n) are all the mapped read (BAM)
files of a given snumber, and (2) is the output name of the pileup file.
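A pileup file can be reduced to a per-position depth vector with a few lines of Python. The sketch below assumes the usual tab-separated pileup layout (reference name, 1-based position, reference base, then a depth/bases/qualities triplet for each input BAM file) and assumes that positions missing from the file have zero depth; the function name is hypothetical.

```python
def parse_pileup(lines, reference_length):
    """Return a list with the total read depth at each reference position."""
    depths = [0] * reference_length  # positions missing from the pileup stay at 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        position = int(fields[1])
        if not 1 <= position <= reference_length:
            continue  # ignore positions outside the stated reference length
        # Depth columns sit at indices 3, 6, 9, ... (one triplet per BAM file)
        total = sum(int(fields[i]) for i in range(3, len(fields), 3))
        depths[position - 1] = total
    return depths


# Made-up pileup lines for a 6 bp reference and two BAM files
pileup = [
    "ref\t2\tA\t2\t..\tII\t1\t.\tI",
    "ref\t5\tG\t1\t,\tI\t0\t*\t*",
]
print(parse_pileup(pileup, 6))  # [0, 3, 0, 0, 1, 0]
```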
Circos plots
Circos software uses configuration files in order to draw a plot. These configuration
files can contain different variables that will describe the plot configuration: plot size,
text size, colors, etc. The configuration files also point out paths to the parsed data files,
whose information will be displayed in the plot.
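As an illustration of what such a configuration file can look like, the following Python sketch writes a minimal Circos configuration with one ideogram and any number of histogram tracks. It follows the layout of the Circos quick-start examples as far as I recall them; the paths, radii and colors are placeholders, and the actual files generated by the thesis script are likely to contain more settings.

```python
def circos_config(karyotype_path, histogram_tracks):
    """Build the text of a minimal Circos configuration file.

    histogram_tracks is a list of (data_file, r0, r1, color) tuples, where
    r0 and r1 are the inner and outer radii of the track (for example "0.70r").
    """
    lines = [
        "karyotype = {}".format(karyotype_path),
        "",
        "<ideogram>",
        "<spacing>",
        "default = 0.005r",
        "</spacing>",
        "radius = 0.9r",
        "thickness = 20p",
        "fill = yes",
        "</ideogram>",
        "",
        "<plots>",
    ]
    for data_file, r0, r1, color in histogram_tracks:
        lines += [
            "<plot>",
            "type = histogram",
            "file = {}".format(data_file),
            "r0 = {}".format(r0),
            "r1 = {}".format(r1),
            "fill_color = {}".format(color),
            "</plot>",
        ]
    lines += [
        "</plots>",
        "",
        "<image>",
        "<<include etc/image.conf>>",
        "</image>",
        "<<include etc/colors_fonts_patterns.conf>>",
        "<<include etc/housekeeping.conf>>",
    ]
    return "\n".join(lines) + "\n"


# Placeholder paths: one accumulated (green) and one per-sample (orange) track
config_text = circos_config(
    "data/karyotype.txt",
    [("data/total.hist.txt", "0.80r", "0.88r", "green"),
     ("data/s001.hist.txt", "0.70r", "0.78r", "orange")],
)
print(config_text)
```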
A Python script was written to generate the Circos configuration files needed to draw
the plots. The program saved all configuration and data files of a given GI in a new
folder, which would also hold the images created afterwards. This script also parsed
the feature tables and GI FASTA files so that they could be used by Circos. Each GI
sequence and its features were drawn as a highlighted ideogram.
For each GI, the script calculated the accumulated read depth (coverage) over all
sample numbers (snumbers). It iterated through all the positions of the GI, adding up
the read counts of all snumbers at each position. The accumulated read depth from all
samples was displayed in every plot as a green histogram.
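The accumulation step amounts to a position-wise sum over the depth vectors of all snumbers. A minimal Python sketch, assuming one equal-length depth list per snumber (the function and variable names are hypothetical):

```python
def accumulate_depth(depth_by_snumber):
    """Sum read depth position by position over all snumbers of one GI."""
    vectors = list(depth_by_snumber.values())
    if not vectors:
        return []
    if len({len(vector) for vector in vectors}) != 1:
        raise ValueError("all depth vectors must have the reference length")
    # zip(*vectors) yields, for each position, the depths of every snumber
    return [sum(column) for column in zip(*vectors)]


# Made-up depths for a 3 bp reference
depths = {"s001": [0, 2, 1], "s002": [1, 0, 4]}
accumulated = accumulate_depth(depths)
print(accumulated)       # [1, 2, 5]
print(max(accumulated))  # 5
```

The maximum of the accumulated vector is a natural common scale for all histograms of the same GI.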
The read depth of each snumber at every position of the GI was shown as an orange
histogram labeled with its snumber. A maximum of 9 snumbers were displayed per plot,
so the script produced several plots for a GI that had reads mapped from more than 9
snumbers. In short, each plot contained a highlighted ideogram of a given GI, an
accumulated green histogram from all snumbers that were mapped against that GI, and
a maximum of 9 orange snumber histograms.
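Splitting the snumbers of a GI into plots of at most 9 tracks is a simple chunking operation. In the following Python sketch the sorting of the snumbers is an assumption made only to obtain a reproducible order, and the function name is hypothetical.

```python
def split_into_plots(snumbers, per_plot=9):
    """Group snumbers into consecutive batches of at most per_plot entries."""
    ordered = sorted(snumbers)  # assumption: any stable ordering would do
    return [ordered[i:i + per_plot] for i in range(0, len(ordered), per_plot)]


# 20 hypothetical snumbers give three plots with 9, 9 and 2 tracks
batches = split_into_plots(["s{:03d}".format(n) for n in range(1, 21)])
print([len(batch) for batch in batches])  # [9, 9, 2]
```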
Another script was written in order to automatically produce PDF files of the plots. This
script had three functions: (1) use the Circos software to draw the plots in PNG format,
(2) convert PNG files into PDF files, (3) merge all PDF files of a GI into a single PDF
file.
All merged PDF files were named after their GI, with a "_total.pdf" suffix, and can be
found in the Supplemental Material.
The Python scripts are annotated and can also be found in the Supplemental Material.
Script modification
The script used for writing the configuration files was modified in order to increase the
size of the histogram bars plotted by Circos. The script now iterates through all
positions of the reference genome and collapses those positions that have the same
read depth value. This way, each histogram bar can span several positions and is
therefore wider, instead of one bar being created per position. In addition, the script
now forces Circos to increase the size of the PNG file from 1000x1000 to 3000x3000
pixels.
Ultimately, these changes increased the number of positions that can be represented
in the plots when the genome studied is large.
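The collapsing step can be sketched in Python as a run-length encoding of the depth vector. This sketch assumes that only adjacent positions with equal depth are merged, and it writes each run as one data line of the form "chromosome start end value", which is the layout Circos uses for histogram data; the identifiers are hypothetical.

```python
def collapse_depths(depths):
    """Merge consecutive positions with equal depth into (start, end, depth) runs."""
    runs = []
    for position, depth in enumerate(depths, start=1):
        if runs and runs[-1][2] == depth:
            runs[-1][1] = position                   # extend the current run
        else:
            runs.append([position, position, depth]) # start a new run
    return [tuple(run) for run in runs]


def to_circos_lines(chromosome, runs):
    """Format runs as space-separated histogram data lines."""
    return ["{} {} {} {}".format(chromosome, start, end, depth)
            for start, end, depth in runs]


# Made-up depth profile: six positions collapse into three bars
runs = collapse_depths([0, 0, 3, 3, 3, 1])
print(runs)  # [(1, 2, 0), (3, 5, 3), (6, 6, 1)]
print(to_circos_lines("gi14324120", runs))
```

Fewer, wider bars mean fewer data lines for Circos to draw, which matters most for the longer reference genomes.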
Figure 3. A Circos plot example. This plot shows the proviral sequence of the Avian
leukosis virus (GI:14324120). In this case, the reference genome (outer ideogram) is
widely covered (except the envelope protein gene) by the accumulated read count
(green histogram) and by the reads of the different snumbers (inner histograms). CDS
regions are labeled in the outer radius of the ideogram, and highlighted inside it.
Results
Alignment and mapping with BWA
Viruses
All virus GIs were downloaded from the NCBI database and prepared for alignment
with the reads. Only those read-GI combinations that met the 200 base pair requirement
were aligned. BAM files for each GI were created and used for mapping with SAMtools
"mpileup", producing their respective pileup text file. Some pileups were empty,
meaning that none of the reads matched against the reference genome. These files were
marked and discarded.
Reads mapped successfully against 42 of the 90 (~47%) GIs from the list of possible
cancer drivers. Some GIs, like those corresponding to Human parvoviruses, had a high
number of reads mapped from several sequencing libraries, but in most cases the GIs
had few reads mapping to them.
In contrast, 115 of 159 (~72%) GIs from the contamination list had at least one read
mapped to them. Compared to the previous virus list, the majority of these GIs, such as
the Avian leukosis virus or Parvo-like virus GIs, had a very large number of reads
mapped from different samples.
It is notable that some of the GIs could not be mapped, even though all of them had
previously been found in the BLASTn search. This conflict led us to repeat the alignment
of the reads against the reference genomes. A random sample of the virus GIs that
could not be mapped was re-aligned with the BWA MEM algorithm, this time without the
200 base pair threshold. Still, none of the reads could be mapped. We believe that the
discrepancy originates from the different tolerance to sequence changes of the two
methodologies. BLASTn allows for the exploration of distantly related sequences,
whereas BWA is optimized for the rapid identification of sequences that are highly
similar to a reference genome.
Propionibacterium acnes
The human-depleted reads of the selected snumbers were aligned and mapped against
12 P. acnes complete genomes. Pileup files were created and checked as was done
with the virus genomes. Again, some pileup files were empty, and were thus marked
and discarded. All genomes had reads successfully mapped, so all the GIs could be
plotted.
Circos plots
Circos plots were drawn using their respective configuration files, and final PDF files
were created. The PDF of a given GI thus contains one or more plots. Each plot has a
minimum resolution of 1000x1000 pixels and a maximum of 3000x3000 pixels,
depending on the size of the GI.
Each plot consists of a circular ideogram that represents the sequence of the GI (Figure
3). The ideogram is filled with the CDS regions parsed from the GI feature table. These
CDS regions are highlighted with different colors to facilitate visualization, and each
CDS product name is labeled in the outer radius of the circular plot. In addition, ticks
were added to display a base pair scale, making it easier to recognize regions of the
sequence.
Several histograms are represented in the inner area of the plot. The green histogram
corresponds to the accumulated read count for each position of the sequence. It is
unique for a given GI and is therefore the same in all of its plots. Orange histograms
represent the depth of coverage for each snumber at each position (binned for
representation). The maximum height of the histograms is the highest read count value
found in the accumulated histogram. It is important to note that even if some snumber
histograms are just thin lines, they still show full read coverage of the reference
genome, which would mean that this genomic sequence is present in that snumber.
Viruses suspected of being cancer drivers
48 virus GIs were successfully plotted. As expected from the mapping of the reads,
most reference genomes had a low number of mapped reads. However, some viruses
were more abundant among the samples. For example, John Cunningham virus (JC
virus) was present in some bladder cancer urine samples (Figure 4.a), and the Human
parvovirus B19 (GI 291045153) was present in melanoma and in basal cell carcinoma
samples (Figure 4.b). Human papillomaviruses were observed either in melanoma (GI
238623442) or in oral cavity samples (GI 238623442). On the other hand, only two of
the five complete genomes of Human adenovirus strains (GI 317140354 and GI
602243850) were present in cancer samples, whereas the hexon gene of different
Human adenovirus strains was consistently found in many cancer samples. It is notable
that the fiber protein genes of the complete genomes of the Human adenovirus strains
were missing (Figure 4.c). One of four Human herpesvirus types (Human herpesvirus 6,
GI 587652027) was found in samples of mycosis fungoides (Figure 4.d), which is the
most common form of cutaneous T-cell lymphoma. Merkel cell polyomavirus (GI
560940486) was observed in several melanoma samples.
Figure 4. (A) John Cunningham virus complete genome (GI 2246606). The read
coverage value is high for almost every position. (B) Human parvovirus complete
genome (GI 291045153). More sporadic reads can be observed. (C) Complete
genome of Human adenovirus C strain DD28 (GI 602243850). The fiber protein gene
is not present (marked zone). (D) Complete genome of Human herpesvirus 6A isolate
GS. This strain seems to be present in some snumber samples, although the reads are
spread along the sequence.
A Bufavirus-2 strain (GI 395318841) seemed to be present in two melanoma snumber
samples, since several reads mapped at positions distributed throughout the genome
(Figure 5).
For the rest of the viruses, only one or two plots were made (fewer than 18 snumbers).
In addition, reads showed very low coverage in most cases, which would mean that
those sequences were not present in the sample. This reduced the list of possible
candidates to a small number of viruses.
Viral contamination
115 GIs from viruses were successfully plotted. In general, a high number of snumbers
per reference genome could be observed. In many cases the snumber reads almost
fully covered their reference genome, which suggests that the studied GI is actually
present in that snumber sample. This was observed for most Avian leukosis and Parvo-like
viruses.
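The idea that a GI is present when its reads almost fully cover the reference can be expressed as a breadth-of-coverage fraction. The following is only a minimal sketch, not the script used in this work: the function name `breadth_of_coverage` and the `min_depth` parameter are illustrative assumptions, and the input is assumed to be a list with one read depth value per reference position.

```python
def breadth_of_coverage(depths, min_depth=1):
    """Return the fraction of reference positions covered by at least min_depth reads.

    `depths` is assumed to hold one read depth value per position of the
    reference sequence (hypothetical input layout, for illustration only).
    """
    if not depths:
        return 0.0
    covered = sum(1 for depth in depths if depth >= min_depth)
    return covered / len(depths)


# Toy example with made-up depth values: 3 of 5 positions are covered.
toy_depths = [0, 2, 5, 0, 1]
fraction = breadth_of_coverage(toy_depths)
```

A value close to 1 would correspond to the "almost fully covered" case described above, while a value close to 0 would correspond to the sporadic coverage seen for absent sequences. Any cut-off between the two is a judgment call and is not defined in this study.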
In addition, some interesting patterns could be observed. For example, the genome of
Avian leukosis virus was consistently found in many samples (Figure 3). Surprisingly,
its envelope protein gene was missing in most cases. The same pattern was found in
Rous Sarcoma Virus: the virus could be found in many samples, but its tyrosine kinase
gene was missing (Figure 6).
Figure 5. Complete CDS regions of the Bufavirus-2 strain BF.39 NS1, putative VP1,
hypothetical protein, and VP2 genes. Although the sequence is not fully covered,
the distribution of the reads could mean that it is actually present in the samples.
Figure 6. Rous Sarcoma Virus (GI 3003000). Almost all of its genome is covered in
most of the snumbers. However, its tyrosine kinase gene (in light blue) is missing.
Some particular observations were made, for example, for GI sequences 145314587 and
208928. GI 145314587 corresponds to a Semliki Forest virus expression vector
(Figure 7.a), which was used in a neurobiological study24. This vector includes a unique
CDS annotated region (a non-structural protein) that was not covered by the reads,
whereas the rest of the vector was present in basal cell carcinoma and in B-lymphoma
cell lines. On the other hand, GI 208928 corresponds to a synthetic construct used to
study the promoter activity of the regulatory signals contained in the v-myb oncogene of
Avian myeloblastosis virus25. These regulatory signals were inserted upstream of the
herpes simplex type 1 thymidine kinase gene. This GI can be observed in more than 150
snumber samples, but it is not fully covered. The thymidine kinase gene was not mapped
by reads from any of the samples, although a small region at the start of this gene seemed to be
more or less covered, depending on the snumber observed (Figure 7.b).
Torque Teno Virus (TTV) and SEN virus were found in many Chronic Myelogenous
Leukemia (CML) samples. These viruses are known to be spread worldwide and
have been detected in different types of biological samples, such as blood, saliva and
feces26, which would explain their presence in CML samples.
Figure 7. (A) Semliki Forest virus expression vector (GI 145314587). Only the actual
vector is present in the samples. Red CDS region corresponds to a non-structural
protein. (B) Avian myeloblastosis virus v-myb oncogene promoter region fused to
HSV-1 thymidine kinase gene (GI 208928). Red CDS region corresponds to the kinase
gene.
Nevertheless, there were some exceptions. For example, Acanthamoeba viruses
appeared to be sporadically covered, with very few reads mapped along the sequence.
This also happened in virus families like Megaviridae and Mimiviridae (Figure 8). All
these viruses are known large DNA viruses, and their complexity might be the reason why
they could not be found in these samples.
Propionibacterium acnes
All 12 complete genomes of P. acnes had at least one snumber read correctly mapped
and were then successfully plotted. In these plots, a scattered and dispersed
distribution was observed along the snumber histograms, even in those that seemed to
fully cover the reference genome. To check whether this scattered pattern was due to a
Circos error, we compared the positions actually represented in the plots with
the positions that should have been represented. This test showed that many positions
were missing, so the script that parsed the configuration files was modified, and the
plots were repeated (more details in Methods). The number of sequence positions
represented by histograms in the old and new plots was compared using the Circos
debugging feature. The new plots displayed an average of ~121% more positions than
the previous ones, and almost 219% more positions in some cases (Figure 9).
Figure 8. Large DNA viruses were not present in the samples. (A) Complete genome
of Acanthamoeba castellanii mamavirus strain Hal-V (GI 351737110). (B) Complete
genome of Megavirus chiliensis (GI 350610932). A few sporadic reads can be
observed. These plots also show the high complexity of these genomes.
Most importantly, two or three regions with disproportionately high read depth were
observed in the plots, depending on the strain studied (Figure 10). The ideograms were
white at these exact positions, which meant that there were no CDS features. A check of
the feature tables of the P. acnes strains revealed that these positions correspond to
the ribosomal RNAs 5S, 16S and 23S.
The plots showed that reads covered P. acnes genomes in skin-related samples like
breast and oral cavity cancer, myelomas, melanomas and basal cell carcinomas, and also
in optic neuritis samples. In contrast, there were small or nonexistent traces of
P. acnes in many other sample types, like blood samples and pancreatic and colon cancer.
Figure 9. Improvement of the plot quality for Propionibacterium acnes plots. (A)
Before collapsing consecutive positions with the same read depth value. (B) After
collapsing consecutive positions with the same read depth value. The quality of the plot
and the number of histograms displayed increased greatly.
Figure 10. Some accumulated regions were observed in all P. acnes genomes.
Discussion
Viral cancer drivers
We observed the presence of some viruses known to be related to cancer development, like
Human papillomavirus and Merkel cell polyomavirus. These viruses acted as a kind of
positive control, since they are more likely to be found in cancer samples. Human
papillomavirus was found in oral and melanoma samples, which makes sense since it is
known to infect keratinocytes of the skin and mucous membranes. Merkel
cell polyomavirus was also found in melanoma samples. This virus is suspected to
cause about 80% of Merkel cell carcinomas (a rare and aggressive form of skin cancer)27,
so it is reasonable to find this virus in skin-related samples. Human herpesvirus 6 was
found in cancer samples, specifically in mycosis fungoides samples. This type of
herpesvirus has already been associated with Hodgkin's lymphomas28, so its presence is
not a surprise.
The presence of Bufavirus in melanoma samples was not expected, since it has mostly
been found in feces samples of patients with severe diarrhea29. This virus could be a new
candidate for further studies to check its true relation to the melanoma samples.
We also observed the presence of the JC virus in bladder cancer urine samples. JC virus
is a type of human polyomavirus that can infect kidney epithelial cells30, which would
explain its presence in the urine samples. As far as we know, this virus has not been
related to bladder cancer, and it could be another interesting candidate for further
studies.
Nevertheless, half of the GIs studied had no reads mapped by the BWA aligner. In
addition, many of the remaining GIs were poorly covered and were therefore probably not
present in the samples. However, this result was expected, since it has been estimated
that only about 20% of cancer cases are caused by viruses and other microorganisms.
Virus contamination
From a total of 159 GI sequences, 115 had at least one snumber read correctly mapped.
Different GIs representing the Avian Leukosis virus constituted the most abundant
sequences in the list of contaminant viruses, but this overrepresentation could bias the
findings. Other viruses, such as Rous sarcoma virus and some Parvoviridae viruses, were
also present in many different cancer samples. Curiously, these viruses had in common
that some of their genes were not mapped, although no explanation was found. On the
other hand, SEN virus and Torque Teno Virus could also be good candidates for
contaminant entities.
We could also observe that some GI sequences were only partially covered because they
were synthetic constructs instead of real viruses. In these cases, only the non-synthetic
part of the sequence was actually present in the samples. We suggest that the list of
GIs should be carefully checked before analyzing all the data. This would save
human and computing resources, and only the sequences of interest would be studied.
Moreover, some of these sequences corresponded to partial genomes or just single
genes. The presence of partial genomes or single genes in the samples would not tell
much about the real abundance of the actual organism, and thus they were not as
relevant as the complete genome sequences. Furthermore, not all of these sequences
could be found among the samples, since the reads did not fully (or almost fully) cover
the reference genome. It should be noted that we might not have found many of the GIs
because their sequences can be very variable and complex, and therefore the reads did
not map against them. Since the BWA alignment algorithm was very
strict, this idea is conceivable, and could be the case for large DNA virus families like
Mimiviridae and Megaviridae.
Despite these obstacles, the studied sequences comprised a wide variety of viruses. This
allowed us to draw different conclusions and point out some contaminant candidates.
The most noticeable case is the Avian Leukosis Virus, whose strains were consistently
found in more than 250 snumber samples. With all this information, a database
containing all viruses suspected of being contaminants could be made. This would be a
useful resource, since distinguishing between contaminant entities and real
disease-causing agents is a hard task for researchers. It would not be the first time that
a huge amount of time and resources is spent on studying a possible disease-causing agent
that is, unfortunately, a contaminant. Nevertheless, these findings have not been published
yet and the data are still confidential, so this idea cannot be carried out for now.
Propionibacterium acnes
Missing information
When the P. acnes plots were checked for the first time, snumber histograms appeared
as scattered dots, although several snumber reads seemed to fully cover the P. acnes
genomes. We realized that many histograms were not drawn by Circos, even though the
read depth data was correctly parsed. We then learned that we were not using the Circos
software the way it was designed. The problem was that we did not parse the read
depth information in the best way, leading Circos to create one histogram per position of
the genome sequence. This is not a problem when the sequence is small and there is
enough space inside the plot to display every histogram. However, if the
genome is large (the P. acnes genome size is approximately 2.5 Mb), not all histograms can
be displayed. We drew a logical conclusion from this fact: if a histogram is
smaller than a pixel of the plot, the histogram cannot be drawn. Thus, the
solution was to create bigger histograms, that is, to include more sequence positions in
each histogram. Consequently, we made several modifications to the script that writes
the configuration files. The script now collapses consecutive positions that have the
same read depth value, and therefore bigger histograms can be drawn. The results of
these modifications can be seen in Figure 9, and they showed an increase in the number
of histograms drawn in the plot, greatly enhancing visualization.
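The collapsing step described above can be sketched as a run-length encoding of the per-position read depth. This is an illustrative sketch only and not the script provided as Supplemental Material: the function names, the 1-based coordinates and the identifier `gi_example` are assumptions, and the output lines follow the `id start end value` layout that Circos histogram data files use, as far as we understand it.

```python
from itertools import groupby


def collapse_depth(depths, first_position=1):
    """Collapse consecutive positions with the same read depth into runs.

    Returns a list of (start, end, depth) tuples with inclusive coordinates.
    `depths` is assumed to hold one read depth value per sequence position.
    """
    runs = []
    position = first_position
    for depth, group in groupby(depths):
        length = sum(1 for _ in group)  # number of consecutive equal values
        runs.append((position, position + length - 1, depth))
        position += length
    return runs


def to_histogram_lines(sequence_id, runs):
    """Format runs as 'id start end value' text lines (assumed data layout)."""
    return [f"{sequence_id} {start} {end} {depth}" for start, end, depth in runs]


# Toy example with made-up depths: 8 positions collapse into 4 intervals.
per_position = [0, 0, 3, 3, 3, 7, 0, 0]
runs = collapse_depth(per_position)
lines = to_histogram_lines("gi_example", runs)
```

With this approach, a long stretch of identical read depth becomes a single wide histogram bar instead of many bars narrower than one pixel, while the read depth values themselves are kept unchanged.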
Presence of P. acnes in cancer samples
The Circos plots showed evenly distributed reads, leading to almost full coverage of P.
acnes in many samples, which indicates the presence of this bacterium in these samples.
On the other hand, more sporadic coverage was observed in the rest of the samples,
showing lower levels of P. acnes.
P. acnes was present in many skin-related cancer samples. Antiseptic sterilization is a
medical routine in surgical procedures, like skin sample collection. Despite this
antiseptic treatment, not all microorganisms are killed, and it may not affect the
bacteria present in the deep layers of the skin, like P. acnes13. For example, basal cell
carcinoma is the most frequent type of skin cancer, and it arises in the deepest layer of
the epidermis. If bacteria are not successfully removed before sampling, it is possible
that P. acnes will contaminate the sample. This hypothesis can be applied to breast and
vulva cancer, as well as melanomas.
The presence of P. acnes was also observed in optic neuritis plasma samples.
Propionibacterium acnes has been associated with optic neuropathies before31, and it
could be responsible for the inflammation of the optic nerves in our samples.
However, we do not have access to the medical records of the sampled patients, so we
cannot confirm this statement.
In summary, P. acnes is a widespread bacterium in the human skin microbiota, and it
has been associated with several medical conditions13. Nevertheless, its real participation
is still controversial, since it is a well-known contaminant of surgical wounds and
collected samples17. Thus, it is very difficult for researchers to differentiate between
contamination and a real disease-causing agent. In this particular study, we show the
abundant presence of this bacterium in several types of cancer samples, but we cannot
confirm that this microorganism participates in the development of cancer, since the
presence of P. acnes in these samples can easily be due to contamination. Nonetheless,
these findings point to P. acnes as a highly suspicious entity that should be studied
carefully using tissue-specific approaches to reveal its true role in the described diseases.
Utility of Circos plots
Circos plots have proven to be a useful tool to analyze possible patterns of presence of
the GIs studied, although several points could be improved. Firstly, the
ideograms and histograms are huge compared to the text displayed. This
makes the manual inspection of hundreds of plots a harder task than expected, since
zooming in and out of the plot consumes extra time. When this time is added up, it
becomes a relevant amount of effort. This problem could be partially solved if we could
create a list describing the order of the snumbers that appear in the plots. Unfortunately,
all configuration and pileup files were erased before this function was applied, and thus
it would be necessary to repeat the whole bioinformatics pipeline, which was not done.
Secondly, CDS labels and highlights start to overlap when the GI represented is
crowded with CDS regions in the same area. This is an obstacle when trying to identify
which region each CDS label belongs to. This problem increases when there is not enough
space in the plot to display all the labels in a region of the ideogram. Circos is not
designed for this purpose, so no solution is satisfactory for now.
Thirdly, we can still miss information when the GI represented exceeds 1 Mb, as
happened with the sequences of P. acnes. As discussed before, histograms are not
displayed if their size is smaller than the size of a pixel. One solution could be to
convert the read depth value into a binary value: presence or absence of the position. In
this case, however, we would lose the coverage information, which defeats the purpose of
the histograms. A less drastic solution would be to divide the sequence into intervals
and represent in the histogram the mean read depth of each interval. This way, each
histogram will span a larger number of sequence positions, and therefore its size
will be bigger. However, this option was not implemented, since our collaborators were
satisfied with the Circos plots.
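Since the interval-based alternative was not implemented, the following is only a hypothetical sketch of how it could look. The function names, the fixed interval size and the value of 3000 drawable pixels are assumptions made for illustration; the actual number of pixels available along the ideogram depends on the plot configuration.

```python
import math


def min_bin_size(genome_length, drawable_pixels):
    """Smallest interval size (in positions) so that each bar spans at least one pixel.

    `drawable_pixels` is a hypothetical estimate of the pixels available
    along the ideogram; it is not a value taken from this study.
    """
    return math.ceil(genome_length / drawable_pixels)


def bin_mean_depth(depths, bin_size):
    """Split per-position depths into fixed-size intervals and average each one.

    Returns (start, end, mean_depth) tuples with 1-based inclusive coordinates.
    The last interval may be shorter than bin_size.
    """
    if bin_size < 1:
        raise ValueError("bin_size must be at least 1")
    bins = []
    for offset in range(0, len(depths), bin_size):
        window = depths[offset:offset + bin_size]
        bins.append((offset + 1, offset + len(window), sum(window) / len(window)))
    return bins


# Toy example with made-up depths and an interval size of 2 positions.
toy_bins = bin_mean_depth([2, 4, 6, 8, 10], 2)

# For a genome of about 2.5 Mb and an assumed 3000 drawable pixels.
suggested = min_bin_size(2_500_000, 3000)
```

Unlike collapsing runs of identical values, averaging over intervals loses the exact read depth of each position, but it guarantees a fixed number of bars regardless of how variable the coverage is.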
Differences between BLASTn and BWA alignment results in the virus genomes
Although we were able to find many viruses in the cancer samples, some of the virus
GIs had no reads mapped against them when using BWA. This would mean that these
viruses are not present in the cancer samples. However, all virus GIs could be found in
the BLASTn taxonomic assignment. This conflict can have two possible explanations:
(1) the GI-snumber read combinations did not meet the 200 base pair requirement, and
thus no reads were aligned for those GIs; (2) the algorithmic differences between
BLASTn and BWA MEM produce different results. BLASTn is a local aligner,
while BWA MEM attempts a semi-local alignment. In addition, the BLASTn search was
performed under permissive conditions, allowing a read to be taxonomically assigned to a
GI even if its similarity to the GI was low. On the other hand, BWA MEM is very
restrictive, and only reads with high similarity (nearly 100%) are counted as matched.
A sample of the viruses that could not be mapped was re-aligned without the 200
base pair threshold, without success. Thus, the algorithmic differences between
BLASTn and BWA are probably responsible for the contradictory results. A quick
check of the taxonomic characterization of a random sample of the unmapped reads
confirmed that these reads had low similarity values. These values are not high enough
for the BWA algorithm, and therefore those reads could not be mapped.
Lack of negative controls
Healthy human tissues were not available from hospitals when the cancer samples were
collected, and consequently there are no negative controls in this study. Therefore,
although we have found several GIs (P. acnes, Parvoviruses, Papillomaviruses, etc.)
within the samples, we cannot prove that they are associated with either contamination
or cancer, since we had no healthy samples to compare with. Nonetheless, other
arguments can be used to provide some insight into which entities could be
possible contaminants. If a virus or bacterium is a contaminant, it should be found in a
high number of different samples from different origins. In addition, the organism
should not be associated with a specific group of cancer samples, whereas it
could be related to a specific technique or method. This can be tested by associating
the presence of an entity with either cancer samples or snumbers. In addition, a possible
contaminant has probably not been described before as a disease-causing agent. More
arguments could be made, but ultimately, if the entity studied meets these conditions, it is
a good candidate to be a contaminant.
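The criteria listed above could be turned into a simple filter over the detection results. This is a hypothetical sketch rather than an analysis performed in this study: the function name, the thresholds `min_samples` and `min_sample_types`, the entity labels and the toy records are all invented for illustration and do not correspond to real findings.

```python
from collections import defaultdict


def contaminant_candidates(detections, min_samples=3, min_sample_types=2,
                           known_pathogens=frozenset()):
    """Flag entities found in many samples of different origins.

    `detections` is an iterable of (entity, snumber, sample_type) records.
    An entity is flagged when it appears in at least `min_samples` distinct
    snumbers, across at least `min_sample_types` distinct sample types, and
    is not listed in `known_pathogens`. All thresholds are arbitrary.
    """
    samples = defaultdict(set)
    sample_types = defaultdict(set)
    for entity, snumber, sample_type in detections:
        samples[entity].add(snumber)
        sample_types[entity].add(sample_type)
    return sorted(
        entity for entity in samples
        if len(samples[entity]) >= min_samples
        and len(sample_types[entity]) >= min_sample_types
        and entity not in known_pathogens
    )


# Invented toy records: entity_A is widespread, entity_B is tied to one sample type.
toy_detections = [
    ("entity_A", "s1", "melanoma"),
    ("entity_A", "s2", "bladder"),
    ("entity_A", "s3", "CML"),
    ("entity_B", "s1", "melanoma"),
    ("entity_B", "s4", "melanoma"),
    ("entity_B", "s5", "melanoma"),
    ("entity_C", "s2", "bladder"),
]
candidates = contaminant_candidates(toy_detections)
```

Such a filter only ranks candidates. Without negative controls it cannot distinguish a contaminant from an agent that is genuinely common across tissues.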
Conclusion
In this study, we were able to observe the abundance of several taxonomic entities in
different cancer specimens. We proposed some viral contaminants and suggested many
other possible viral drivers of cancer. We also analyzed the consistent presence of
Propionibacterium acnes in these cancer samples. We concluded that this bacterium is
a highly suspicious organism and should be the subject of further studies. However,
since this study was performed without healthy controls, we cannot confirm our
hypothesis. Despite this fact, we believe that the arguments and results shown in this
study are valid enough to justify further investigation in this direction, as it could
eventually lead to the discovery of new diagnostic methods or vaccines.
Acknowledgements
I would like to thank my colleagues at the Center for Biological Sequence Analysis, in
particular Jens Friis-Nielsen, Marisa Matey and John Damm Sørensen, for helping me
get started, and Professor Søren Brunak for the opportunity to join the Integrative
Systems Biology group. I also thank our collaborators at the Center for Geogenetics: Dr.
Anders Johannes Hansen, Dr. Kristin Rós Kjartansdóttir, Dr. Sarah Mollerup, Maria
Asplund and Dr. Tobias Mourier for their support and expertise.
I would like to especially thank my mentor José MG Izarzugaza for his guidance and
patience throughout this Master Thesis. This has been a unique experience that will
allow me to continue developing myself as a bioinformatician in Denmark.
Annexes. Supplemental Material
All Supplemental Material can be found in this link:
https://www.dropbox.com/sh/lz8gywt27jkym6h/AAD_n0ZOXlcBfSR4pJk4m4QUa?dl=0
SM1. Table containing all the snumbers and their respective samples.
SM2. GI lists of virus contaminants, viral cancer drivers and P. acnes. Each list
describes whether the GI is plotted and a short description of the sequence.
SM3. A compressed RAR file with all the drawn plots, divided according to the list they
belong to.
SM4. Python script that created and parsed the files necessary to draw the Circos plots.
SM5. Python script with three functions: (1) use Circos software to produce plots in
PNG format; (2) convert the PNG files into PDF files; (3) merge all PDF files into a
single PDF file per GI.
References
1. zur Hausen, H. The search for infectious causes of human cancers: Where and why.
Virology 392, 1–10 (2009).
2. Akagi, K. et al. Genome-wide analysis of HPV integration in human cancers reveals
recurrent, focal genomic instability. Genome Res. 24, 185–199 (2014).
3. Araujo, A. & Hall, W. W. Human T-lymphotropic virus type II and neurological
disease. Ann. Neurol. 56, 10–19 (2004).
4. Coffin, J. M. Retroviruses. (CSHL Press, 1997).
5. Lynch, R. M., Shen, T., Gnanakaran, S. & Derdeyn, C. A. Appreciating HIV type 1
diversity: subtype differences in Env. AIDS Res. Hum. Retroviruses 25, 237–248
(2009).
6. Robertson, D. L. et al. HIV-1 nomenclature proposal. Science 288, 492–505 (2000).
7. Segata, N. et al. Computational meta'omics for microbial community studies. Mol.
Syst. Biol. 9, 1-15 (2013).
8. Allander, T., Emerson, S. U., Engle, R. E., Purcell, R. H. & Bukh, J. A virus
discovery method incorporating DNase treatment and its application to the
identification of two bovine parvovirus species. Proc. Natl. Acad. Sci. U. S. A. 98,
11609–11614 (2001).
9. Reyes, G. R. & Kim, J. P. Sequence-independent, single-primer amplification
(SISPA) of complex DNA populations. Mol. Cell. Probes 5, 473–481 (1991).
10. Depledge, D. P. et al. Specific Capture and Whole-Genome Sequencing of Viruses
from Clinical Samples. PLoS One 6, e27805 (2011).
11. Vinner, L. et al. Investigation of Human Cancers for Retrovirus by Low-Stringency
Target Enrichment and High-Throughput Sequencing. Sci. Rep. 5, 13201 (2015).
12. Zavala, G. & Cheng, S. Detection and Characterization of
Avian Leukosis Virus in Marek's Disease Vaccines. Am. Assoc. Avian Pathol. 50,
209–215 (2006).
13. Perry, A. & Lambert, P. Propionibacterium acnes: infection beyond the skin. Expert
Rev. Anti. Infect. Ther. 9, 1149–1156 (2011).
14. Purdy, S. & de Berker, D. Acne vulgaris. Lancet 379, 361–372 (2012).
15. Liu, J. Draft Genome Sequences of Propionibacterium acnes Type Strain
ATCC6919 and Antibiotic-Resistant Strain HL411PA1. Genome Announc. 2, 6–7
(2014).
16. Holland, C. et al. Proteomic identification of secreted proteins of Propionibacterium
acnes. BMC Microbiol. 10, 230 (2010).
17. Rood, I. G. H., de Korte, D., Savelkoul, P. H. M. & Pettersson, A. Molecular
relatedness of Propionibacterium species isolated from blood products and on the
skin of blood donors. Transfusion 51, 2118–24 (2011).
18. Fassi Fehri, L. et al. Prevalence of Propionibacterium acnes in diseased prostates
and its inflammatory and transforming activity on prostate epithelial cells. Int. J.
Med. Microbiol. 301, 69–78 (2011).
19. Lysholm, F. et al. Characterization of the Viral Microbiome in Patients with Severe
Lower Respiratory Tract Infections, Using Metagenomic Sequencing. PLoS One 7
(2012).
20. Krzywinski, M. et al. Circos: An information aesthetic for comparative genomics.
Genome Res. 19, 1639–1645 (2009).
21. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler
transform. Bioinformatics 25, 1754–60 (2009).
22. Li, H. Aligning sequence reads, clone sequences and assembly contigs with
BWA-MEM. Oxford Univ. Press 00, 1–3 (2013).
23. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics
25, 2078–2079 (2009).
24. Ehrengruber, M. U. & Goldin, A. L. Semliki Forest virus vectors with mutations in
the nonstructural protein 2 gene permit extended superinfection of neuronal and
non-neuronal cells. J. Neurovirol. 13, 353–63 (2007).
25. Janine, C., Johann, S. & Bernard, P. A Cryptic Transcription Promoter in the myb
Oncogene of Avian Myeloblastosis Virus. Virology 150, 252–259 (1986).
26. Gallian, P. et al. TT virus: a study of molecular epidemiology and transmission of
genotypes 1, 2 and 3. 17, 43–49 (2000).
27. Kean, J. M., Rao, S., Wang, M. & Garcea, R. L. Seroepidemiology of Human
Polyomaviruses. J. Clin. Virol. 5, (2009).
28. Torelli, G. et al. Human herpesvirus-6 in human lymphomas: identification of
specific sequences in Hodgkin’s lymphomas by polymerase chain reaction. Blood
77, 2251–2258 (1991).
29. Väisänen, E. et al. Bufavirus in feces of patients with gastroenteritis, Finland.
Emerg. Infect. Dis. 20, 1077–80 (2014).
30. Kijpittayarit, S. & Razonable, R. R. JC Virus Infection After Transplantation:
Beyond the Classic Progressive Multifocal Leukoencephalopathy? Gastroenterol.
Hepatol. 3, 74–76 (2007).
31. Adesina, O. et al. Optic Neuropathy Caused by Propionibacterium acnes
Pachymeningitis. J. Neuroophthalmol. 1–4 (2014).
32. Mollerup S. et al. Propionibacterium acnes – disease causing agent or common
contaminant? Detection in diverse patient samples by next generation sequencing.
Submitted to J. Clin. Microbiol. (2016).