Transcriptome profile of the fathead minnow (Pimephales promelas) Wentworth S. A.1, Thede K.2, Aravindabose V.2, Monroe I.1, Thompson A. W.1, Garvin J.2, Packer R.1
1. Department of Biological Sciences, Columbian College of Arts and Sciences, The George Washington University, Washington, DC 2. Department of Physiology and Biophysics, School of Medicine, Case Western Reserve University, Cleveland, Ohio
Introduction Modern RNASeq technology provides a practical and accurate method for exploring the transcriptome and expressed proteins of non-model organisms. One such organism, the fathead minnow (Pimephales promelas), is a widely used toxicology model, however the lack of knowledge about its genetics has hindered further study into the interactions between this organism and its environment. In this study, we present the fully assembled and annotated gill transcriptome of the rosy red strain of P. promelas to establish it as a foundation for further genetic and physiological research.
Acknowledgements The authors would like to thank Adam Wong and Brian Haas for their technical support, Colonial One for providing the high-performance computing power required form this work, the Case Western Reserve University Genomics Core for the RNA sequencing, Dr. Tara Scully, and the Harlan Undergraduate Research Program for funding this project. *References available upon request
Most Represented KEGG PathwaysN
umbe
r of P
athw
ay H
its
0
25
50
75
100
125
150
175
200
PI3K
-Akt
sign
alin
gM
APK
signa
ling
Ras s
igna
ling
Endo
cyto
sisN
euro
activ
e in
tera
ctio
nCy
tosk
elet
al re
gula
tion
Rap
Cyto
kine
inte
ract
ion
Foca
l adh
esio
n
Ribo
som
eRN
A tra
nspo
rtPr
otei
n pr
oces
sing
in E
RPu
rine
met
abol
ism
Splic
eoso
me
Conclusion and Discussion Overall, the statistics of the assembled transcriptome denote that it is a high quality assembly. The coverage of the reads back to the transcriptome is ~270x, a coverage substantially higher than what is generally accepted (~30x)11. Furthermore, the N50 statistic—the contig length at which all longer contigs will comprise at least 50% of the total assembled bases of the transcriptome—at 2,271 bp is also higher than is normally found with other transcriptome profiles (more around ~1,500 bp). It may be noted that the BLAST similarity distribution shows a majority of sequences having less than 50% identity with known gene sequences, however this is not surprising and can be explained. First, the transcriptome was only BLASTed against the UniProt/SwissProt database which, while very extensive and complete, is manually reviewed and annotated unlike the TrEMBL or nr database. This database is less likely to include sequences which are not extensively reviewed and may not provide hits for more recently sequenced and closely related organisms. Additionally, because P. promelas was BLASTed against a database containing a majority of mammalian genes, and the closest relative of the fathead minnow which has been extensively annotated is Danio rerio and, even though it is in the same family of cyprinidae, many regions may simply be evolutionarily homologous, yet not sufficiently identical to provide high similarity scores. Finally, while these similarity scores are not of the best quality, the E-values, even those removed from the above graph, show remarkable significance and quality.
3.7%5.9%
16.8%
41.3%
30.1%
2.2%<25%25% to 35%35% to 50%50% to 70%70% to 85%85% to 100%
BLAST Similarity Distribution
Future Work In continuation of our findings we would like to explore changes in the transcriptome for fathead minnows acclimatized to the extremes of seasonal water conditions (~5ºC and ~25ºC) to determine what changes in the expression of various genes with relation to changing environmental conditions. Additional explorations into a variety of environmental factors such as salinity or acidity could add to the larger picture, and could even lead to explanations of how the fathead minnow maintains homeostasis as environmental conditions change.
8.6%
5.2%
7.7%
12.6%
32.7%
33.2%
1E-3 to 11E-3 to 1E-201E-20 to 1E-501E-50 to 1E-1001E-100 to 00
BLAST E-Value Distribution
Note: ~28,000 hits with an E-value greater than 1 were removed for graph clarity
Materials and Methods RNA Extraction and Sequencing:
1. Fathead minnows were acclimatized to 25ºC before tissue extraction.
2. Gill samples were perfused to clear blood prior to RNA extraction.
3. Stranded cDNA libraries were constructed from each RNA sample and
sequenced by an Illumina HiSeq 2500.
Transcriptome Assembly: 1. Quality control was performed with the fasts toolkit to trim bases that have
been shown to be biased in Illumina sequencing1.
2. Cleaned Reads were assembled de novo into transcripts using the Broad
Institute’s Trinity software package2,3 using the default parameters.
Transcriptome Annotation: 1. The assembled transcripts were annotated using the Broad Institute’s Trinotate
workflow2,3.
2. The transcripts were BLASTed4 against the Uniprot/SwissProt5 database.
3. Further annotation was done using software such as Transdecoder2,3, SignalP6,
tmhmm7, and RNAMMER9, which annotate other gene and transcript features.
4. Additional annotation using a WEGO level 2 gene ontology classification9 and
a Kyoto Encyclopedia of Genes and Genomes pathway classification10
determined the categories of gene function and protein pathways
represented in the transcriptome.
Coun
t
0
10000
20000
30000
40000
50000
200-300 300-500 500-1000 1000-2000 2000-3000 3000-50005000-10000 >10000
1832,844
8,25311,145
21,364
25,564
33,780
49,985 Length Distribution of Transcripts in Base Pairs
bp
Summary of Transcriptome DataTotal raw reads 470,813,940Total cleaned reads 470,812,874Length of reads(bp) 101Total length of reads 47.5 GbpTotal length of assembly 152 MbpNumber of Transcripts 153,118Transcript N50 2,271Mean length of transcript 996Median length of transcript 435Transcripts annotated 150,773Unique genes represented 72,334
Cellular Component
Num
ber o
f Gen
es
Molecular Function Biological Process
1
10
100
1000
10000
cell
cell
part
enve
lope
extra
cellu
lar r
egio
nex
trace
llula
r reg
ion
part
mac
rom
olec
ular
com
plex
mem
bran
e-en
clos
ed lu
men
orga
nelle
orga
nelle
par
tsy
naps
esy
naps
e pa
rtvi
ron
viro
n pa
rtsy
mpl
ast
antio
xida
ntau
xilli
ary
trans
port
prot
ein
bind
ing
cata
lytic
chem
oattr
acta
ntel
ectro
n ca
rrier
enzy
me
regu
lato
rm
etal
loch
aper
one
mol
ecul
ar tr
ansd
ucer
prot
ein
tag
stru
ctur
al m
olec
ule
trans
crip
tion
regu
lato
rtra
nsla
tion
regu
lato
rnu
trien
t res
evoi
rch
emor
epel
lant
trans
porte
r
anat
omic
al st
ruct
ure
form
atio
nbi
olog
ical
adh
esio
nbi
olog
ical
regu
latio
nce
ll ki
lling
cellu
lar c
ompo
nent
bio
gene
sis
cellu
lar c
ompo
nent
org
aniza
tion
cellu
lar p
roce
ssde
ath
deve
lopm
enta
l
esta
blish
men
t of l
ocal
izatio
ngr
owth
imm
une
syst
em p
roce
sslo
caliz
atio
nm
etab
olic
pro
cess
mul
tiorg
anism
pro
cess
mul
ticel
lula
r org
anism
al p
roce
sspi
gmen
tatio
nre
prod
uctio
nre
prod
uctiv
e pr
oces
sre
spon
se to
stim
ulus
rhyt
hmic
pro
cess
vira
l rep
rodu
ctio
nlo
com
otio
n
Gene Ontology Classifications
Results