balli: bartlett-adjusted likelihood-based linear model approach...
TRANSCRIPT
METHODOLOGY ARTICLE Open Access
BALLI: Bartlett-adjusted likelihood-basedlinear model approach for identifyingdifferentially expressed genes with RNA-seqdataKyungtaek Park1, Jaehoon An2, Jungsoo Gim3, Minseok Seo4,5, Woojoo Lee6, Taesung Park1,7 andSungho Won1,2,8*
Abstract
Background: Transcriptomic profiles can improve our understanding of the phenotypic molecular basis ofbiological research, and many statistical methods have been proposed to identify differentially expressed genes(DEGs) under two or more conditions with RNA-seq data. However, statistical analyses with RNA-seq data are oftenlimited by small sample sizes, and global variance estimates of RNA expression levels have been utilized as priordistributions for gene-specific variance estimates, making it difficult to generalize the methods to more complicatedsettings. We herein proposed a Bartlett-Adjusted Likelihood-based LInear mixed model approach (BALLI) to analyzemore complicated RNA-seq data. The proposed method estimates the technical and biological variances with alinear mixed-effects model, with and without adjusting small sample bias using Bartlkett’s corrections.
Results: We conducted extensive simulations to compare the performance of BALLI with those of existingapproaches (edgeR, DESeq2, and voom). Results from the simulation studies showed that BALLI correctly controlledthe type-1 error rates at various nominal significance levels and produced better statistical power and precisionestimates than those of other competing methods in various scenarios. Furthermore, BALLI was robust to variationof library size. It was also successfully applied to Holstein milk yield data, illustrating its practical value.
Conclusions;: BALLI is statistically more efficient and valid than existing methods, and we conclude that it is usefulfor identifying DEGs in RNA-seq analysis.
Keywords: Differentially expressed genes, RNA sequencing, Linear mixed model, Bartlett’s correction
BackgroundTranscriptomic profiles can improve our understandingof the phenotypic molecular basis of biological research,and many attempts have been made to identify differen-tially expressed genes (DEGs) by microarray analysis.However, microarray analysis often suffers from manysystematic errors, such as hybridization and dye-baseddetection bias, hampering the detection of true DEGs [1,2]. Recently, high-throughput sequencing technology has
markedly improved. RNA sequencing (RNA-seq), alsocalled whole-transcriptome shotgun sequencing, usesnext-generation sequencing to quantify the abundanceof transcripts with several desirable features, such as in-creased dynamic range and freedom from a priorichosen probes [3]. Furthermore, RNA-seq is robustagainst systematic errors and has therefore emerged as asuccessful alternative to microarray analysis [4].RNA-seq quantifies the numbers of reads aligned to
particular transcripts or genes, and various approacheshave been proposed to manage RNA-seq data [5]. Thereare two different types of statistical methods: read-count-based approaches and transformation-based ap-proaches. Read-count-based approaches assume that
© The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, andreproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link tothe Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
* Correspondence: [email protected] Program of Bioinformatics, Seoul National University, Seoul08826, South Korea2Department of Public Health Science, Seoul National University, Seoul08826, South KoreaFull list of author information is available at the end of the article
Park et al. BMC Genomics (2019) 20:540 https://doi.org/10.1186/s12864-019-5851-6
observed read counts follow negative binomial distribu-tion, and generalized linear regression with a logarithmas a link function is utilized. These approaches typicallyassume that variances include biological and technicalvariances; the latter indicates variance observed amongmeasurements of the same biological unit, and theformer indicates variance between different biologicalunits, such as different subjects or different tissues ofthe same subject. If technical replicates are analyzed, ob-served read counts from technical replicates have thesame means under the same conditions. Marioni et al.(2008) demonstrated that the data follow a Poisson dis-tribution, and variances in technical replicates are ex-pected to be the same as their means for each gene [6].However, if biological replicates are available, means andvariances of read counts differ among different biologicalunits. Bullard et al. (2010) carefully examined such vari-ability and concluded that the biological variances wereusually larger than technical variances, supporting thepresence of overdispersion [7]. Thus, negative binomialdistribution has often been utilized; edgeR [8] andDESeq2 [9] are such methods. Transformation-based ap-proaches assume that the transformed read counts foreach gene follow normal distribution. For example,voom calculated proportions of read counts for eachgene per subject, and the log-transformed values werethen assumed to follow normal distribution, assumingthat the relative proportion of technical variances be-comes smaller if the read count grows larger [10].Negative-binomial distributions for read counts and
normal distributions for log-transformation of counts permillion (CPM) successfully describe distributions of RNA-seq data. However, RNA-seq is relatively expensive com-pared with microarray, and thus, further adjustment hasbeen made to handle the problem of small sample size. Ifthe sample size is small, the estimated variance can havelarge standard errors, and thus, multiple methods that in-corporate prior knowledge have been proposed. For ex-ample, variances of read counts are assumed to bepositively related to their means, and their relationshipscan be estimated by comparing the means and variancesof read counts for all genes. This can often be utilized toshrink variance parameters [11, 12]. edgeR and DESeq2estimate the overall dispersion parameter for all genes andare then combined with gene-wise dispersion parametersfor each gene using empirical Bayesian rules. Voom as-sumes that the variances of log-transformed CPM (log-cpm) are functionally related to their means. Locallyweighted scatterplot smoothing (LOWESS) curves be-tween the mean and residual variances of genes are thenutilized to weight variance estimates of each gene.Existing methods shrink the variance estimate of each
gene toward global variance estimates or use varianceestimates based on the relationships between means and
variances. Such assumptions are often very useful if thesample sizes are small. However, there are multiple fac-tors that can distort these relationships, and if they areviolated, the performance of existing approaches can beaffected. For example, the quality of data is highlydependent on the preparation steps, and unexpectednoise, such as noise from different storage periods or se-quencing organization of samples, can occur duringpreparation steps. Moreover, read counts of cancer tis-sues are more heterogeneous than those of normal tis-sues, and biological variances can be affected by diseasestatus [13]. Thus, their effects can distort the relation-ship between technical and biological variances. Multiplestudies have shown that misspecified relationships canlead to substantial biases [14, 15]. For example, varianceestimators for random effects, which are assumed to fol-low normal distribution, can be seriously biased unlessthey are normally distributed [14]. General approachesthat are not sensitive to these problems are necessary.In this article, we present new methods for identifying
DEGs with RNA-seq data, namely, BALLI and LLI. Statis-tical analyses with log-transformed read counts are oftenmore powerful than other existing methods and are rela-tively insensitive to various errors [16, 17]. Thus, we con-sidered log-cpm as response variables and used linearmixed-effects models to estimate technical and biologicalvariance. Furthermore, Bartlett-adjusted likelihood ratiotests were applied to correct the small sample bias [18].By allowing model comparisons among different models,our models enabled robust analyses for various scenarios.For our simulation studies, artificial RNA-seq data weregenerated based on real data and negative binomial distri-butions. Our studies showed that the proposed methodperformed better than existing methods. The proposedmethods were applied to Holstein milk yield data at thefalse discovery rate (FDR)-adjusted 0.1 significance leveland uniquely produced significant results. The proposedmethods were implemented as an R package and are freelydownloadable at CRAN (http://cran.r-project.org) orhttp://healthstat.snu.ac.kr /software/balli/.
MethodsNotationsWe assumed that there were M different groups, andthe averages of the expressed read counts for each genewere compared among these groups. For case-controlstudies, M = 2. We assumed that there were nm subjectsin group m and denoted the total sample size by N; thus,we got N = n1 + … + nM. We defined dummy variablesfor subject i in group m by
xmi ¼ 1 if subject i is in group m0 o:w:
�;
Park et al. BMC Genomics (2019) 20:540 Page 2 of 14
m ¼ 1;…;M−1; i ¼ 1; 2;…;N :
A design matrix for group variables is defined by
X¼x11 ⋯ x M−1ð Þ1⋮ ⋱ ⋮
x1N ⋯ x M−1ð ÞN
0@
1A:
We assumed that indexes of all subjects were sorted inascending order of groups. Thus, the first n1 subjectswere in group 1; the second n2 subjects were in group 2;and so on. The effect of continuous variables can also betested, and then some or all of the columns of the designmatrix, X, become their realization. We assumed thatexpressed read counts were observed for G genes andwere denoted by rgi for gene g of subject i (g = 1, …., G).Then, the library size for subject i, Ri, was equivalent to
Ri ¼Xg
rgi . We transformed count data into the log-
transformed read counts per million (log-cpm) using thecpm function in the edgeR R package. If we denoted thenormalized Ri with the trimmed mean of the M-value[19] by R�
i , the log-cpm of gene g for subject i were de-fined by:
ygi ¼ log2rgi þ di
R�i þ 2 � di
� 106� �
;where di
¼ R�i
1N
XN
i¼1R�i
� 0:25… ð1Þ
and their vector Yg was defined as
Yg¼ yg1; yg2;…; ygN� �t
:
Technical and biological variances of ygiWe assumed that rgi followed a negative binomial distri-bution, and its mean and variance were μgi and μgi þ μ2giϕg , respectively. If we let the mean and variance of log2-rgi be λgi and s2gi , respectively, then var(ygi) can be ob-
tained by the first order approximation as follows [10]:
var ygi� �
≈ var log2rgi� �
¼ s2gi ≈ var μgi þrgi−μgiμgi
!¼ 1
μgi2var rgi� �
¼ 1μgi
þ ϕg… ð2Þ
Here, μ−1gi and ϕg indicate the variances attributable to
the technical and biological replicates, respectively. Bysecond order approximation, the technical variance, 1/μgi, can be expressed in terms of λgi and s2gi as follows:
μgi ¼ E rgi� � ¼ E 2 log2rgi
� �≈ E 2λgi þ loge2 � 2λgi � log2rgi−λgi
� �þ 12
loge2� �2 � 2λgi � log2rgi−λgi
� �2
¼ 2λgi � 1þ loge2 � E log2rgi−λgi� �þ 1
2loge2
� �2 � E log2rgi−λgi� �2� �
¼ 2λgi � 1þ 12
loge2� �2
s2gi
� �:
λgi and s2gi are functionally related, and both were esti-
mated with the method used for voom-trend [10] asfollows:
i. For all genes, g = 1,… , G, fit linear regressions, ygi¼ αg þ xtiβg þ ϵgi, and calculate ygi. Residual
variances are used as s2g . If environmental effects
affect ygi, then they should be included ascovariates.
ii. Calculate λg ¼ yg þ log2~R− log2106; where yg is
the average of ygi; ~R is the geometric mean of ðR�i
þ1Þ; and g = 1,… , G.iii. For ðλg ; s2gÞ; g ¼ 1;…;G, obtained from (i) and (ii),
fit LOWESS curve s1=2g on λg .
iv. Calculate λgi ¼ log2rgi ¼ ygi þ log2ðR�i þ 1Þ− log2
106 and apply LOWESS curve from (iii) to obtain
s1=2gi .
v. Calculate μgi by incorporating λgi and sgi to thefollowing equation:
μgi ≈ 2λgi � 1þ 12
loge2� �2
s2gi
� �… ð3Þ
Linear mixed-effects modelWe denoted a design matrix for nuisance effects, includ-ing an intercept by Z. We let bg and eg be vectors forrandom effects and measurement errors, respectively.Denoting a w ×w dimensional identity matrix by Iw, weconsidered the following linear mixed-effects model:
Yg ¼ Zαg þ Xβg þ bgþ eg ;bg∼MVNð0;ψgΣg;bÞ; eg∼MVNð0;Σg;eÞ;
Σg;b ¼μ−1g1 0 ⋯ 0
0 μ−1g2 ⋯ 0⋱
0 0 ⋯ μ−1gN
0BB@
1CCA;Σg;e ¼
σ2g1In1 0 ⋯ 0
0 σ2g2In2 ⋯ 0
⋱0 0 ⋯ σ2
gMInM
0BB@
1CCA:
Here, ψgΣg, b and σ2g indicate technical and biological
variances, respectively, according to Eq. (2), and the ran-dom effect, bg, and measurement error, eg, were used tomodel the technical and biological variances, respect-ively. The proposed linear mixed effects model may beconceptually useful for understanding the variance struc-ture of the RNA-seq data. Notably, elements of Σg, b are
Park et al. BMC Genomics (2019) 20:540 Page 3 of 14
obtained from Eq. (3), and ψg and σ2g ¼ ðσ2g1;…; σ2gM ) are
estimated. Equation (2) shows that ψg becomes 1, andwe assumed that σ2g1 ¼ … ¼ σ2gM . Then, our final model
becomes
Yg ¼ Zαg þ Xβg þ bgþ eg ;bg∼MVNð0;Σg;bÞ; eg∼MVNð0; σ2gIÞ;
Σg;b ¼μ−1g1 0 ⋯ 0
0 μ−1g2 ⋯ 0⋱
0 0 ⋯ μ−1gN
0BBB@
1CCCA;
which is equivalent to
Yg ¼ Zαg þ Xβg þ e′g ; e′g∼MVNð0;Σg;b
þ σ2g IÞ; σ2g ≥0:
Bartlett-adjusted profile likelihood ratio testsThe likelihood ratio test (LRT) is very flexible and canbe utilized for various purposes such as testing two ormore variables at once. However, statistical analyses withRNA-seq data often involve few samples, and the LRTstatistic has a bias with order Op(N
−1) to its null distri-bution. In such cases, an adjustment can be applied toreduce the bias, such as the Bartlett adjustment or Cox-Reid adjustment [18, 20]. As Zucker et al. (2000) showedthat the latter estimated p values more conservative thanthe former did in linear mixed models [21], we selectedthe Bartlett-adjusted LRT for identifying DEGs, whichreduces the order of bias to Op(N
−2) and controls type-1error rates well when the sample size is small [18]. If welet βg = (βg, 1, … , βg, M − 1)
t, Vg ¼ Σg;b þ σ2gI and θg ¼ ðαg ; σ2gÞ , the likelihood for the proposed linear mixed
model is
Lðθg ; βgÞ∝jVgj−12expð− 1
2ðYg−Zαg−XβgÞtVg
−1ðYg−Zαg−XβgÞÞ:
If we let θg0 be a maximum likelihood estimate (mle)under the parameter space for the null hypothesis H0:
βg = 0, and ðθg ; βgÞ be mles of (θg,βg) under the param-
eter space for null or alternative hypothesis, the LRT forthe null hypothesis H0: βg = 0 can be obtained by
LRg ¼ −2flogLðθg0; 0Þ−logLðθg ; βgÞg∼χ2ðd f¼ M−1ÞunderH0:
The Bartlett-adjusted LRT ( LR�g ) for gene g can be
expressed by
LR�g ¼
LRg
1þ Cg= M−1ð Þ � χ2 df ¼ M−1ð Þ under H0:
Cg can be obtained based on the results of Melo et al.(2009) [22], as follows:
Cg ¼ Dg−1 −
12Mg þ 1
4Pg−
12νgτg
� �:
Here, Dg, Mg, Pg, νg, and τg are scalars, and if we let
X′ = [I − Z(ZTVg−1Z)−1ZtVg
−1]X and _X0¼ZðZtVg
−1ZÞ−1ZtVg
−2X0, they are
Dg ¼ −12tr Vg
−2� �;
Mg ¼ 2tr X0tVg−1X0
� �−1X0tVg
−3X0− _X0tVg
−2X0� �� �
;
Pg ¼ tr X0tVg−2X0 X0tVg
−1X0� �−1� �2
!;
νg ¼ −tr ZtVg−1Z
� �−1ZtVg
−2Z� �
and
τg ¼ −tr X0tVg−1X0
� �−1X0tVg
−2X0� �
:
The forms of Dg, Mg, Pg, νg, and τg depend on thestructure of Vg, and counterparts to general Vg s areshown in Additional file 1.
Parameter estimationThe log-likelihood function for our final model is givenby
logLðθg ; βgÞ ¼ C−12logjVgj− 1
2ðYg−Zαg−XβgÞtVg
−1ðYg−Zαg−XβgÞ;C: someconstants:
θg and βg can be estimated by maximizing the log-
likelihood function. Then, if we let P = Vg−1 −Vg
−1(Z,X)((Z,X)tVg
−1(Z,X))(Z,X)tVg−1, the profile log-likelihood
of σ2g becomes
lPðσ2gÞ ¼ C−12logjVgj− 1
2Yg
tPYg :
Here, Vg is a function of σ2g , and σ2g can be obtained by
maximizing lPðσ2gÞ with Fisher’s scoring method. σ2ðmÞg at
the m step was updated by
σ2 mð Þg ¼ σ2 m−1ð Þ
g þ Io m−1ð Þ� −1
U σ2 m−1ð Þg
� �
Park et al. BMC Genomics (2019) 20:540 Page 4 of 14
, where Ioðm−1Þ ¼ Eð− ∂2lP
∂σ2ðm−1Þg ∂σ 2ðm−1Þ
g
TÞ and Uðσ2ðm−1Þg Þ
¼ ∂lP∂σ2ðm−1Þ
g
. We found that Fisher’s scoring method was
sometimes unsuccessful for estimating σ2ðmÞg , and in such
cases, we used Brent’s derivative-free method with theoptimize function in R [23]. It always converged and suc-cessfully estimated parameters, at least in our simula-tion. We assumed that σ2g was non-negative.
Vg¼Σg;b þ σ2g I , and if σ2g is equal to zero, Vg becomes
Σg, b. Then, βg and αg can be obtained by
ðαg
βgÞ ¼ ððZ;XÞtVg
−1
Z;X
!ÞðZ;XÞtVg
−1Y:
The Bartlett-adjusted LRT requires θg0 , which maxi-mizes the likelihood under the null hypothesis. Underthe null hypothesis, if we let P0 = Vg
−1 −Vg−1Z(ZtVg
−1Z)ZtVg
−1, the profile log-likelihood of σ2g becomes
lP0ðσ2gÞ ¼ C−12logjVgj− 1
2Yg
tP0Yg :
σ2g0 was estimated with Fisher’s scoring method, and if
we incorporated σ2g0 to Vg0 and denoted it as Vg0 , αg
could be obtained by
αg ¼ ðZtVg0−1ZÞZtVg0
−1Y:
DatasetsWe considered two real datasets consisting of samples fromunrelated Nigerian people and Holstein cattle, respectively.Nigerian subjects participated in the International HapMapProject and comprised 29 male and 40 female participants[24]. The read counts were downloaded from the ReCountwebsite [25]. Holstein data were obtained to identify genesassociated with milk yield and consisted of high- and low-milk-yielding groups, with nine and 12 samples per group,respectively [26]. Furthermore, parity and lactation periodswere available and were considered as covariates. We ob-tained the read counts from Gene Expression Omnibus(GSE60575). Based on count data, we generated simulationdata; the steps are described in Additional file 2.
ResultsSimulation studies with RNA-seq data from NigerianindividualsWe applied the proposed linear mixed models to the simu-lated data based on Nigerian individuals’ RNA-seq data andcalculated empirical type-1 error rates and statistical powerswith these models. The data were then compared withDESeq2 (v1.20.0), edgeR (v3.22.3), and voom (v3.36.5).
Table 1, Additional file 3, and Fig. 1 show results fromsimulation studies based on Nigerian individuals’ RNA-seqdata. Nigerian individuals’ RNA-seq data consisted of 52,580 genes, and after filtering genes whose total read countsacross samples were smaller than one-tenth of the samplesize, each replicate had around 10,000–10,500 genes. Em-pirical type-1 error rates and powers were estimated with50 replicates. Table 1 and Additional file 3 assumed δ = 0,and thus, their estimates indicated the empirical type-1error rates. For the proposed methods, we assumed thatψg = 1 and σ2
g1 ¼ σ2g2 , and the proposed methods with and
without Bartlett’s corrections are denoted as BALLI andLLI, respectively, for the remainder of this article. Accord-ing to Table 1 and Additional file 3, BALLI and voom al-ways controlled the nominal type-1 error rates correctly ifN was greater than or equal to 12. LLI also successfullycontrolled the nominal type-1 error rates if N was greaterthan or equal to 20. However, if N was less than 20, p valuesby LLI were inflated. edgeR showed the least performance,and the estimated type-1 error rates were always inflated at0.05, 0.01, and 0.005 nominal significance levels. Interest-ingly, DESeq2 tended to be conservative at 0.1 and 0.05 butliberal at 0.01 and 0.005 nominal significance levels. Thus,we could conclude that the proposed linear mixed modelwith Bartlett’s correction reasonably controlled the type-1error, and Bartlett’s correction was required if the samplesize was less than 20.Figure 1 shows estimated powers and precisions at the
FDR-adjusted 0.1 significance level when δ = 0.5σ or 1σ,and N = 12, 16, 20, 24, 28, 40, 64, or 68. Figure 1a and cshow the statistical power estimates, and Fig. 1b and dshow the precision. Precision indicates the proportions ofDEGs among genes for which FDR-adjusted p values areless than 0.1. According to Fig. 1a and c, LLI outperformedother methods in terms of power (Fig. 1a and c). However,it should be noted that this method showed worse precisionthan BALLI and voom if N was less than 40 (Fig. 1b and d),suggesting that LLI had larger false-positive rates than thoseof BALLI and voom when N was less than 40. The preci-sion of LLI was increased if N was sufficiently large. Interms of both power and precision, the best performancewas always obtained by BALLI. For example, when N = 20and δ = σ, the estimated power of BALLI was 0.597,followed by voom (0.548) and DESeq2 (0.399). The esti-mated precision of BALLI was 0.940, and those of voomand DESeq2 were 0.930 and 0.907, respectively. If N = 40and δ = 0.5σ, the estimated power and precision of BALLIwere 0.342 and 0.943, which were higher than those ofDESeq2 (0.205, 0.897) and voom (0.298, 0.932).Our method was also applied to simulation data based
on RNA-seq data from Holstein cows. The results weresimilar to those of the simulation data based on Nigerianpeople’s data. Additional file 4 shows that BALLI and
Park et al. BMC Genomics (2019) 20:540 Page 5 of 14
Table
1Estim
ated
type
-1errorrateswith
simulationdata
basedon
Nigerianpe
ople’sRN
A-seq
data.Estim
ated
type
-1errorratesby
BALLI,DESeq
2,ed
geR,LLI,andvoom
and
their95%
confiden
celevelswereestim
ated
forN=12
,16,20
and24
.aThetype
-1errorratesaremarkedby
bold
font
iftheir95%
confiden
celevelsinclud
eor
lower
than
the
nominalsign
ificant
levelα
BALLI
DESeq
2ed
geR
LLI
voom
BALLI
DESeq
2ed
geR
LLI
voom
αN=12
N=16
0.1
0.09
217a
(0.081
73,0.102
62)
0.08
221
(0.071
44,0.092
98)
0.10
536
(0.095
95,0.114
77)
0.12695
(0.11443,0.13947)
0.09
381
(0.083
06,0.104
56)
0.08
976
(0.080
33,0.099
19)
0.08
794
(0.078
27,0.097
61)
0.11070
(0.10223,0.11916)
0.11613
(0.10522,0.12704)
0.09
618
(0.086
83,0.105
52)
0.05
0.04
485
(0.038
14,0.051
55)
0.04
374
(0.036
33,0.051
15)
0.05
495
(0.048
77,0.061
14)
0.06636
(0.05767,0.07506)
0.04
473
(0.037
98,0.051
48)
0.04
329
(0.037
64,0.048
95)
0.04
672
(0.040
41,0.053
02)
0.05762
(0.05218,0.06305)
0.06072
(0.05355,0.06790)
0.04
645
(0.040
78,0.052
12)
0.01
0.00
899
(0.006
86,0.011
12)
0.01
208
(0.009
13,0.015
03)
0.01441
(0.01186,0.01697)
0.01662
(0.01328,0.01996)
0.00
860
(0.006
53,0.010
66)
0.00
796
(0.006
43,0.009
48)
0.01
202
(0.009
80,0.014
24)
0.01393
(0.01212,0.01575)
0.01349
(0.01118,0.01580)
0.00
850
(0.007
06,0.009
93)
0.005
0.00
456
(0.003
28,0.005
84)
0.00741
(0.00531,0.00952)
0.00891
(0.00709,0.01072)
0.00942
(0.00720,0.01163)
0.00
424
(0.003
01,0.005
48)
0.00
381
(0.003
05,0.004
58)
0.00695
(0.00555,0.00835)
0.00809
(0.00692,0.00925)
0.00714
(0.00575,0.00853)
0.00
397
(0.003
26,0.004
68)
αN=20
N=24
0.1
0.08
698
(0.076
89,0.097
07)
0.08
700
(0.076
53,0.097
47)
0.10962
(0.10028,0.11896)
0.10
564
(0.094
42,0.116
86)
0.09
289
(0.082
46,0.103
32)
0.08
480
(0.076
51,0.093
09)
0.08
744
(0.078
02,0.096
86)
0.10874
(0.10024,0.11724)
0.10
067
(0.091
33,0.110
00)
0.09
124
(0.081
96,0.100
51)
0.05
0.04
113
(0.034
55,0.047
71)
0.04
607
(0.038
70,0.053
44)
0.05774
(0.05147,0.06401)
0.05
408
(0.046
33,0.061
83)
0.04
517
(0.038
44,0.051
91)
0.03
919
(0.034
50,0.043
87)
0.04
518
(0.039
22,0.051
14)
0.05881
(0.05343,0.06419)
0.04
960
(0.044
02,0.055
19)
0.04
390
(0.038
49,0.049
31)
0.01
0.00
752
(0.005
48,0.009
55)
0.01
175
(0.008
90,0.014
60)
0.01381
(0.01152,0.01609)
0.01
157
(0.008
79,0.014
36)
0.00
848
(0.006
43,0.010
53)
0.00
647
(0.005
50,0.007
44)
0.01
050
(0.008
73,0.012
28)
0.01317
(0.01163,0.01472)
0.00
955
(0.008
17,0.010
93)
0.00
778
(0.006
48,0.009
07)
0.005
0.00
368
(0.002
48,0.004
89)
0.00
669
(0.004
75,0.008
63)
0.00789
(0.00643,0.00934)
0.00
605
(0.004
28,0.007
83)
0.00
420
(0.003
05,0.005
35)
0.00
293
(0.002
48,0.003
38)
0.00
575
(0.004
77,0.006
72)
0.00728
(0.00636,0.00819)
0.00
474
(0.004
00,0.005
47)
0.00
363
(0.002
95,0.004
31)
Park et al. BMC Genomics (2019) 20:540 Page 6 of 14
LLI controlled the nominal type-1 error rates if N ≥ 8and if N ≥ 20, respectively. Power estimates were highestfor LLI, but it always had a smaller precision value thanthose of BALLI and voom (Additional file 5).
Simulation studies with randomly generated RNA-seqdataRNA-seq data are generally known to follow negativebinomial distribution, and we conducted simulationstudies with RNA-seq data generated from negative
binomial distributions. First, we assumed that librarysizes were the same among subjects. The overalltrend of the estimated type-1 error rate was similarto that of simulation studies based on real RNA-seqdata. Estimated type-1 error rates by voom andBALLI usually maintained the nominal significancelevels (Table 2 and Additional file 6). P values ob-tained from LLI and edgeR tended to be inflated.DESeq2 generally showed deflation of type-1 errorrates at a 0.1 nominal significance level. Second, we
Fig. 1 Estimated powers and precisions with simulation data based on Nigerian people’s RNA-seq data. Statistical powers of BALLI, DESeq2,edgeR, LLI, and voom were estimated at FDR-adjusted 0.1 significance level when δ = 0.5σ or 1σ and N = 12, 16, 20, 24, 28, 40, 64, and 68.a Estimated power when δ=0.5σ. b Estimated precision when δ=0.5σ. c Estimated power when δ=1σ. d Estimated precision when δ=1σ
Park et al. BMC Genomics (2019) 20:540 Page 7 of 14
Table
2Estim
ated
type
-1errorrateswith
simulationdata
basedon
simulated
RNA-seq
data
from
negativebino
miald
istribution.Estim
ated
type
-1errorratesby
BALLI,DESeq
2,ed
geR,LLI,andvoom
andtheir95%
confiden
celevelswereestim
ated
forN=12
,16,20
,and
24.aThetype
-1errorratesaremarkedby
bold
font
iftheir95%
confiden
celevels
includ
eor
lower
than
theno
minalsign
ificant
levelα
BALLI
DESeq
2ed
geR
LLI
voom
BALLI
DESeq
2ed
geR
LLI
voom
αN=12
N=16
0.1
0.09
550a
(0.093
94,
0.09
706)
0.08
417(0.082
78,
0.08
556)
0.12172(0.11984,
0.12360)
0.13198(0.13030,
0.13367)
0.09
912(0.097
67,
0.10
057)
0.09
211(0.090
97,
0.09
325)
0.08
428(0.083
07,
0.08
549)
0.12028(0.11911,
0.12145)
0.11853(0.11729,
0.11977)
0.09
977(0.098
68,
0.10
086)
0.05
0.04
863(0.047
57,
0.04
969)
0.04
334(0.042
43,
0.04
425)
0.05604(0.05492,
0.05716)
0.07063(0.06926,
0.07200)
0.05
013(0.048
99,
0.05
126)
0.04
508(0.044
29,
0.04
587)
0.04
333(0.042
61,
0.04
404)
0.05616(0.05528,
0.05703)
0.06345(0.06239,
0.06452)
0.05
013(0.049
37,
0.05
089)
0.01
0.00
989(0.009
39,
0.01
039)
0.00
997(0.009
54,
0.01
040)
0.01267(0.01209,
0.01326)
0.01858(0.01798,
0.01919)
0.00
997(0.009
57,
0.01
037)
0.00
893(0.008
47,
0.00
939)
0.01
008(0.009
61,
0.01
056)
0.01276(0.01225,
0.01328)
0.01469(0.01419,
0.01519)
0.00
996(0.009
69,
0.01
024)
0.005
0.00
524(0.004
92,
0.00
557)
0.00549(0.00517,
0.00581)
0.00689(0.00650,
0.00729)
0.01043(0.00992,
0.01094)
0.00
488(0.004
56,
0.00
519)
0.00
445(0.004
14,
0.00
475)
0.00543(0.00515,
0.00572)
0.00682(0.00653,
0.00711)
0.00803(0.00759,
0.00846)
0.00
506(0.004
83,
0.00
528)
αN=20
N=24
0.1
0.09
198(0.090
51,
0.09
346)
0.08
419(0.082
82,
0.08
557)
0.11844(0.11718,
0.11971)
0.11073(0.10925,
0.11221)
0.10
023(0.099
08,
0.10
139)
0.09
276(0.091
56,
0.09
395)
0.08
521(0.083
86,
0.08
657)
0.11846(0.11686,
0.12005)
0.10867(0.10747,
0.10987)
0.09
963(0.098
33,
0.10
092)
0.05
0.04
463(0.043
76,
0.04
549)
0.04
296(0.042
05,
0.04
387)
0.05748(0.05625,
0.05871)
0.05780(0.05662,
0.05898)
0.05
048(0.049
36,
0.05
159)
0.04
468(0.043
94,
0.04
541)
0.04
268(0.041
88,
0.04
348)
0.06050(0.05953,
0.06147)
0.05518(0.05427,
0.05608)
0.04
907(0.048
15,
0.05
000)
0.01
0.00
861(0.008
12,
0.00
911)
0.00
986(0.009
36,
0.01
036)
0.01271(0.01204,
0.01338)
0.01328(0.01267,
0.01388)
0.01
016(0.009
54,
0.01
077)
0.00
844(0.008
10,
0.00
878)
0.00
940(0.008
99,
0.00
982)
0.01242(0.01203,
0.01281)
0.01212(0.01167,
0.01257)
0.00
957(0.009
07,
0.01
007)
0.005
0.00
457(0.004
25,
0.00
489)
0.00536(0.00501,
0.00572)
0.00688(0.00647,
0.00730)
0.00707(0.00661,
0.00753)
0.00
521(0.004
84,
0.00
559)
0.00
405(0.003
86,
0.00
425)
0.00
497(0.004
69,
0.00
525)
0.00662(0.00639,
0.00685)
0.00625(0.00595,
0.00655)
0.00
474(0.004
40,
0.00
507)
Park et al. BMC Genomics (2019) 20:540 Page 8 of 14
considered the effects of library size variance on stat-istical analyses. Data with unequal library sizes amongsubjects were generated by negative binomial distribu-tion whose mean parameters (agi) were the product ofmean estimates, under the equal library size assump-tion, and random numbers from U(u, 2 − u), whereu = 0.2, 0.4, 0.6, 0.8, or 1, and dispersion parameters
were estimated from ð0:2þ a−1=2gi Þ2 � δg , where 40=δg� χ240 . If u became smaller, the library size had largervariances. Figure 2 shows the estimated type-1 errorrates at the 0.05 significance level according to differ-ent choices of u. Figure 2 shows that voom was sensi-tive to the amount of library size variation andbecame conservative in the context of large librarysize variation. Compared with voom, BALLI and LLIwere robust with regard to u. The estimated type-1error rates of LLI were affected by sample size. If Nwas larger than or equal to 40, LLI controlled thetype-1 error rates the most correctly and was not af-fected by the library size variation. BALLI was slightlyconservative, but the amount remained constant. Re-sults at the 0.005 significance level are provided inAdditional file 7, and the general pattern was similarto that in Fig. 2, except that DESeq2 was conservativeat a significance level of 0.05 and liberal at 0.005
(Additional file 7). Therefore, we could conclude thatthe performances of BALLI and LLI were robust, andwe recommend using BALLI if 10 ≤N ≤ 40, and LLI ifN > 40.Figure 3 and Additional file 8 show the estimated statis-
tical powers and precision according to different choices ofu. BALLI usually had the best estimated power and preci-sion, as was observed in simulation studies based on realRNA-seq data. For example, when u = 0.2, N = 28, and δ =0.5σ, the estimated power by BALLI was 0.438, whereasthose for DESeq2 and voom were 0.317 and 0.352, respect-ively (Fig. 3a). Results when u = 0.2 and δ = 1σ in Fig. 3c arevery similar as those for Fig. 3a. Figure 3b and d also showthat BALLI achieves the best estimated precisions. Similarpatterns were observed when u = 0.4, 0.6, 0.8, and 1 (Add-itional file 8). In summary, we can conclude that BALLIshows better performance than that of other methods.
DEGs of Holstein milk dataHolstein milk data of 21 Holstein cows were generatedto detect genes related to the daily productivity of milk.High and low milk yields were considered the primaryexposure variables, and parity and lactation period wereincluded as covariates, as in the original study [26]. Inthis study, 12 tentative DEGs were chosen and
Fig. 2 Effect of varying library sizes on the type-1 error rates. Type-1 error rates were estimated by BALLI, DESeq2, edgeR, LLI, and voom whenu = 0.2, 0.4, 0.6, 0.8, and 1 and sample size (N) is 12 (a), 16 (b), 20 (c), 24 (d), 28 (e), 40 (f), 64 (g), and 68 (h) at the 0.05 nominal significance level
Park et al. BMC Genomics (2019) 20:540 Page 9 of 14
technically validated using quantitative real-time poly-merase chain reaction (qRT-PCR). qRT-PCR was con-ducted with QuantiTect SYBR Green RT-PCR MasterMix (Qiagen, Valencia, CA, USA), and a 7500 Fast Se-quence Detection System (Applied Biosystems, FosterCity, CA, USA) was used to confirm whether the 12 ten-tative genes were true DEGs. Among the 12 genes, nine(TOX4, HNRNPL, SPTSSB, NOS3, C25H16orf88,KALRN, SLC4A1, NLN, and PMCH) were significantlyvalidated. According to Seo et al. (2016), however, noDEGs, including the nine genes, were found at an FDR
0.1 significance level by DESeq2, voom, and theirmethods owing to the lack of statistical power [26]. Ourproposed methods and existing methods (DESeq2,edgeR, and voom) were applied to the data analysis. LLIonly detected significant differences for the TOX4 genebetween the high and low milk yield groups at the FDR-adjusted 0.1 significance level, but other methods didnot detect any significant genes. The FDR-adjusted pvalue of TOX4 by BALLI was 0.1267, which was muchsmaller than those of DESeq2, edgeR, and voom. Table 3shows p values for the nine genes, including TOX4. P
Fig. 3 Effect of varying library sizes on the statistical power and precision. Statistical powers and precisions for BALLI, DESeq2, edgeR, LLI, andvoom were empirically estimated at FDR-adjusted 0.1 significance level when u = 0.2, δ = 0.5σ or 1σ and sample size (N) is 12, 16, 20, 24, 28, 40,64, and 68. a Estimated power when u=0.2 and δ=0.5σ. b Estimated precision when u=0.2 and δ=0.5σ. c Estimated power when u=0.2 and δ=1σ.d Estimated precision when u=0.2 and δ=1σ
Park et al. BMC Genomics (2019) 20:540 Page 10 of 14
values for most of the nine genes obtained by LLI andBALLI were smaller than those obtained by othermethods. The nine genes were not significantly affectedby parity or lactation period (Additional file 9). We alsoanalyzed all genes by the proposed methods; Fig. 4shows the number of genes that were significant at the0.001 nominal significance level. There were no DEGsthat were commonly significant only for all existingmethods (DESeq2, edgeR and voom). Three genes, in-cluding HNRNPL, were detected as DEGs by only BALLIand LLI (Fig. 4). Table 4 shows 12 genes that were com-monly significant by BALLI, DESeq2, edgeR, LLI, andvoom at the 0.005 nominal significance level. Of the 12
genes, all genes except C4BPA had the lowest p valuesin LLI, and three genes had lower p values in BALLIthan in DESeq2, edgeR, and voom. This analysis withBALLI took 153.35 s using an Intel Xeon E7–4820 2.00GHz processor, which was quite a bit longer than othermethods, voom (2.81 s), DESeq2 (27.21 s) and edgeR(33.25 s) (Additional file 10). However, BALLI can con-duct the multi-threaded analyses with a simple optionunlike other methods and analyses of whole genes canbe completed within a reasonable time. For example, theanalyses with 20 cores took only 11.99 s under the sameconditions (Additional file 10). Simulation studies re-vealed that LLI tended to be liberal, with the results
Table 3 True DEG analysis results of Holstein milk data. Holstein milk data was analyzed by BALLI, DESeq2, edgeR, LLI, and voomand their p values (FDRs) are provided
BALLI DESeq2 edgeR LLI voom
TOX4 1.058 × 10−5 (1.267 × 10− 1) 2.949 × 10−4 (6.298 × 10− 1) 2.797 × 10−2 (1) 7.476 × 10−7 (8.947 × 10− 3) 4.980 × 10− 3 (9.999 × 10− 1)
HNRNPL 3.913 × 10− 4 (9.222 × 10− 1) 1.551 × 10− 3 (9.997 × 10− 1) 1.684 × 10− 1 (1) 6.796 × 10− 5 (1.787 × 10− 1) 4.677 × 10− 2 (9.999 × 10− 1)
SPTSSB 4.457 × 10− 4 (9.222 × 10− 1) 2.660 × 10− 3 (9.997 × 10− 1) 2.160 × 10− 4 (1) 8.686 × 10− 5 (1.787 × 10− 1) 1.037 × 10− 3 (9.999 × 10− 1)
NOS3 4.676 × 10−4 (9.222 × 10− 1) 2.396 × 10− 4 (6.928 × 10− 1) 2.549 × 10− 4 (1) 8.957 × 10− 5 (1.787 × 10− 1) 8.693 × 10− 4 (9.999 × 10− 1)
SLC4A1 2.145 × 10−2 (9.999 × 10− 1) 8.753 × 10− 2 (9.997 × 10− 1) 1.025 × 10− 1 (1) 9.856 × 10− 3 (7.179 × 10− 1) 7.579 × 10− 2 (9.999 × 10− 1)
NLN 9.513 × 10− 2 (9.999 × 10− 1) 3.028 × 10− 1 (9.997 × 10− 1) 3.789 × 10− 1 (1) 6.084 × 10− 2 (9.774 × 10− 1) 1.214 × 10− 1 (9.999 × 10− 1)
KALRN 9.792 × 10− 2 (9.999 × 10− 1) 8.943 × 10− 2 (9.997 × 10− 1) 8.815 × 10− 2 (1) 6.307 × 10− 2 (9.790 × 10− 1) 1.054 × 10− 1 (9.999 × 10− 1)
PMCH 1.635 × 10− 1 (9.999 × 10− 1) 2.225 × 10− 1 (9.997 × 10− 1) 2.353 × 10− 1 (1) 1.176 × 10− 1 (9.999 × 10− 1) 1.758 × 10− 1 (9.999 × 10− 1)
C25H16orf88 1.765 × 10− 1 (9.999 × 10− 1) 1.494 × 10− 1 (9.997 × 10− 1) 2.627 × 10− 1 (1) 1.289 × 10− 1 (9.999 × 10− 1) 1.516 × 10− 1 (9.999 × 10− 1)
Fig. 4 Significant genes of Holstein milk data. Venn diagram was provided with significant genes at the 0.001 nominal significance level by BALLI,DESeq2, edgeR, LLI, and voom
Park et al. BMC Genomics (2019) 20:540 Page 11 of 14
likely to be inflated. However, BALLI controlled thenominal significance level, and p values by BALLI wereexpected to be statistically valid. Therefore, we con-cluded that the proposed method, BALLI, worked wellfor real data analysis.
DiscussionIn this article, we suggested new methods, designatedBALLI and LLI, for identifying DEGs with RNA-seqdata. We assumed that log-cpm values of read countsasymptotically followed normal distributions, and thelinear mixed-effects model with Bartlett’s correction wasproposed. The proposed methods were compared withexisting methods, such as DESeq2, edgeR, and voom,with extensive simulation studies. According to our re-sults, negative-binomial-based approaches often failed topreserve the nominal type-1 error rates. For example, pvalues from edgeR were inflated. DESeq2 tended to beconservative and suffered from large false-negative rates.However, the proposed method with Bartlett’s correc-tion, BALLI, preserved the nominal type-1 error ratesand was the most powerful method other than LLI. Un-less sample sizes were small, LLI controlled the type-1error rates as well and was the most powerful method.Therefore, we recommend using LLI if the sample size issufficiently large (e.g., larger than 40); otherwise, it isbetter to use BALLI.Furthermore, we evaluated the effects of library size
variations on statistical analyses. We found that librarysize variance could affect the estimated type-1 errorrates, and the effect was the largest for voom. Librarysizes are affected by multiple factors, such as the amountof mRNA and the sequencing instrument, which cangenerate substantial variation among library sizes for
subjects. Our simulation studies showed that BALLI wasrobust with regard to library size variation in samples ofvarious sizes and was a reasonable choice if large librarysize variance was observed.The proposed methods assumed that log-cpm values
of read counts asymptotically followed a normal distri-bution and that their variances were approximatelyequal to 1/μ + ϕ with first order approximation. Inaddition, voom considered log-cpm value as a responseand assumed that they were normally distributed. How-ever, our simulation studies revealed the superiority ofthe proposed methods compared with voom, which wasfound to be attributable to their different variance struc-tures. For the proposed methods, 1/μ + ϕ was derivedfrom the first order approximation of the negative bino-mial distribution and thus may be a natural assumptionfor RNA-seq data. Furthermore, for 1/μ + ϕ, ϕ obviouslyindicates the overdispersion parameter, and biologicaland technical variances can be estimated with BALLI.However, voom assumes ϕ/μ, and the amount attribut-able to biological or technical variances cannot be clearlydefined.We also suggested the most flexible and general linear
mixed model for log-cpm. The proposed model assumedthat the variance of log-cpm was φ/μ + ϕ and had themost generalized variance parameter space. Incorpor-ation of φ = 1 yielded BALLI and LLI, and ϕ = 0 yieldedvoom. We found that BALLI was the most efficient inthe considered scenarios; however, in real data analyses,various factors affected variance structure. For example,subjects with different ethnicities can cause φ to belarger than 1, and thus, a better model may differ ac-cording to RNA-seq data. φ and ϕ can be estimatedwith the proposed linear mixed model by implement-ing only a simple modification, and thus, we can
Table 4 Significant genes in all methods of Holstein milk data. Gene lists of Holstein milk data siginificant in nominal 0.005significant level for all methods (BALLI, DESeq2, edgeR, LLI, and voom) and their p values (FDRs) are provided
BALLI DESeq2 edgeR LLI voom
SPTSSB 4.457 × 10−4 (9.222 × 10− 1) 2.660 × 10− 3 (9.997 × 10− 1) 2.160 × 10− 4 (1) 8.686 × 10− 5 (1.787 × 10− 1) 1.037 × 10− 3 (9.999 × 10− 1)
NOS3 4.676 × 10− 4 (9.222 × 10− 1) 2.396 × 10− 4 (6.298 × 10− 1) 2.549 × 10− 4 (1) 8.957 × 10− 5 (1.787 × 10− 1) 8.693 × 10− 4 (9.999 × 10− 1)
FXYD3 7.198 × 10−4 (9.507 × 10− 1) 2.813 × 10− 4 (6.298 × 10− 1) 5.182 × 10− 4 (1) 1.513 × 10− 4 (2.012 × 10− 1) 2.097 × 10− 3 (9.999 × 10− 1)
SPESP1 1.103 × 10− 3 (9.507 × 10− 1) 4.834 × 10− 3 (9.997 × 10− 1) 1.669 × 10− 3 (1) 2.579 × 10−4 (2.385 × 10− 1) 4.001 × 10− 3 (9.999 × 10− 1)
CHST1 1.339 × 10− 3 (9.507 × 10− 1) 8.624 × 10− 4 (9.383 × 10− 1) 1.872 × 10− 3 (1) 3.166 × 10− 4 (2.385 × 10− 1) 6.108 × 10− 4 (9.999 × 10− 1)
LEPREL1 1.387 × 10− 3 (9.507 × 10− 1) 2.554 × 10− 3 (9.997 × 10− 1) 3.297 × 10− 3 (1) 3.334 × 10− 4 (2.385 × 10− 1) 1.487 × 10− 3 (9.999 × 10− 1)
JUB 1.486 × 10− 3 (9.507 × 10− 1) 1.755 × 10− 3 (9.997 × 10− 1) 2.445 × 10− 3 (1) 3.674 × 10− 4 (2.385 × 10− 1) 2.804 × 10− 3 (9.999 × 10− 1)
MIA 1.509 × 10− 3 (9.507 × 10− 1) 4.210 × 10− 4 (6.298 × 10− 1) 2.247 × 10− 3 (1) 3.666 × 10− 4 (2.385 × 10− 1) 2.432 × 10− 3 (9.999 × 10− 1)
C4BPA 2.241 × 10− 3 (9.999 × 10− 1) 3.190 × 10− 4 (6.298 × 10− 1) 1.307 × 10− 3 (1) 6.000 × 10− 4 (2.951 × 10− 1) 2.866 × 10− 3 (9.999 × 10− 1)
CLDN6 2.653 × 10− 3 (9.999 × 10− 1) 1.282 × 10−3 (9.997 × 10− 1) 2.414 × 10− 3 (1) 7.377 × 10− 4 (3.270 × 10− 1) 1.542 × 10− 3 (9.999 × 10− 1)
PALMD 3.737 × 10− 3 (9.999 × 10− 1) 1.151 × 10− 3 (9.997 × 10− 1) 2.648 × 10−3 (1) 1.127 × 10− 3 (3.776 × 10− 1) 3.032 × 10− 3 (9.999 × 10− 1)
KLK12 3.840 × 10− 3 (9.999 × 10− 1) 2.766 × 10− 3 (9.997 × 10− 1) 9.826 × 10− 4 (1) 1.225 × 10− 3 (3.776 × 10− 1) 3.162 × 10− 3 (9.999 × 10− 1)
Park et al. BMC Genomics (2019) 20:540 Page 12 of 14
choose the best model using AIC or LRTs. The se-lected models can then be utilized to identify DEGs.This model was implemented as an R package andcan be downloaded from CRAN (http://cran.r-project.org) or http://healthstat.snu.ac.kr/software/balli/. Fur-thermore, in contrast to methods based on a general-ized linear mixed model such as MACAU [27], theproposed methods can be easily extended to variousscenarios with a simple modification. For example, re-peatedly observed data or multivariate phenotypes canbe analyzed by adding some random effects. Maximiz-ing the likelihood for negative binomial or Poissondistributions with random effects is computationallyintensive, but the proposed methods can easily obtainvariance parameter estimates using existing R pack-ages, such as lme4 and nlme.With simulation studies for various scenarios, we
showed that the proposed methods were usually themost efficient. However, Bartlett’s correction seemsinappropriate when N ≤ 6 (Additional files 3, 4, and 6)and the corrected LR statistic was smaller thanexpected in some cases, especially in simulation stud-ies based on a negative binomial distribution (Table 2).Further studies are necessary to adjust this. Addition-ally, results from simulation studies obviouslydepended on various factors. Our results were ob-tained from simulation data based on RNA-seq datafrom Nigerian individuals and Holstein cows RNA-seqdata and random samples from negative binomial dis-tributions, but any systematic differences in RNA-seqdata could generate different results, depending on se-quencing errors or differences in preparation steps.Multiple studies have revealed some possible differ-ences in these relationships, and our conclusionsbased on simulation studies could be limited to theconsidered scenarios. However, despite such limita-tions, we believe that our results illustrate the prac-tical value of the proposed methods. Further studiesare needed to confirm our findings and expand on thework presented herein.
ConclusionsIn this article, we proposed likelihood-based linearmixed model approaches with and without Bartlett’s cor-rection to analyze more complicated RNA-seq data. Theproposed methods consider log-cpm values of genes asresponse variables, and technical and biological vari-ances are estimated with a linear mixed model. Accord-ing to our simulation studies and real data analysis, ourmethods are statistically more efficient than existingmethods and correctly control the type-1 error rates.We found that the statistical performance of ourmethod, BALLI and LLI, depends on the sample size; we
recommend using LLI if the sample size is larger than40 and otherwise using BALLI.
Additional files
Additional file 1: The forms of Cg's components depending on thestructure of Vg. (DOCX 17 kb)
Additional file 2: Steps of generating simulation data. (DOCX 18 kb)
Additional file 3: Estimated type-1 error rates with simulation data for N= 4, 6, 8, 28, 40, 64, and 68 based on Nigerian people’s data. (DOCX 21kb)
Additional file 4: Estimated type-1 error rates with simulation data for N= 4, 6, 8, 12, 16 and 20 based on Holstein cow’s data. (DOCX 20 kb)
Additional file 5: Estimated powers and precisions with simulationdata when δ = 0.5σ or 1σ and N = 12, 16 and 20 based on Holsteincow’s data. (DOCX 197 kb)
Additional file 6: Estimated type-1 error rates with simulation data for N= 4, 6, 8, 28, 40, 64, and 68 based on simulated data from negative bino-mial distribution. (DOCX 21 kb)
Additional file 7: Effect of varying library sizes on the type-1 error rateswhen u = 0.2, 0.4, 0.6, 0.8, and 1 and N = 12, 16, 20, 24, 28, 40, 64, or 68at the 0.005 nominal significance level. (DOCX 371 kb)
Additional file 8: Effect of varying library sizes on the statistical powerand precision when u = 1, 0.8, 0.6, or 0.4, δ = 1σ and N = 12, 16, 20, 24,28, 40, 64, or 68. (DOCX 593 kb)
Additional file 9: Analysis of covariables, parity and lactation period, fortrue nine DEGs of Holstein cow's data. (DOCX 18 kb)
Additional file 10: Computation time for each method when analyzingHolstein cow’s data. (DOCX 14 kb)
AbbreviationsBALLI: Bartlett-Adjusted Likelihood based LInear mixed model approach;CPM: Counts Per Million; DEGs: Differentially Expressed Genes; FDR: FalseDiscovery Rate; Log-CPM: Log-transformed CPM; LOWESS: Locally WEightedScatterplot Smoothing; LRT: Likelihood Ratio Test; MLE: Maximum LikelihoodEstimate; qRT-PCR: quantitative Real Time Polymerase Chain Reaction; RNA-seq: RNA sequencing
AcknowledgementsSungho Won and Taesung Park contributed equally to this work ascorresponding authors.
Authors’ contributionsKP, TP, and SW designed and led the study, and wrote the manuscript. KPdeveloped BALLI R package. JA, JG, MS, and WL advised on statisticalmethods and data analysis. All authors have read and approved the finalversion of the manuscript.
FundingThis research was supported by a grant of the Korea Health Technology R&DProject through the Korea Health Industry Development Institute (KHIDI),funded by the Ministry of Health & Welfare, Republic of Korea (HI16C2037)and the National Research Foundation of Korea (2017M3A9F3046543). Thefunding body was not involved in the study design, data collection, analysisand interpretation, and writing the manuscript.
Availability of data and materialsWe downloaded Nigerian samples from ReCount website (http://bowtie-bio.sourceforge.net/recount/countTables/montpick_count_table.txt) and Holsteinsamples from Gene Expression Omnibus (GSE60575).
Ethics approval and consent to participateData used in this article were publicly available.
Consent for publicationNot applicable.
Park et al. BMC Genomics (2019) 20:540 Page 13 of 14
Competing interestsThe authors declare that they have no competing interests.
Author details1Interdisciplinary Program of Bioinformatics, Seoul National University, Seoul08826, South Korea. 2Department of Public Health Science, Seoul NationalUniversity, Seoul 08826, South Korea. 3Department of Biomedical Science,Chosun University, Gwangju 61452, South Korea. 4Channing Division ofNetwork Medicine, Brigham and Women’s Hospital, Boston, MA, USA.5Department of Medicine, Harvard Medical School, Boston, MA, USA.6Department of Statistics, Inha University, Incheon 22212, South Korea.7Department of Statistics, Seoul National University, Seoul 08826, SouthKorea. 8Institute of Health and Environment, Seoul National University, Seoul08826, South Korea.
Received: 23 October 2018 Accepted: 28 May 2019
References1. Okoniewski MJ, Miller CJ. Hybridization interactions between probesets in
short oligo microarrays lead to spurious correlations. BMC Bioinf. 2006;7:276.2. Dobbin KK, Kawasaki ES, Petersen DW, Simon RM. Characterizing dye bias in
microarray experiments. Bioinformatics. 2005;21(10):2430–7.3. Zhao S, Fung-Leung W-P, Bittner A, Ngo K, Liu X. Comparison of RNA-Seq
and microarray in transcriptome profiling of activated T cells. PLoS One.2014;9(1):e78644.
4. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping andquantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008;5(7):621–8.
5. Dillies MA, Rau A, Aubert J, Hennequet-Antier C, Jeanmougin M, Servant N,Keime C, Marot G, Castel D, Estelle J, et al. A comprehensive evaluation ofnormalization methods for Illumina high-throughput RNA sequencing dataanalysis. Brief Bioinform. 2013;14(6):671–83.
6. Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y. RNA-seq: anassessment of technical reproducibility and comparison with geneexpression arrays. Genome Res. 2008;18(9):1509–17.
7. Bullard JH, Purdom E, Hansen KD, Dudoit S. Evaluation of statistical methodsfor normalization and differential expression in mRNA-Seq experiments.BMC Bioinf. 2010;11(1):94.
8. Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package fordifferential expression analysis of digital gene expression data.Bioinformatics. 2010;26(1):139–40.
9. Love MI, Huber W, Anders S. Moderated estimation of fold change anddispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15(12):550.
10. Law CW, Chen Y, Shi W, Smyth GK. Voom: precision weights unlock linearmodel analysis tools for RNA-seq read counts. Genome Biol. 2014;15(2):R29.
11. Robinson MD, Smyth GK. Moderated statistical tests for assessing differencesin tag abundance. Bioinformatics. 2007;23(21):2881–7.
12. Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied tothe ionizing radiation response. Proc Natl Acad Sci U S A. 2001;98(9):5116–21.
13. McCarthy DJ, Chen Y, Smyth GK. Differential expression analysis ofmultifactor RNA-Seq experiments with respect to biological variation.Nucleic Acids Res. 2012;40(10):4288–97.
14. Litière S, Alonso A, Molenberghs G. The impact of a misspecified random-effects distribution on the estimation and the performance of inferentialprocedures in generalized linear mixed models. Stat Med. 2008;27(16):3125–44.
15. Chavance M, Escolano S. Misspecification of the covariance structure ingeneralized linear mixed models. Stat Methods Med Res. 2016;25(2):630–43.
16. Seyednasrollah F, Laiho A, Elo LL. Comparison of software packages fordetecting differential expression in RNA-seq studies. Brief Bioinform. 2015;16(1):59–70.
17. Soneson C, Delorenzi M. A comparison of methods for differentialexpression analysis of RNA-seq data. BMC Bioinf. 2013;14:91.
18. Bartlett MS. Properties of sufficiency and statistical tests. Proc R Soc Lond AMath Phys Sci. 1937;160(901):268–82.
19. Robinson MD, Oshlack A. A scaling normalization method for differentialexpression analysis of RNA-seq data. Genome Biol. 2010;11(3):R25.
20. Cox DR, Reid N. Parameter orthogonality and approximate conditionalinference. J R Stat Soc Ser B Methodol. 1987;49(1):1–18.
21. Zucker DM, Lieberman O, Manor O. Improved small sample inference in themixed linear model: Bartlett correction and adjusted likelihood. J R Stat SocSeries B Stat Methodol. 2000;62(4):827–38.
22. Melo TF, Ferrari SL, Cribari-Neto F. Improved testing inference in mixedlinear models. Comput Stat Data Anal. 2009;53(7):2573–82.
23. Brent RP. Algorithms for minimization without derivatives; 1973.24. Pickrell JK, Marioni JC, Pai AA, Degner JF, Engelhardt BE, Nkadori E, Veyrieras
JB, Stephens M, Gilad Y, Pritchard JK. Understanding mechanismsunderlying human gene expression variation with RNA sequencing. Nature.2010;464(7289):768–72.
25. Frazee AC, Langmead B, Leek JT. ReCount: a multi-experiment resource ofanalysis-ready RNA-seq gene count datasets. BMC Bioinf. 2011;12:449.
26. Seo M, Kim K, Yoon J, Jeong JY, Lee H-J, Cho S, Kim H. RNA-seq analysis fordetecting quantitative trait-associated genes. Sci Rep. 2016;6:24375.
27. Sun S, Hood M, Scott L, Peng Q, Mukherjee S, Tung J, Zhou X. Differentialexpression analysis for RNAseq using Poisson mixed models. Nucleic AcidsRes. 2017;45(11):e106.
Publisher’s NoteSpringer Nature remains neutral with regard to jurisdictional claims inpublished maps and institutional affiliations.
Park et al. BMC Genomics (2019) 20:540 Page 14 of 14