supplementary materials and methods for · 2018-09-24 · (supplementary table 2a and supplementary...
SUPPLEMENTARY NOTE
Genome size estimation by K-mer and flow cytometry analysis
A K-mer is a sequence of K nucleotides. K-mer statistics were used to give the discrete
probability distribution of the possible K-mer combinations1. We counted the copy number
of each K-mer (17-mer) present in the sequence reads, relative to the total length of the
sequence reads, and then plotted the distribution of copy number. The K-mer distribution
can be used to infer the genome size, and the peak of the frequency curve represents the
overall sequencing depth. The estimate is calculated as (N×(L-K+1)-B)/D = G, where N is
the total number of sequence reads, L is the average length of sequence reads, K is the
K-mer length, B is the total number of low-frequency 17-mers, D is the overall depth
estimated from the K-mer distribution, and G denotes the genome size. To minimize the
influence of sequencing errors, K-mers with low frequency (< 3) were discarded. The peak
frequency of 17-mers is at about 50X depth for B. juncea (Supplementary Fig. 1) and 45X
depth for B. nigra (Supplementary Fig. 5).
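As a minimal sketch of the formula (N×(L-K+1)-B)/D = G, the calculation can be written as below. The function name and the input values are hypothetical and serve only to illustrate the arithmetic; they are not the read counts of this study.

```python
def estimate_genome_size(n_reads, mean_read_len, k, low_freq_kmers, peak_depth):
    """Return G = (N * (L - K + 1) - B) / D, in bases."""
    if peak_depth <= 0:
        raise ValueError("peak_depth must be positive")
    # Each read of length L contributes L - K + 1 K-mers.
    total_kmers = n_reads * (mean_read_len - k + 1)
    # Remove low-frequency (likely erroneous) K-mers, then divide by depth.
    return (total_kmers - low_freq_kmers) / peak_depth


# Hypothetical inputs, chosen only to show the calculation.
genome_size = estimate_genome_size(
    n_reads=600_000_000, mean_read_len=100, k=17,
    low_freq_kmers=400_000_000, peak_depth=50,
)
print(f"{genome_size / 1e6:.0f} Mb")
```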
In addition, we employed flow cytometry2 analysis to estimate the genome size of B. juncea.
The genome of O. sativa (Nipponbare) was taken as the control3 (Supplementary Table 1). The
estimated genome size of B. juncea is slightly smaller than previously published estimates
(984 to 1006 Mb) obtained by flow cytometry analysis without a control4,5.
High-throughput sequencing
Whole genome sequencing for B. juncea and B. nigra
A B. juncea var tumida inbred line (T84-66) with excellent agronomic traits and widely
Nature Genetics: doi:10.1038/ng.3657
used as a parent in hybrid breeding and a B. nigra double haploid line (YZ12151) were
used for the reference genome sequencing. The genomic DNAs were extracted from
leaves with a standard CTAB extraction method. Genomic sequences were generated
using Illumina HiSeq™ 2000 & 2500 sequencing platforms with PE and MP libraries
(Supplementary Table 2a and Supplementary Table 10). Additionally, approximately 10 X
coverage of genome sequence was generated from 17 B. juncea cultivars, consisting of 10
vegetable- and 7 oil-use sub-varieties, for the analysis of selection by crop usage
(Supplementary Table 24). Low-depth (<1 X) genome sequences of 27 representative B.
rapa accessions were generated to study the A-subgenome of B. juncea (Supplementary
Table 25).
Single-molecule sequencing of B. juncea based on PacBio platform
The total DNA was extracted from the leaves. Nanodrop, Qubit2.0 and gel electrophoresis
were used to assess the DNA purity, concentration and integrity respectively. Seven μg
total DNA was used to construct a 20 kb DNA library for PacBio RS II platform (PacBio,
USA) sequencing according to the standard protocol. A total of 11.09 Gb of SMRT data
were generated, covering about 12 X of the B. juncea genome after QC (Supplementary
Figure 2).
Genome (Optical) maps of B. juncea based on IRYS system
Young leaves were preprocessed according to the IrysPrep™ Plant Tissue-Nuclei protocol
and the DNA was extracted following the IrysPrep™ Plus Long DNA Isolation protocol. DNA
at a concentration of 30 to 200 ng/µl and a total amount of 300 ng was used for the
subsequent experiments. The nicking, labelling, repair and staining processes
were performed in strict accordance with the IrysPrep™ Labeling-NLRS (300 ng) protocol.
A total of 996,648 BioNano molecules were obtained, with a total length of 205 Gb,
covering around 222 X of the B. juncea genome. The optical maps were assembled using
Irys-scaffolding with default parameters, and 922 optical maps were obtained with an
average length of 1.19 Mb (Supplementary Table 4).
RNA-seq of B. juncea and B. nigra
Total RNA from each tissue (root, stem, leaf, flower and silique) was extracted from B.
juncea and B. nigra according to the instruction manual of the Trizol Reagent (Life
Technologies, California, USA). Equal amounts of high-quality RNA from each tissue were
then pooled for cDNA library construction for B. juncea and B. nigra, respectively.
Approximately 11.56 Gb and 4 Gb of transcriptomic data were generated for B.
juncea and B. nigra, respectively, using the Illumina HiSeq™ 2000 sequencing platform
(Illumina, USA) with the standard pipeline. The usable reads (after removing low-quality
reads) obtained from all samples were de novo assembled using Trinity6 (Supplementary
Table 20a and Table 20b).
Genome assembly and annotation for B. juncea and B. nigra
Raw data preprocessing
In order to facilitate the assembly, a series of checking and filtering measures
corresponding to different platforms were performed.
The following criteria were used to filter Illumina low-quality reads:
1. Filter reads in which unknown nucleotides ('N') make up > 5% of bases.
2. Filter low-quality reads whose average PHRED-like score is < 20 (i.e., an average
sequencing error rate > 1%).
3. Clip bases with PHRED score < 20 at the end of reads. For long library reads, reads
shorter than 30 bp after clipping of low-quality bases were discarded.
4. Filter reads with adapter contamination. Reads with more than 10 bp aligned to the
adapter sequence (allowing up to 3 bp of mismatch) were removed.
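Criteria 1 to 3 above can be sketched as a single per-read filter. This is an illustrative sketch rather than the pipeline actually used: the function names are hypothetical, clipping is assumed to act on the 3' end only, and the adapter check of criterion 4 is omitted because it requires an aligner.

```python
def clip_3prime(seq, quals, min_q=20):
    """Trim bases with PHRED score < min_q from the 3' end (assumed end)."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


def filter_read(seq, quals, max_n_frac=0.05, min_mean_q=20, min_q=20, min_len=30):
    """Return the clipped (seq, quals), or None if the read is discarded."""
    if not seq:
        return None
    # Criterion 1: more than 5% unknown nucleotides.
    if seq.upper().count("N") / len(seq) > max_n_frac:
        return None
    # Criterion 2: average PHRED-like score below 20.
    if sum(quals) / len(quals) < min_mean_q:
        return None
    # Criterion 3: clip low-quality bases, then enforce the minimum length.
    seq, quals = clip_3prime(seq, quals, min_q)
    if len(seq) < min_len:
        return None
    return seq, quals
```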
In all, 1,653,212 raw reads were produced from 11 PacBio cells. Reads with quality < 0.75
or length < 500 bp were filtered out, and a total of 626,640 reads were retained. Next,
the SMRT reads were corrected by Ectools7 using the ALLPATHS-LG assembled contigs with
default parameters, and corrected reads longer than 3 kb were retained.
Basic processing of the BioNano raw data was performed using the IrysView package.
Molecules with length > 100 kb, label SNR >= 3.0 and average molecule intensity < 0.6
were retained for further genome assembly.
Genome assembly for B. juncea
First, all the Illumina reads that passed the above filtering and correction steps were
used for de novo assembly by ALLPATHS-LG8 with the default parameters. Then all the
corrected PacBio RS II reads were used to fill the gaps by PBjelly_V15.2.209 with parameters:
--minGap 1 -minMatch 8 -minPctIdentity 70 -bestn 1 -nCandidates 20 -maxScore -500
-nproc 5 -noSplitSubreads, which resulted in a genome assembly with a total size of 784 Mb.
Next, the RefAligner utility in IrysView was used to align the Irys molecules to the draft
assembly in order to correct chimeric scaffolds. Scaffolds were to be broken at the gaps
nearest to the candidate enzyme sites (Supplementary Figure 3).
The candidate break sites were screened under the following conditions:
1. Scaffolds are longer than 100 kb and carry more than 20 enzyme sites.
2. Candidate sites are not located at the edges of a scaffold (< 10 sites or < 50 kb from
the start or end base).
3. Candidate sites are “covered across” by fewer than 3 molecules (“covered across” means
that more than 2 sites are matched on both sides of the candidate break site).
4. There are more than 3 enzyme sites between two candidate break sites.
Altogether, 180 scaffolds were broken at 233 candidate break sites. Finally, the
corrected scaffolds were anchored to the optical maps.
Genome assembly for B. nigra
For the B. nigra genome assembly, the high-quality Illumina reads were used for de novo
assembly by ALLPATHS-LG8 with the default parameters. GapCloser (GapCloser v1.12 for
SOAPdenovo10) was used to fill gaps and improve the quality of the scaffolds using the
short paired-end libraries (insert size < 1 kb).
Genome quality assessment
We used the CEGMA v.2.3 method11, which includes 458 conserved core eukaryotic genes
(CEG database12), to assess the completeness of the final genome assemblies of B.
juncea and B. nigra (Supplementary Table 14a).
The assembled genomes of B. juncea and B. nigra were also validated by mapping
23,002 ESTs and 18,344 ESTs (length >= 500 bp) downloaded from NCBI (GenBank) to
the corresponding genome (Supplementary Table 14b).
To assess the accuracy of the B. juncea and B. nigra genomes assembled from HiSeq
sequencing data, we randomly selected 10 sub-reads longer than 40 kb from the PacBio data
for B. juncea and downloaded 15 BAC sequences from GenBank for B. nigra. First, the 10
sub-reads were mapped to the B. juncea genome using BLASR53, and the 15 BAC sequences were
anchored to the assembled B. nigra genome using blastn. Then, the blastn results were
chained into larger syntenic regions to identify the corresponding scaffolds for each BAC.
Finally, the formulas (coverage = alignment length / BAC or sub-read length; identity =
matched length / BAC or sub-read length without gaps) were used to calculate the coverage
and identity for each sub-read and BAC sequence (Supplementary Table 12 and 13).
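The two formulas can be illustrated with a short sketch. The input layout is an assumption made for illustration: each chained alignment block is given as a hypothetical (start, end, matches) tuple in half-open query coordinates, the blocks are assumed not to overlap, and gap_len is the number of gap bases subtracted from the query length.

```python
def coverage_and_identity(blocks, query_len, gap_len=0):
    """Compute coverage and identity for one BAC or sub-read.

    blocks: non-overlapping (start, end, matches) tuples on the query.
    coverage = alignment length / query length
    identity = matched length / query length without gaps
    """
    aligned = sum(end - start for start, end, _ in blocks)
    matched = sum(matches for _, _, matches in blocks)
    ungapped_len = query_len - gap_len
    return aligned / query_len, matched / ungapped_len


# Hypothetical 40 kb sub-read aligned in two chained blocks.
cov, ident = coverage_and_identity(
    [(0, 20_000, 19_900), (25_000, 40_000, 14_900)], query_len=40_000
)
print(f"coverage={cov:.3f} identity={ident:.3f}")
```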
Furthermore, to inspect the paired-end relationships for B. juncea and B. nigra, the
mate-pair reads (3/5/10/15 kb for B. juncea, 3/5/10 kb for B. nigra) were mapped to the
whole assembled genome using SOAP13 (Supplementary Figure 6).
Genetic maps and pseudo-chromosome construction of B. juncea and B.
nigra
Genetic map of B. juncea
We constructed a reference genetic map of B. juncea based on genotyping by
whole-genome resequencing of an F2 population14,15. Two parental inbred lines, the
near-isogenic homozygous T84-63 (paternal line and the reference cultivar in our genome
sequencing project) and the B. juncea var. napiformis homozygous line ‘03A0106’ (maternal
line), were chosen to develop an F2 mapping population. In total, 100 individuals were
randomly selected from the F2 population for segregation analysis and genetic mapping
(Supplementary Table 8). PE reads generated by resequencing the two parental lines and
the 100 F2 lines on the Illumina HiSeq™ 2000 platform were aligned to the T84-63 draft
genome using BWA16 with default parameters. Potential SNPs were identified by GATK
v3.417. Before genotyping, the following criteria were applied to reduce the false
discovery rate of SNPs: 1) remove SNPs with effective depth lower than 10 for the paternal
line and 6 for the maternal line; 2) remove SNPs with MAF < 0.05; 3) remove SNPs with copy
number > 1.5. Because the F2 lines were sequenced at low coverage, the genotypes of the
offspring were imputed using the LB-Impute software18 to improve data integrity, with the
Markov trellis window set to a length of 5. After imputation, SNPs with integrity lower
than 0.7 were filtered out, yielding a marker set of 62,580 SNPs. The pair-wise
recombination rates of this marker set were calculated on each scaffold, and adjacent SNPs
with a pair-wise recombination rate of less than 0.001 were lumped into a genetic bin.
After excluding bins showing significantly distorted segregation (Chi-square test, P-value
< 0.01), a final set of 5,333 bin markers was grouped into 18 linkage groups
(Supplementary Table 9) using the Highmap software19.
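The bin-lumping and segregation-distortion steps can be sketched as follows. This is an illustration under stated assumptions, not the procedure implemented in the software cited above: the function names are hypothetical, recombination rates are assumed to be given between consecutive SNPs on one scaffold, and the chi-square test is assumed to compare genotype counts with the 1:2:1 ratio expected in an F2 population.

```python
import math


def lump_into_bins(recomb_rates, threshold=0.001):
    """Group adjacent SNPs into bins.

    recomb_rates[i] is the pair-wise recombination rate between SNP i and
    SNP i + 1. Returns a list of bins, each a list of SNP indices.
    """
    bins = [[0]]
    for i, rate in enumerate(recomb_rates):
        if rate < threshold:
            bins[-1].append(i + 1)   # same bin as the previous SNP
        else:
            bins.append([i + 1])     # start a new bin
    return bins


def f2_segregation_pvalue(n_aa, n_ab, n_bb):
    """Chi-square P-value against the assumed 1:2:1 F2 ratio (2 d.f.)."""
    n = n_aa + n_ab + n_bb
    expected = (n / 4, n / 2, n / 4)
    stat = sum((o - e) ** 2 / e for o, e in zip((n_aa, n_ab, n_bb), expected))
    # For 2 degrees of freedom the chi-square survival function is exp(-x/2).
    return math.exp(-stat / 2)


bins = lump_into_bins([0.0, 0.0005, 0.02, 0.0])
kept = f2_segregation_pvalue(25, 50, 25) >= 0.01   # bin retained
```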
Assignment of subgenomes and pseudochromosome construction
We sorted the BjuA and BjuB subgenomes of B. juncea with reference to the genetic map
(T84/DTC) constructed in this paper and the published SY/PM map20. The genome assembly
was assigned to the corresponding sub-genomes of B. juncea according to the integrated
information of the two genetic maps. The AllMaps software21 was used to construct the
initial pseudo-chromosomes of B. juncea from scaffolds using the genetic maps T84/DTC and
SY/PM. For scaffolds that were not anchored genetically, syntenic relationships between
B. juncea and its ancestral genomes B. rapa and B. nigra were investigated after
assignment of subgenomes. The final pseudo-chromosomes were constructed by combining
the information of the genetic maps T84/DTC and SY/PM and the syntenic map of the
genetically un-anchored scaffolds (Supplementary Fig. 5).
Genetic map of B. nigra
AllMaps software21 was also used to construct the initial pseudo-chromosomes of B.
nigra from scaffolds using the linkage group of T84/DTC.
Repeats annotation
The repeat sequences of the B. juncea and B. nigra genomes were identified with a
combination of de novo and homology-based strategies. The results from four de novo
programs, RepeatScout22, LTR-FINDER23, MITE24 and PILER25, were merged as the initial
repeat library. The initial repeat library was classified into classes, subclasses,
superfamilies and families by the PASTEClassifier.py script included with REPET26. We
then merged the TE sequences of Brassica species (B. juncea, B. nigra, B. rapa, B. oleracea
and B. napus) with the known Repbase database27 to construct a new repeat database.
Finally, this new repeat database was used to identify repeat sequences in the genome
assembly with RepeatMasker28 (Supplementary Table 15).
Gene model prediction and evaluation
Genes were annotated iteratively using three main approaches: homology-based (H), de
novo (D) and EST/unigene-based (C). The results of these three methods were integrated
by GLEAN29 to obtain high-confidence gene models by combining all evidence.
Homology-based method (H): Protein sequences from two sequenced eudicot species, A.
thaliana and B. rapa, from public databases were used to perform the prediction. We used
GeneWise (v2.2.0)30 to determine the accurate gene structure. For de novo prediction,
we used Augustus with parameters trained on unigenes from the transcriptome data, and
Genscan31 and GlimmerHMM32 with Arabidopsis parameters, to obtain de novo gene
models. In the third approach, unigenes were aligned to the genome assembly using
BLAT (identity >= 0.95, coverage >= 0.90) and then filtered using PASA.
After combining all evidence to generate gene models with GLEAN29, an RNA-seq-based
method, mapping the transcriptome data to the reference genome using TopHat33 and
assembling transcripts with Cufflinks33, was adopted to obtain the gene structures and
new genes. We filtered out short gene models (< 150 bp) and single-exon gene models to
generate the final gene set for further analysis (Supplementary Table 18).
Gene model evaluation
The resultant gene set contains 80,050 protein-coding gene models, with a mean CDS
size of 1,111.07 bp and an average of 4.57 exons per gene. We used the RNA-seq data to
evaluate the gene model prediction (Supplementary Table 20).
Gene function annotation
Gene functions were assigned according to the best match of the alignments against
various protein databases using BLASTP (E-value cut-off = 1e-5), including the
non-redundant protein (Nr) database and the Swiss-Prot database. Furthermore, unigenes
were searched against the NCBI non-redundant nucleotide sequence (Nt) database using
BLASTN with a cut-off E-value of 1e-5. Genes were retrieved based on the best BLAST hit
(highest score)
along with their protein functional annotation. InterProScan was run on the gene models to
provide a list of INTERPRO domains34,35 and GO terms for each B. juncea gene. In order
to predict the most probable function of the genes, all genes were aligned (E-value= 1e-5)
with KEGG proteins, and the pathways were considered present for B. juncea as long as
there were matches to B. juncea genes. The gene sequences were also aligned to the
Clusters of Orthologous Group (COG) database to predict and classify functions. Kyoto
Encyclopedia of Genes and Genomes (KEGG) pathways were assigned to the assembled
sequences using the online KEGG Automatic Annotation Server (Supplementary Table
19).
Non-coding RNA annotation
tRNAscan-SE (version 1.23) was applied to detect reliable tRNA positions, and other
non-coding RNAs (ncRNAs) were predicted by the Infernal software using default
parameters36,37. By comparing secondary structure similarity between the B. juncea and B.
nigra genomes and the Rfam database (v12.0)38, the ncRNAs were classified into different
families (Supplementary Table 21).
TE content comparison between allopolyploid subgenomes and its
diploid parents
To further increase the accuracy and precision of the comparison of TEs between the
sub-genomes and their ancestors, only TEs located in corresponding syntenic regions
without gaps (BjuA-BraA, BjuA-BnaA, BjuB-BniB or BolC-BnaC) were considered. This
stringent rule could effectively reduce the influence of assembly quality. For the detailed
method of syntenic block identification, see the section “Lost gene identification and
classification”.
Newly formed TE identification after divergence from its ancestors
The following criteria were used to identify new TEs in the BjuA subgenome (Supplementary
Fig. 8).
1. Filter out simple sequence repeats and short sequences (<200 bp) from the BjuA repeat
annotation result.
2. For each TE instance, we selected a pair of marker sequences of 200 bp in length
(purple block in Supplementary Fig. 8), located 1 kb upstream of the start site of the TE
and 1 kb downstream of the end site of the TE (M: blue block). The paired marker
sequences were then searched against the BraA genome using BLASTN (E-value < 1E-5).
This strategy ensures that the marker sequences are located in non-repeat regions.
3. To obtain a highly confident result, we only retained paired markers
satisfying the following criteria: 1) the paired markers found a highly conserved match
in the BraA genome (identity > 90% and matched length > 180 bp); 2) the
paired markers are located on the same chromosome; 3) the distance between the
paired markers is shorter than 2*M+TE (purple block in BraA).
4. According to the distance between the paired markers mapped to the BraA genome
(purple block in Supplementary Fig. 8), the TEs were classified into the four
cases described below.
A) If the distance between the paired markers in BraA was similar to that of the BjuA
counterpart and the TEs in both BjuA and BraA belonged to the same TE category, the TE was
regarded as a common TE.
B) If the distance between the paired markers was shorter than 20 bp, the TE was
regarded as a highly confident new TE in BjuA because the TE is absent in BraA.
C) If the distance between the paired markers in BraA was approximately equal to the
distance between the paired markers in BjuA (between 2*(M-L)-30 and 2*(M-L)+30, L:
length of the TE missing from the annotation), the TE in BjuA was regarded as an
“annotation-less” TE.
D) If the distance between the paired markers in BraA was approximately equal to the
distance between the paired markers in BjuA (between 2*(M+L)-30 and 2*(M+L)+30, L:
length of the genome sequence missing from the assembly), the TE in BjuA was regarded
as an “assembly-less” TE.
The same strategy was applied to identify new TEs in the subgenomes of B. juncea and B.
napus compared to their corresponding ancestral genomes after divergence from the common
ancestor (Supplementary Table 16a and 16b).
Newly formed TEs were verified by PCR amplification using degenerate primers
upstream and downstream of the TEs in B. juncea (Supplementary Fig. 9).
Newly formed TE model in allopolyploid genome (AABB and AACC)
Newly formed BjuA TEs come from two sources: intra-subgenome transposition, or
inter-subgenome transposition from BjuB. Each new TE was used as a query to search the
A and B ancestor genomes of B. juncea by BLASTN. The
sequence homology shows whether the new TE came from the A ancestor by intra-subgenome
transposition or from the B ancestor by inter-subgenome transposition. Alignment results
in which less than 50% of the length of the new TE sequence was matched were
filtered out. All TE categories were identified according to the following criteria.
1. Whether the difference in sequence identity between the subgenomes of B. juncea was
less than the threshold (5%).
2. The origins of TEs were separated according to the identity between the subgenomes of
B. juncea. If the identity difference was below 5%, the new TE was considered to belong to
the common category. Using the same approach, we re-annotated new TEs in the B. juncea
genome and identified cross-transposition TEs (Supplementary Table 17).
Gene losses in the reference genome
To call synteny blocks, we performed all-against-all BLASTP (E-value = 1e-5)39 and chained
the BLASTP hits by QUOTA-ALIGN40 (cscore = 0.5) with the “1:1 synteny screen”. At least 4
gene pairs were required for a synteny block, and two adjacent synteny blocks were merged
if the distance between them was less than 20 gene pairs. The “1:3 synteny
screen” model was used to identify synteny blocks between A. thaliana and Brassica by
QUOTA-ALIGN (cscore = 0.5) because of the whole-genome triplication41 in Brassica
evolutionary history.
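The block-size and merging rules can be sketched as below. This is a simplified illustration, not the QUOTA-ALIGN implementation: the function name and the (start, end, n_pairs) tuple layout are hypothetical, blocks are assumed to lie on the same chromosome pair, and the distance between blocks is measured in gene-order positions on one genome only.

```python
def merge_synteny_blocks(blocks, min_pairs=4, max_gap=20):
    """Filter and merge synteny blocks.

    blocks: (start, end, n_pairs) tuples in gene-order coordinates.
    Blocks with fewer than min_pairs gene pairs are dropped; adjacent blocks
    separated by fewer than max_gap genes are merged.
    """
    kept = sorted(b for b in blocks if b[2] >= min_pairs)
    merged = []
    for start, end, n_pairs in kept:
        if merged and start - merged[-1][1] < max_gap:
            prev_start, prev_end, prev_pairs = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), prev_pairs + n_pairs)
        else:
            merged.append((start, end, n_pairs))
    return merged


# Hypothetical blocks: the first two merge, the last one is too small.
print(merge_synteny_blocks([(0, 10, 5), (25, 40, 6), (100, 120, 8), (200, 202, 2)]))
```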
To search for the Brassica ancestral common gene sets (Supplemental Table 29), we
performed pairwise synteny comparisons among the genomes (BraA, BniB, BjuA, BjuB, BolC,
BnaA, BnaC, Ath), collecting a set of syntenic matches for each species. All lost-gene
identifications were based on the Brassica ancestral common gene sets of each
species. We focused on gene sets located within the identified syntenic blocks
between BraA and BjuA. If we could not find an annotated gene within the syntenic
blocks, we searched the gene CDS sequence against the entire BjuA genome using
BLASTN (E-value = 0.01, identity = 90%). A gene without an ortholog in BjuA was
regarded as an ancestral BraA gene lost in the BjuA subgenome. This procedure
allowed confident filtering of candidate lost genes, where one BjuA homeologous gene
copy or one parental gene copy was missing at the DNA sequence level from the genome
assemblies. The best BLASTN DNA sequence match, found elsewhere in the genome,
was the corresponding homeolog (if in the BjuA genome) or ortholog (if in the BraA genome).
We further studied cases in which the sequence matched outside the syntenic blocks. Cases
that matched other syntenic blocks were identified by the following method. We used a more
appropriate splice-aware aligner, GMAP42, to align the diploid coding sequences in the
syntenic region and checked whether the aligned ancestral gene model retained a complete
open reading frame. If BLASTN DNA sequence matches were found at orthologous
positions with no annotation, or a gene could be predicted by the geneid software43, the
sequences were aligned to the lost gene with BLAT. When the matched sequence covered more
than 70% of the length of the lost gene, the gene was considered predictable. To find real
gene losses, genes for which less than 20% of their own length was matched were
regarded as whole-gene losses with missing DNA sequences. The genes were eventually
labeled as ‘partial loss’ if the mapped gene model lacked a start or stop codon, or
‘pseudogenes’ if there were internal stop codons. Following this stringent analysis, we
found an initial set of 303 candidate lost genes (where the DNA sequence was missing)
and 845 candidate part-loss genes and 583 candidate pseudogenes in the BjuA assembly
as compared to the corresponding parental genome. Similarly, the other subgenomes (BniB,
BjuB, BolC, BnaA, BnaC) from Brassica species were compared against the ancestral
gene sets to identify gene losses (Supplementary Table 22).
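The labelling rule for mapped gene models (partial loss when a start or stop codon is missing, pseudogene when internal stop codons are present) can be sketched as follows. The function name is hypothetical, the standard stop codons are assumed, and checking the partial-loss condition before the pseudogene condition is an assumption about precedence that the text does not specify.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def classify_gene_model(cds):
    """Label a mapped coding sequence as intact, partial loss or pseudogene."""
    cds = cds.upper()
    # Split into complete codons, ignoring a trailing incomplete codon.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    has_start = bool(codons) and codons[0] == "ATG"
    has_stop = bool(codons) and codons[-1] in STOP_CODONS
    if not (has_start and has_stop):
        return "partial loss"
    if any(codon in STOP_CODONS for codon in codons[1:-1]):
        return "pseudogene"
    return "intact"


print(classify_gene_model("ATGTAGAAATAA"))  # internal TAG
```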
Validation of gene loss
To exclude false-positive lost genes, we uniquely mapped raw Illumina reads (~26X) from
BjuA to the ancestral genome BraA. Each of the BjuA “missing syntenic genes” was
classified as:
(a) Deleted: all of the 303 BjuA missing genes identified above (no DNA sequence found)
were carefully checked based on raw sequence read coverage (less than 5% of the
genome sequencing depth) (Supplementary Table 22). This confirmed the highly
confident deletion of 156 genes. These were detected because the average depth after
mapping BjuA raw sequence was lower than expected.
(b) Not deleted, where normal sequence read coverage similar to the average of the
genome was observed, such as truncation and pseudogenization of genes.
We also uniquely mapped raw Illumina RNA-seq reads from BjuA to the BjuA genome.
Each truncated or pseudogenized BjuA gene was confirmed as partially
deleted based on the absence of RNA-seq read coverage. All of the 845 BjuA partial
missing genes identified above (no RNA sequence found) were carefully checked based on
raw RNA-seq read coverage, which identified 349 highly confident
part-loss genes (Supplementary Table 18). Sequence changes resulted in disruption of
open reading frames, and therefore the corresponding gene model was considered “partially
lost”, although remnants of the genes still retain some sequence similarity to the
ancestral genes. Similarly, the other subgenomes (BniB, BjuB, BolC, BnaA, BnaC) from
Brassica species were used to count the number of gene losses (Supplementary Table 22).
We then randomly selected 20 gene loss events (and 20 non-loss events) and validated them
using PCR amplification; most gene loss events were confirmed. The events that were not
confirmed might be caused by missing assembly in the genome or by non-specific
amplification of target genes because of possible homologous genes in the polyploid. The
primers used in this validation are listed in Supplementary Table 34.
Gene expression calculation and homoeolog expression dominance
identification
Gene expression calculation
The clean reads filtered from the raw reads were mapped onto the B. juncea
genome using TopHat244. For reads mapping to multiple locations, up to 200 alignments
were reported by TopHat2. Expression levels of individual genes
were quantified as FPKM values (fragments per kilobase of exon per million fragments
mapped) by Cufflinks45.
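The normalization named here (fragments per kilobase of exon per million fragments mapped) reduces to one line of arithmetic. The sketch below shows only that definition, with hypothetical input values; Cufflinks itself estimates fragment counts with a statistical model rather than by simple counting.

```python
def fragments_per_kb_per_million(fragments, exon_length_bp, total_mapped_fragments):
    """fragments / (exon length in kb) / (mapped fragments in millions)."""
    return fragments * 1e9 / (exon_length_bp * total_mapped_fragments)


# Hypothetical gene: 500 fragments, 2 kb of exon, 25 million mapped fragments.
print(fragments_per_kb_per_million(500, 2_000, 25_000_000))
```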
Homoeolog expression dominance gene between subgenomes of B. juncea
Homoeolog expression bias was analyzed within syntenic gene pairs. Differentially
expressed gene pairs that pass the 2-fold change threshold are regarded as dominant
gene pairs, showing either A dominance or B dominance. The dominant genes are the genes that
are expressed at the relatively higher level in dominant gene pairs, and the lower ones
are subordinate genes. The remaining syntenic gene pairs, which show no dominance, are
classified as neutral genes. The numbers of A dominant gene pairs, B dominant gene pairs
and non-dominant gene pairs are shown in Supplementary Table 27b. To test whether A
dominant gene pairs and B dominant gene pairs occur equally often,
we performed two-sided binomial tests on the dominant gene pairs for all samples54,55
(Supplementary Table 27b).
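The classification and test can be sketched as follows. This is an illustrative sketch: the function names are hypothetical, a strict "greater than 2-fold" comparison is assumed, and the two-sided P-value is computed as the exact sum of the probabilities of all outcomes no more likely than the observed one, which is one common definition of a two-sided binomial test.

```python
from math import exp, lgamma, log


def classify_pair(expr_a, expr_b, fold=2.0):
    """Label a syntenic gene pair by its A/B expression ratio."""
    if expr_a > fold * expr_b:
        return "A dominant"
    if expr_b > fold * expr_a:
        return "B dominant"
    return "neutral"


def binom_pmf(i, n, p):
    """Binomial probability computed in log space to avoid overflow."""
    log_pmf = (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
               + i * log(p) + (n - i) * log(1 - p))
    return exp(log_pmf)


def binomial_two_sided(k, n, p=0.5):
    """Exact two-sided P-value for k successes out of n trials."""
    observed = binom_pmf(k, n, p)
    total = sum(q for q in (binom_pmf(i, n, p) for i in range(n + 1))
                if q <= observed * (1 + 1e-7))
    return min(1.0, total)


# Hypothetical counts: 60 A dominant pairs out of 100 dominant pairs.
p_value = binomial_two_sided(60, 100)
```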
Selective pressure on dominantly expressed genes and subgenomes
SNPs were called by GATK17 for 17 B. juncea accessions with default parameters,
and SNPs with depth < 3X were filtered out (Supplementary Table 25). The CDS sequence set
was then reconstructed for each sample based on the high-quality SNPs. To detect selective
pressure acting on each coding gene, the rates of nonsynonymous (dN) and synonymous (dS)
substitutions (ω = dN/dS) were estimated site-by-site using the YN00 program with default
parameters from the PAML 4.2b package46. The estimation was repeated for each paired gene
set of the 17 samples. All Ka/Ks values of gene pairs were classified into three categories
(dominant genes, subordinate genes and neutral genes). In addition, all Ka/Ks values of
gene pairs were separated by BjuA/BjuB subgenome. To test the statistical significance of
differences between data sets, we performed a permutation test with 1000 permutations.
Boxplots were drawn to study the difference in selective pressure among the three
gene categories and between subgenomes.
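A permutation test with 1000 permutations can be sketched as follows. The text does not state which test statistic was used, so the absolute difference of group means is an assumption made for illustration, and the function name and seed parameter are hypothetical.

```python
import random


def permutation_test(x, y, n_perm=1000, seed=0):
    """Two-sided permutation test on the difference of group means."""
    rng = random.Random(seed)
    observed = abs(sum(x) / len(x) - sum(y) / len(y))
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        # Randomly reassign group labels by shuffling the pooled values.
        rng.shuffle(pooled)
        perm_x, perm_y = pooled[:len(x)], pooled[len(x):]
        diff = abs(sum(perm_x) / len(perm_x) - sum(perm_y) / len(perm_y))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimated P-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical Ka/Ks values for two gene categories.
p_value = permutation_test([0.1] * 10, [0.9] * 10)
```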
Diversification of A-subgenome of B. juncea and B. napus
Phylogenetic reconstruction for A-subgenomes in Brassica
We called variations from resequencing data for the A-subgenomes of 18 B. juncea
accessions (including the B. juncea reference A-subgenome), 5 B. napus accessions
(including 1 B. napus reference A-subgenome), and 27 B. rapa accessions (including 1
B. rapa reference sequence) that cover most subspecies of B. rapa. The B. rapa genome was
used as the reference genome for all resequenced accessions. BWA16 and GATK17 were used to
call SNPs from the resequencing data of the 18 B. juncea accessions and 5 B. napus
accessions with default parameters (Supplementary Table 25). We filtered out SNPs with
depth < 3X. BWA16 and Samtools47 were used to call variations from the resequencing data of
B. rapa with default parameters. Ungenotyped SNP loci were imputed by the KNN algorithm48.
SNPs with MAF > 0.05 were selected for further analysis. In this step, a total of 198,497
SNPs were initially screened out and only non-heterozygous SNPs with integrity > 0.6 were
kept for tree construction. To build the tree, all SNPs from the resequenced samples,
which represent most B. rapa subvarieties, were concatenated into an alignment by
referring to the B. rapa genome. Then the neighbor-joining tree for the A-subgenomes in
the Brassica population was constructed by MEGA v6.0 using the Kimura 2-parameter model
with 1000 bootstraps and default parameters.
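For reference, the Kimura 2-parameter distance between two aligned sequences is d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q), where P and Q are the proportions of transitional and transversional differences. The sketch below computes it for one pair of sequences; the function name is hypothetical, sites with characters other than A, C, G and T are skipped, and this is an illustration of the model, not the MEGA implementation.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance between two aligned DNA sequences."""
    pairs = [(a, b) for a, b in zip(seq1.upper(), seq2.upper())
             if a in "ACGT" and b in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    n = len(pairs)
    differences = sum(1 for a, b in pairs if a != b)
    # A transition stays within purines (A<->G) or pyrimidines (C<->T).
    transitions = sum(1 for a, b in pairs if a != b and
                      ({a, b} <= PURINES or {a, b} <= PYRIMIDINES))
    p = transitions / n
    q = (differences - transitions) / n
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)


print(k2p_distance("AAAAAAAAAA", "GAAAAAAAAA"))  # one transition in 10 sites
```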
Principal component analysis
A total of 198,497 SNPs were initially identified from the A-subgenomes with the same
method and resequenced accessions described above. Only 51,116 high-quality SNPs with
integrity >= 0.8 and MAF >= 0.05 were selected for principal component analysis. The
EIGENSOFT package56 combines functionality from population genetics methods and the
EIGENSTRAT stratification correction method. We used the STRATPCA software from the
EIGENSOFT package to perform principal component analysis with the 51,116 genetic
markers. Principal component analysis showed that the A-subgenomes of vegetable- and
oil-use subvarieties of B. juncea were located near the B. rapa ssp. tricolaris group and
far from the other sub-species of B. rapa, supporting the view that their ancestor evolved
from one variety of B. rapa, B. rapa ssp. tricolaris (Supplementary Figure 13).
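The SNP selection thresholds (integrity >= 0.8 and MAF >= 0.05) can be sketched as a per-locus filter. Here "integrity" is interpreted as the fraction of samples with a called genotype, which is an assumption about the term as used in this note; the function names and the 0/1/2/None genotype encoding are hypothetical.

```python
def snp_stats(genotypes):
    """Return (integrity, minor allele frequency) for one SNP locus.

    genotypes: per-sample alternate-allele counts (0, 1 or 2), None if missing.
    """
    called = [g for g in genotypes if g is not None]
    integrity = len(called) / len(genotypes)
    if not called:
        return integrity, 0.0
    alt_freq = sum(called) / (2 * len(called))
    return integrity, min(alt_freq, 1 - alt_freq)


def keep_snp(genotypes, min_integrity=0.8, min_maf=0.05):
    """Apply the integrity and MAF thresholds to one locus."""
    integrity, maf = snp_stats(genotypes)
    return integrity >= min_integrity and maf >= min_maf


print(keep_snp([0, 0, 1, 2, None]))
```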
Characteristics of SNP variations from A-subgenomes of B. juncea and B. napus,
and vegetable- and oil-use B. juncea
The B. rapa genome was taken as the reference for SNP calling. A total of 4,589,419 SNPs
from 18 B. juncea samples (11 vegetable- and 7 oil-use) and 5 B. napus samples were
simultaneously identified using the same method described above. To compare the
characteristics of the SNPs from B. juncea and B. napus, 6 B. juncea samples, including 3
vegetable- and 3 oil-use samples (CN53, CN58, CN04 and CN02, EU07, AU213), and 5 B.
napus samples were set as the B. juncea and B. napus groups, respectively. For the
characteristics of vegetable- and oil-use B. juncea, the 11 vegetable- and 7 oil-use B.
juncea samples were set as the vegetable- and oil-use groups. We only kept SNP loci
with full integrity (integrity = 1) for further analysis. Finally, a total of 992,788 SNPs
for the B. juncea and B. napus groups, and 1,716,765 SNPs for the vegetable- and oil-use
groups were considered. A dominant SNP (polymorphic SNP) of the B. juncea and B. napus
groups was defined as one for which the allele frequency was >= 60% in the B. juncea and
B. napus groups and the allele differed from the reference. A B. juncea specific SNP
(fixed SNP) was defined as one for which the
allele frequency was >= 60% in the B. juncea subgroup, the genotype differed
between the B. juncea and B. napus subgroups, and the allele differed from the reference.
Four allele frequency thresholds (60%, 70%, 80%, 90%) were applied to the SNPs. The same
strategy was used for the vegetable- and oil-use dominant and specific SNP analysis
(Supplementary Figure 14).
Formation time estimation for B. juncea and B. napus
The average CDS length of B. juncea and its progenitors, and of B. napus and its progenitors, is
around 1,000 bp (Supplementary Table 29). One mutation in a CDS therefore gives a Ks value of
about 0.003, corresponding to approximately 0.1 Mya. For the Ks distributions of BraA vs BjuA,
BniB vs BjuB, BraA vs BnaA and BolC vs BnaC, artificial peaks would therefore be found in the
Ks-distribution plot, which may mislead the divergence time calculation (Supplementary Fig.
12). To validate this assumption, we selected all syntenic gene pairs that had one synonymous
substitution site and calculated their Ks with PAML and KaKs_calculator. The result
demonstrated that the Ks method is not appropriate for divergence time estimation over a
short period.
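The arithmetic behind the figures quoted above can be reproduced with a short sketch. Two inputs are assumptions, because the text does not state them: that roughly one third of CDS sites are synonymous, and a synonymous substitution rate of 1.5e-8 per site per year. They were chosen here only because they reproduce the quoted values of 0.003 and 0.1 Mya; the function names are hypothetical.

```python
def ks_for_single_substitution(cds_length_bp, synonymous_fraction=1 / 3):
    """Ks implied by exactly one synonymous substitution in a CDS.

    synonymous_fraction is an assumed share of sites that are synonymous.
    """
    synonymous_sites = cds_length_bp * synonymous_fraction
    return 1.0 / synonymous_sites


def divergence_time_mya(ks, rate_per_site_per_year=1.5e-8):
    """Convert Ks to divergence time with T = Ks / (2 * r), in million years.

    rate_per_site_per_year is an assumed value, not taken from the text.
    """
    return ks / (2.0 * rate_per_site_per_year) / 1e6


ks_step = ks_for_single_substitution(1000)   # about 0.003
time_step = divergence_time_mya(ks_step)     # about 0.1 Mya
print(round(ks_step, 4), round(time_step, 3))
```

Because Ks can only change in steps of about 0.003 for a 1,000 bp CDS, estimated times for recent events fall on multiples of about 0.1 Mya, which is the source of the artificial peaks described above.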
The Ks method is thus inherently flawed for calculating divergence times of newly formed
species, such as the subgenomes of B. juncea and their parents. To estimate the formation
time of B. juncea, we first selected BjuA and its closest relative genome, and BjuA and the
earliest divergent B. juncea accessions, with reference to the phylogenetic tree of the
B. rapa population (Figure 2b). The same strategy was applied to B. napus. We then
reconstructed CDS sequences for the selected samples from resequencing data. After multiple
sequence alignment with MUSCLE v3.3⁴⁹, a phylogenetic tree was constructed and divergence
time was estimated by Bayesian MCMC analyses in BEAST v1.8⁵⁰ with the JIT nucleotide
substitution model, a relaxed log-normal clock model, 1 million MCMC generations with
parameters sampled every 1,000 generations, and default values for the other parameters.
A calibration time (4.6±0.5 MYA) for B. oleracea from a previous publication⁵¹ was
adopted as the outgroup calibration point to estimate the divergence times for B. juncea
and B. napus. We took the divergence time between BjuA and its closest relative genome
(tricolaris, red bold line in Figure 2b) as the upper limit of the formation time, and the
divergence time between BjuA and the earliest divergent B. juncea accessions (B. juncea,
red bold line in Figure 2c) as the lower limit of the formation time.
Accordingly, we took the divergence time between BnaA and its closest relative genome
(European rapa, blue bold line in Figure 3a) as the upper limit of the formation time, and
that between BnaA and the earliest divergent B. napus accessions (B. napus, blue bold line
in Figure 2b) as the lower limit of the formation time (Figure 2c).
Detection of selective sweep signals
Average pairwise diversity (π) and the population differentiation statistic (Fst) were
calculated based on the reference SNPs using the Genome Analysis Toolkit (GATK)
V3.4⁵² with default parameters. Selective sweep regions were identified in the 10
vegetable-use and 7 oil-use B. juncea sub-varieties by combining Fst outliers and π ratio
outliers (θπ, vegetable-use/oil-use). π ratios and Fst were calculated in 100 kb sliding
windows with 10 kb steps. Genomic windows in which the average Fst fell in the top 5% of
the empirical Fst distribution were defined as Fst outliers. Similarly, genomic windows in
which the π ratio fell in the top 5% of the empirical π ratio distribution were defined as
π ratio outliers. Adjacent windows extended to 10 kb likely represent the effect of a single
divergence region and thus were linked to define a ‘candidate gene region’
(Supplementary Tables 29, 30 and 31).
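The window-based outlier scan described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: per-window Fst and π ratio values are taken as already computed inputs, "combining" the two outlier sets is read as taking their intersection, and windows are merged when they overlap or lie within 10 kb of each other. All function names are hypothetical.

```python
import math


def sliding_windows(chrom_len, size=100_000, step=10_000):
    """Yield half-open (start, end) windows of `size` bp moved by `step` bp."""
    start = 0
    while True:
        end = min(start + size, chrom_len)
        yield (start, end)
        if end >= chrom_len:
            break
        start += step


def top_cutoff(values, fraction=0.05):
    """Smallest value that still belongs to the top `fraction` of `values`."""
    ranked = sorted(values, reverse=True)
    k = max(1, math.floor(len(ranked) * fraction))
    return ranked[k - 1]


def candidate_regions(wins, fst, pi_ratio, fraction=0.05, max_gap=10_000):
    """Merge windows that are outliers for both Fst and the pi ratio."""
    fst_cut = top_cutoff(fst, fraction)
    pi_cut = top_cutoff(pi_ratio, fraction)
    hits = [w for w, f, p in zip(wins, fst, pi_ratio)
            if f >= fst_cut and p >= pi_cut]
    regions = []
    for start, end in sorted(hits):
        if regions and start - regions[-1][1] <= max_gap:
            # Overlapping or nearby window: extend the current region.
            regions[-1][1] = max(regions[-1][1], end)
        else:
            regions.append([start, end])
    return [tuple(r) for r in regions]


# Toy example on a hypothetical 500 kb chromosome with two elevated windows.
wins = list(sliding_windows(500_000))
fst = [0.1] * len(wins)
pi_ratio = [1.0] * len(wins)
fst[5], fst[6] = 0.9, 0.8
pi_ratio[5], pi_ratio[6] = 3.0, 2.5
print(candidate_regions(wins, fst, pi_ratio))
```

Using an empirical top-5% cutoff rather than a fixed Fst value keeps the scan relative to the genome-wide background of the two groups being compared.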
SUPPLEMENTARY REFERENCES
1. Li, R. et al. The sequence and de novo assembly of the giant panda genome. Nature
463, 311–317 (2010).
2. Ohmido, N. et al. Quantification of total genomic DNA and selected repetitive
sequences reveals concurrent changes in different DNA families in indica and japonica
rice. Mol. Gen. Genet. 263, 388–394 (2000).
3. International Rice Genome Sequencing. The map-based sequence of the rice genome.
Nature 436, 793–800 (2005).
4. Arumuganathan, K. & Earle, E.D. Nuclear DNA content of some important plant
species. Plant Mol. Biol. Rep. 9, 208–218 (1991).
5. Johnston, J.S. et al. Evolution of genome size in Brassicaceae. Ann. Bot. 95, 229–235
(2005).
6. Haas, B.J. et al. De novo transcript sequence reconstruction from RNA-seq using the
Trinity platform for reference generation and analysis. Nat. Protoc. 8, 1494–1512
(2013).
7. Lee, H. et al. Error correction and assembly complexity of single molecule sequencing
reads. BioRxiv DOI, 10.1101/006395 (2014).
8. Maccallum, I. et al. ALLPATHS 2, small genomes assembled accurately and with high
continuity from short paired reads. Genome Biol. 10, R103 (2009).
9. English, A.C. et al. Mind the gap, upgrading genomes with Pacific Biosciences RS
long-read sequencing technology. PLoS One 7, e47768 (2012).
10. Luo, R. et al. SOAPdenovo2, an empirically improved memory–efficient short–read de
novo assembler. Giga Sci. 1, 18 (2012).
11. Parra, G., Bradnam, K. & Korf, I. CEGMA, a pipeline to accurately annotate core genes
in eukaryotic genomes. Bioinformatics 23, 1061–1067 (2007).
12. Ye, Y.N., Hua, Z.G., Huang, J., Rao, N. & Guo, F.B. CEG, a database of essential gene
clusters. BMC Genomics 14, 769 (2013).
13. Gu, S., Fang, L. & Xu, X. Using SOAPaligner for short reads alignment. Curr. Protoc. in
Bioinformatics DOI, 10.1002/0471250953 (2013).
14. Huang, X. et al. High-throughput genotyping by whole-genome resequencing. Genome
Res. 19, 1068–1076 (2009).
15. Mun, J. et al. Construction of a reference genetic map of Raphanus sativus based on
genotyping by whole-genome resequencing. Theor. Appl. Genet. 128, 259–272 (2015).
16. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler
transform. Bioinformatics 25, 1754–1760 (2009).
17. DePristo, M.A. et al. A framework for variation discovery and genotyping using
next-generation DNA sequencing data. Nat. Genet. 43, 491–498 (2011).
18. Fragoso, C.A., Heffelfinger, C., Zhao, H. & Dellaporta, S.L. Imputing genotypes in
biallelic populations from low-coverage sequence data. Genetics 202, 487–495
(2016).
19. Liu, D.Y. et al. Construction and analysis of high–density linkage map using
high-throughput sequencing data. PLoS One 9, e98855 (2014).
20. Zou, J. et al. Co-linearity and divergence of the A subgenome of Brassica
juncea compared with other Brassica species carrying different A subgenomes. BMC
Genomics 17, 18 (2016).
21. Tang, H. et al. ALLMAPS, robust scaffold ordering based on multiple maps. Genome
Biol. 16, 3 (2015).
22. Price, A.L., Jones, N.C. & Pevzner, P.A. De novo identification of repeat families in
large genomes. Bioinformatics 21 S1, i351–i358 (2005).
23. Xu, Z. & Wang, H. LTR_FINDER, an efficient tool for the prediction of full-length LTR
retrotransposons. Nucleic Acids Res. 35, W265–268 (2007).
24. Han, Y. & Wessler, S.R. MITE-Hunter, a program for discovering miniature
inverted-repeat transposable elements from genomic sequences. Nucleic Acids Res.
38, e199 (2010).
25. Edgar, R.C. & Myers, E.W. PILER, identification and classification of genomic repeats.
Bioinformatics 21 S1, i152–i158 (2005).
26. Wicker, T., Matthews, D.E. & Keller, B. TREP, a database for Triticeae repetitive
elements. Trends Plant Sci. 7, 561–562 (2002).
27. Bao, W., Kojima, K.K. & Kohany, O. Repbase Update, a database of repetitive
elements in eukaryotic genomes. Mobile DNA 6, 11 (2015).
28. Chen, N. in Current protocols in Bioinformatics Version 1. (eds Andreas D Baxevanis)
1–14 (John Wiley & Sons, Inc, 2004).
29. Elsik, C.G. et al. Creating a honey bee consensus gene set. Genome Biol. 8, R13
(2007).
30. Birney, E., Clamp, M. & Durbin, R. GeneWise and Genomewise. Genome Res. 14,
988–995 (2004).
31. Burge, C. & Karlin, S. Prediction of complete gene structures in human genomic DNA.
J. Mol. Biol. 268, 78–94 (1997).
32. Allen, J.E., Majoros, W.H., Pertea, M. & Salzberg, S.L. JIGSAW, GeneZilla, and
GlimmerHMM, puzzling out the features of human genes in the ENCODE regions.
Genome Biol. 7 S1, S9 1–13 (2006).
33. Trapnell, C. et al. Differential gene and transcript expression analysis of RNA-seq
experiments with TopHat and Cufflinks. Nature Protoc. 7, 562–578 (2012).
34. Hunter, S. et al. InterPro in 2011, new developments in the family and domain
prediction database. Nucleic Acids Res. 40, D306–312 (2012).
35. Hunter, S. et al. InterPro, the integrative protein signature database. Nucleic Acids Res.
37, D211–215 (2009).
36. Lowe, T.M. & Eddy, S. tRNAscan-SE, a program for improved detection of transfer RNA
genes in genomic sequence. Nucleic Acids Res. 25, 955–964 (1997).
37. Nawrocki, E.P. & Eddy, S.R. Infernal 1.1, 100-fold faster RNA homology searches.
Bioinformatics 29, 2933–2935 (2013).
38. Nawrocki, E.P. et al. Rfam 12.0, Updates to the RNA Families Database. Nucleic Acids
Res. 43, D130–D137 (2015).
39. Kielbasa, S.M., Wan, R., Sato, K., Horton, P. & Frith, M.C. Adaptive seeds tame
genomic sequence comparison. Genome Res. 21, 487–493 (2011).
40. Tang, H. et al. Screening synteny blocks in pairwise genome comparisons through
integer programming. BMC Bioinformatics 12, 102 (2011).
41. Wang, X. et al. The genome of the mesopolyploid crop species Brassica rapa. Nat.
Genet. 43, 1035–1039 (2011).
42. Wu, T.D. & Watanabe, C.K. GMAP, a genomic mapping and alignment program for
mRNA and EST sequences. Bioinformatics 21, 1859–1875 (2005).
43. Blanco, E., Parra, G. & Guigo, R. in Current Protocols in Bioinformatics Version 2
(editorial board, Andreas D Baxevanis) 1–28 (John Wiley & Sons, Inc, 2007).
44. Kim, D. et al. TopHat2, accurate alignment of transcriptomes in the presence of
insertions, deletions and gene fusions. Genome Biol. 14, R36 (2013).
45. Trapnell, C. et al. Transcript assembly and quantification by RNA-Seq reveals
unannotated transcripts and isoform switching during cell differentiation. Nat. Biotech.
28, 511–515 (2010).
46. Yang, Z. PAML 4, phylogenetic analysis by maximum likelihood. Mol. Biol. Evol. 24,
1586–1591 (2007).
47. Li, H. et al. The sequence alignment/map format and SAMtools. Bioinformatics 25,
2078–2079 (2009).
48. Chen, W. et al. Genome-wide association analyses provide genetic and biochemical
insights into natural variation in rice metabolism. Nat. Genet. 46, 714–721 (2014).
49. Edgar, R.C. MUSCLE, multiple sequence alignment with high accuracy and high
throughput. Nucleic Acids Res. 32, 1792–1797 (2004).
50. Drummond, A.J., Suchard, M.A., Xie, D. & Rambaut, A. Bayesian phylogenetics with
BEAUti and the BEAST 1.7. Mol. Biol. Evol. 29, 1969–1973 (2012).
51. Liu, S. et al. The Brassica oleracea genome reveals the asymmetrical evolution of
polyploid genomes. Nat. Commun. 5, 3930 (2014).
52. McKenna, A. et al. The Genome Analysis Toolkit, a MapReduce framework for
analyzing next–generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).
53. Chaisson, M.J. & Tesler, G. Mapping single molecule sequencing reads using
Basic Local Alignment with Successive Refinement (BLASR), theory and application.
BMC Bioinformatics 13, 238 (2012).
54. Cheng, F. et al. Biased gene fractionation and dominant gene expression among the
subgenomes of Brassica rapa. PLoS One 7, e36442 (2012).
55. Schnable, J.C., Springer, N.M. & Freeling, M. Differentiation of the maize subgenomes
by genome dominance and both ancient and ongoing gene loss. Proc. Natl. Acad. Sci.
USA 108, 4069–4074 (2011).
56. Price, A.L. et al. Principal components analysis corrects for
stratification in genome-wide association studies. Nat. Genet. 38, 904–909 (2006).
SUPPLEMENTARY FIGURES
Figure 1. K-mer 17 distribution of B. juncea
Figure 2. Plot of sub-reads length distribution of PacBio sequencing data.
Figure 3. Scaffold chimera error sketch map during genome assembly
Figure 4. Pseudo-chromosomes of B. juncea genome. Pseudo-chromosomes were
constructed from two genetic maps, T84-DTC and SY-PM, using ALLMAPS with equal
weights. Green lines connect the B. juncea genome scaffolds to the linkage groups of
T84-DTC and yellow lines connect the B. juncea genome scaffolds to the linkage groups
of SY-PM.
Figure 5. K-mer 17 distribution of B. nigra.
Figure 6. Genome assembly assessment using PacBio sub-reads, BAC sequences and
paired-end reads. a, Alignment of PacBio sub-reads and paired-end reads (insert sizes: 4Kb,
5Kb, 10Kb and 15Kb) to the B. juncea genome sequence. b, Alignment of BACs and
paired-end reads (insert sizes: 4Kb, 5Kb, 10Kb and 15Kb) to the B. nigra genome
sequence.
Figure 7a. Comparison of the main transposable element (TE) types in syntenic region of
B. juncea subgenomes and their ancestors (B. rapa and B. nigra)
Figure 7b. Comparison of the main transposable element types in syntenic region of B.
napus subgenomes and their ancestors (B. rapa and B. oleracea).
Figure 8. Procedure for identifying newly formed transposable elements (TEs).
Figure 9. PCR amplification of newly identified transposable elements (TEs) in B. rapa, B.
nigra and B. juncea.
Figure 10a. Comparison of newly formed transposable elements (TEs) in B. juncea
subgenomes and their ancestors (B. rapa and B. nigra).
Figure 10b. Comparison of newly formed transposable elements (TEs) in B. napus
subgenomes and their ancestors (B. rapa and B. oleracea).
Figure 11. Gene loss validation using PCR amplification.
Figure 12a. Gene ontology of lost genes from the BjuA of B. juncea.
Figure 12b. Gene ontology of lost genes from the BjuB of B. juncea.
Figure 13. a
Figure 14. Characteristics of SNP variations from A subgenomes of B. juncea and B.
napus, and vegetable- and oil-use B. juncea.
Figure 15. Estimate of molecular divergence between two B. juncea subgenomes, two B.
napus subgenomes and their progenitors (B. rapa, B. nigra, B. oleracea).
Figure 16. Venn diagram of homoeolog expression dominance in four different
developmental stages of B. juncea. After seeding on Oct. 5th, the stem of yongan1 was
collected 18 weeks after seeding (blue), the stem of yongan2 was collected 20 weeks
after seeding (orchid), the stem of yongan3 was sampled 22 weeks after seeding (green)
and the stem of yongan4 was obtained 24 weeks after seeding (yellow). The total number
of dominance genes in each area is indicated in black. Red numbers indicate BjuA genes
and blue numbers indicate BjuB genes.
Figure 17. Venn diagram of homoeolog expression dominance in different tissues of B.
juncea. Seed coat, stems and stems from mutants are represented by blue, green and
yellow, respectively. The stems of both daye3bianzhong and yongan3 were sampled 22
weeks after seeding, which is one week before the start of inflation. The total numbers of
BjuA and BjuB genes are shown in red and blue, respectively. The total number in each area
is indicated in black.
Figure 18. KEGG analysis of genes exhibiting homoeolog expression dominance in B.
juncea.
Figure 19. Boxplot of the distribution of Ka values between subgenomes (BjuA and BjuB)
and among homoeolog expression dominance genes as dominant, subordinate and
neutral (non-dominant) in B. juncea. We performed a permutation test with 1000
permutations to assess statistical significance of difference (P < 0.001).
Figure 20. Genome-wide homoeolog expression dominance in B. juncea. All numbers
represent the IDs of specific genes located in regions identified in the selective
sweep analysis between vegetable- and oil-use sub-varieties. Genes involved in
glucosinolate and lipid processes are marked with different colored triangles. Genes
expressed dominantly in either subgenome are marked with a black arrow, and
gene losses are marked with blue rectangles.
Figure 21. Heat map for genes involved in auxin and ethylene signal pathways in
vegetable-use (highlighted in green) and oil-use (highlighted in orange) sub-varieties of B.
juncea from the RNA-Seq data.
SUPPLEMENTARY TABLES
Table 1. Genome estimation by flow cytometry for B. juncea
Control/sample                      Peak value     CV (%)       Genome size (Mb)
O. sativa (Nipponbare)/B. juncea    30.82/71.55    6.45/6.52    903.1
O. sativa (Nipponbare)/B. juncea    27.63/65.85    7.81/7.76    927.1
O. sativa (Nipponbare)/B. juncea    29.8/71.71     5.33/5.58    936.1
Mean                                                            922.1
Table 2a. Summary of genome sequencing strategy for B. juncea
Paired-end library   Insert size   Total data (Gb)   Depth (×)*   Q20 (%)
Illumina reads       180bp         22.56             26.2         92.53
                     250bp         39.55             46.0         95.4
                     500bp         14.09             16.4         92.46
                     3Kbp          15.24             17.7         95.54
                     3Kbp          16.43             19.1         95.25
                     5Kbp          17.18             20.0         78.52
                     8Kbp          7.81              9.1          98.14
                     8Kbp          3.17              3.7          78.38
                     10Kbp         2.71              3.2          79.84
                     10Kbp         4.97              5.8          81.69
                     15Kbp         1.83              2.1          79.1
                     15Kbp         2.35              2.7          81.11
                     17Kbp         3.29              3.8          80.74
Total                              151.19            175.8        91.09
Note: The estimated genome size was 0.86 Gb.
Table 2b. Statistic of PacBio sub-reads length distribution
Total read number          1,053,835
Base number (bp)           11,088,237,501
Depth (X)                  12.03
Sub-reads N50 (bp)         13,981
Average sub-reads (bp)     10,522
Longest sub-reads (bp)     74,870
Table 3. Summary of B. juncea genome assembly
Strategy / category   Contig   Scaffold
Hiseq sequencing
Total length (bp) 640,594,512 701,290,321
Total number 48,985 11,891
Max length (bp) 326,883 6,107,082
N50 size (bp) 28,225 710,138
N90 size (bp) 6,024 83,000
Hiseq sequencing + PacBio
Total length (bp) 760,709,244 784,227,516
Total number 32,581 10,784
Max length (bp) 569,668 4,561,631
N50 size (bp) 61,273 855,041
N90 size (bp) 12,728 94,898
Hiseq sequencing + PacBio + BioNano
Total length (bp) 955,000,958 (gap 194,291,714)
Total number 10,581
Max length (bp) 7,842,264
N50 size (bp) 1,523,604
N90 size (bp) 124,389
Note: Scaffolds length less than 1000 bp are excluded.
Table 4. Summary of BioNano data collection and assembly statistics
                               Single molecules (> 150 kb)   Map assembly
No. molecule/genome maps       996,648                       922
Total length                   205 Gb                        1,101 Mb
Coverage                       222 X                         -
Molecule/map N50               217 Kb                        1.84 Mb
Longest molecule/map (bp)      3,174,225                     11,038,396
Average molecule/scaffold      206 Kb                        1.19 Mb
Table 6. Summary of genetic map of B. juncea from resequencing of F2 population
Linkage group   Chromosome   Marker number   Total genetic distance (cM)   Average genetic distance (cM)
A01             J01          369             330.59                        0.9
A02             J02          259             232.13                        0.9
A03             J03          412             293.09                        0.71
A04             J04          290             195.34                        0.67
A05             J05          299             237.3                         0.79
A06             J06          231             222.9                         0.96
A07             J07          231             218.12                        0.94
A08             J08          243             215.23                        0.89
A09             J09          497             431.52                        0.87
A10             J10          156             103.02                        0.66
B01             J11          164             166.91                        1.02
B02             J12          455             406.05                        0.89
B03             J13          399             414.27                        1.04
B04             J14          170             173.91                        1.02
B05             J15          358             346.55                        0.97
B06             J16          182             137.28                        0.75
B07             J17          226             190.53                        0.84
B08             J18          392             395.51                        1.01
total           total        5,333           4,710.25                      0.88
Table 7. Summary of a published genetic map of B. junceaᵃ
Linkage group   Chromosome   Marker number   Total genetic distance (cM)   Average genetic distance (cM)
A01             J01          96              89.7                          0.93
A02             J02          102             69.24                         0.68
A03             J03          103             94.93                         0.92
A04             J04          61              63.3                          1.04
A05             J05          82              93.53                         1.14
A06             J06          98              98.02                         1
A07             J07          99              71.5                          0.72
A08             J08          88              64.11                         0.73
A09             J09          110             107.97                        0.98
A10             J10          71              68.42                         0.96
B01             J11          62              56.42                         0.91
B02             J12          120             103.59                        0.86
B03             J13          80              59.83                         0.75
B04             J14          86              115.44                        1.34
B05             J15          66              102.17                        1.55
B06             J16          54              75.79                         1.4
B07             J17          139             102.24                        0.74
B08             J18          114             123.84                        1.09
total                        1,631           1,560.04                      0.96
a. Data from a published genetic map of B. juncea (Zou et al., BMC Genomics, 2016, 17: 18).
Table 8a. Statistics of B. juncea pseudo-chromosomes
Subgenome         Chromosome ID   Chromosome size (Mbp)   Gaps (%)   Anchored percentage* (%)   Pearson correlation coefficient
                                                                                                T84-DTC   SY-PM
BjuA (402.12 Mb)  J01             45.32                   26.62      11.27                      0.96      0.99
                  J02             25.47                   11.80      6.33                       0.96      0.90
                  J03             43.96                   8.48       10.93                      0.93      0.99
                  J04             33.18                   11.31      8.25                       0.94      0.97
                  J05             38.84                   23.74      9.66                       0.85      0.99
                  J06             38.05                   12.63      9.46                       0.90      0.99
                  J07             28.55                   18.69      7.10                       0.93      0.97
                  J08             29.35                   9.15       7.30                       0.93      0.99
                  J09             62.78                   17.48      15.61                      0.97      0.98
                  J10             22.35                   6.53       5.56                       0.94      0.97
                  NA              34.27                   7.38
                  total           367.85                  15.50      91.48
BjuB (547.53 Mb)  J11             32.47                   26.12      5.93                       0.89      0.99
                  J12             60.37                   22.17      11.03                      0.70      0.97
                  J13             83.57                   36.79      15.26                      0.60      0.99
                  J14             28.34                   4.80       5.18                       0.96      0.97
                  J15             50.57                   20.32      9.24                       0.79      0.99
                  J16             18.74                   1.20       3.42                       0.84      0.99
                  J17             44.22                   28.69      8.08                       0.72      0.97
                  J18             77.67                   36.23      14.19                      0.66      0.99
                  NA              151.58                  19.38
                  total           395.95                  26.59      72.32                      0.96      0.99
Unknown (5.25 Mbp)
B. juncea (954.90 Mb)             763.80                             79.99
* Percentage of scaffold length anchored to pseudochromosomes. NA: non-anchored
Table 8b. Statistics of B. nigra pseudochromosomes
Genome                 Chromosome ID   Chromosome size (Mb)   Gaps (%)   Anchored percentage* (%)   Pearson correlation coefficient
                                                                                                    T84-DTC   SY-PM
B. nigra (402.05Mbp)   B01             24.69                  8.12       6.14                       0.90      0.8
                       B02             37.16                  9.26       9.24                       0.77      0.99
                       B03             40.13                  10.11      9.98                       0.75      0.96
                       B04             27.51                  6.57       6.84                       0.85      0.99
                       B05             33.55                  11.10      8.34                       0.80      0.95
                       B06             24.10                  5.67       5.99                       0.90      0.99
                       B07             32.34                  5.88       8.04                       0.97      0.97
                       B08             47.31                  8.09       11.77                      0.86      0.98
                       NA              130.07                 14.95
                       total           266.79                 8.30       66.36
* Percentage of scaffold length anchored to pseudochromosomes. NA: non-anchored
Table 9. Summary of subgenome size in B. juncea, B. nigra and B. rapaᵃ
                   B. juncea                                  B. rapa       B. nigra
                   BjuA           BjuB           Unknown      BraA          BniB
genome size (Mb)   402.12         547.53         5.25         283.82        396.86
gene number        40,256         39,414         380          41,020        47,974
TE content (bp)    115,995,819    197,386,909    2,737,168    82,607,905    149,046,980
a. B. rapa genome was from Wang et al.
Table 10. Summary of genome sequencing strategy for B. nigra
Paired-end library   Insert size   Total data (Gb)   Depth (×)*   Q20 (%)
Illumina reads       180bp         11.17             18.6         95.98
                     200bp         11.55             19.3         96.23
                     300bp         4.59              7.6          94.28
                     400bp         5.87              9.8          92.02
                     3Kbp          7.19              12.0         79.1
                     4Kbp          4.76              7.9          79.15
                     5Kbp          4.77              8.0          80.41
                     8Kbp          1.37              2.3          81.07
                     10Kbp         4.03              6.7          80.56
                     15Kbp         2.28              3.8          81.56
Total                              57.59             95.99        88.75
Note: The estimated genome size was 0.591 Gb.
Table 11. Summary of B. nigra genome assembly
Category BniB
Contig Scaffold
Total length (bp) 354,586,867 396,857,455
Total number 25,103 5,120
Max length (bp) 332,462 5,330,927
N50 size (bp) 31,119 557,272
N90 size (bp) 6,778 66,961
Note: Scaffolds length less than 1000 bp were excluded
Table 12. PacBio sub-reads validation for B. juncea genome assembly
Sub-reads ID   Sub-read length (bp)   Alignment length (bp)   Coverageᵃ (%)   Identityᵇ (%)
1              46984                  46984                   100.00%         96.39%
2              45676                  45676                   100.00%         96.13%
3              44076                  42926                   97.39%          99.47%
4              42564                  42490                   99.83%          95.68%
5              42545                  42545                   100.00%         99.14%
6              42331                  42331                   100.00%         99.39%
7              41857                  41857                   100.00%         98.69%
8              41658                  41658                   100.00%         99.92%
9              41653                  41653                   100.00%         99.97%
10             41529                  41529                   100.00%         99.89%
a. Coverage = Alignment length / sub-reads length;
b. Identity = Identical length / Alignment length.
Table 13. BAC validation for B. nigra genome assembly
Accession    BAC length (bp)   Number of Nᵃ   Status    Alignment length (bp)   Coverageᵇ (%)   Identityᶜ (%)
KC795992.1   51,800            0              random    51,800                  100.00          99.97
KC795993.1   73,424            500            ordered   70,562                  96.76           99.80
KC795994.1   47,925            0              random    47,925                  100.00          99.95
KC795995.1   102,525           0              random    102,098                 99.58           99.40
KC795996.1   163,490           200            ordered   163,250                 99.98           99.99
KC795997.1   51,796            0              random    51,796                  100.00          99.97
KC795998.1   73,803            0              random    72,848                  98.71           99.99
KC795999.1   42,114            0              random    42,114                  100.00          100.00
KC796000.1   53,921            0              random    53,059                  98.40           99.93
KC796001.1   85,295            0              random    84,340                  98.88           99.99
KC796002.1   73,752            0              random    72,797                  98.71           99.99
KC796003.1   70,444            0              random    70,444                  100.00          99.99
KC796004.1   118,270           900            ordered   107,447                 91.55           98.16
KC796005.1   71,086            100            ordered   70,986                  100.00          99.98
KC796006.1   56,764            0              random    53,933                  95.01           99.98
a. All BACs were from Sharma et al., PloS One, 2014, 9(4): e93260.
b. BACs with N gap indicate they are 'working draft' sequence. They consist of ordered contigs and concatenate with constant Ns.
c. Coverage = Alignment length/(BAC length - N number).
d. Identity = Identical length/Alignment length.
Table 14a. Completeness inspection based on CEG database for B. juncea and B. nigra
Species     Number of CEGsᵃ   Percent of CEGs   Number of conserved CEGsᵇ   Percent of conserved CEGs
B. juncea   453               98.9 %            245                         98.8 %
B. nigra    458               100 %             248                         100 %
a. number of genes within 458 CEGs presented in assembly results
b. number of genes within 248 highly conserved CEGs presented in assembly results
Table 14b. Genome completeness assessment for B. juncea and B. nigra based on EST dataset
Species     Dataset      Number of EST   Total length of EST (bp)   Sequence covered by one scaffold (>50%)
                                                                    Number    Percent
B. juncea   >=500 bp     23,002          31,628,607                 22,665    98.53%
            >=1000 bp    13,152          24,615,370                 13,053    99.25%
B. nigra    >=500 bp     18,344          25,187,114                 17,878    97.46%
            >=1000 bp    10,729          19,680,640                 10,619    98.97%
Table 16a. Statistics of confident newly formed TEs and common TEs
Categories   B. juncea                             B. napus
             BjuA     BraAᵃ    BjuB     BniB       BnaA     BraAᵇ    BnaC      BolC
Confident    1,108    805      1,063    1,089      977      978      1,267     1,357
Commonᶜ      43,843   38,900   57,980   58,624     37,452   39,682   106,216   111,932
Note: a. Ancestral genome of B. rapa and subgenome of B. juncea. b. Ancestral genome of B. rapa and subgenome of B. napus. c. Same transposable
elements (TEs) found in the syntenic position of the genome.
Table 16b. Classification of confident newly identified TEs
Categories           B. juncea                         B. napus
                     BraAᵃ   BjuA    BniB    BjuB      BraAᵇ   BnaA    BolC    BnaC
ClassI/DIRS          9       17      11      14        12      13      21      25
ClassI/LINE          37      64      45      34        65      70      81      77
ClassI/LTR           8       18      29      24        9       8       18      5
ClassI/LTR/Copia     76      82      107     92        111     61      151     104
ClassI/LTR/Gypsy     53      78      108     105       78      72      145     158
ClassI/PLE|LARD      39      71      61      58        60      67      95      94
ClassI/SINE          24      20      21      18        22      22      45      37
ClassI/SINE|TRIM     0       0       0       0         0       0       0       0
ClassI/TRIM          7       12      10      4         5       7       6       4
ClassI/Unknown       1       1       2       1         3       2       2       3
ClassII/Helitron     7       9       3       0         8       5       14      12
ClassII/MITE         175     223     242     266       200     179     188     211
ClassII/Maverick     1       3       1       2         2       1       2       0
ClassII/TIR          81      115     138     170       114     78      183     149
ClassII/Unknown      11      15      10      19        16      15      24      24
PotentialHostGene    19      7       13      11        10      22      30      35
Unknown              257     373     262     271       263     355     352     329
total                805     1,108   1,063   1,089     978     977     1,357   1,267
Note: a. Ancestral genome of B. rapa and subgenome of B. juncea. b. Ancestral genome of B. rapa and subgenome of B. napus.
Table 17. Statistics of origination of newly identified TEs
Type               B. juncea        B. napus
                   BjuA    BjuB     BnaA    BnaC
Intra-subgenome    738     755      339     651
Inter-subgenome    163     147      23      18
Unknown            207     161      599     570
Note: Three origins of new transposable elements (TEs) were distinguished (Intra-subgenome = internal TEs, Inter-subgenome = TEs deriving from the opposite
subgenome, unknown = origin could not be identified).
Table 18. Summary of predicted genes in B. juncea, B. napus and their ancestors (B. rapa, B. nigra and B. oleracea)
Category                      BraA         BniB         BolC         B. juncea                                            B. napus
                                                                     BjuA         BjuB         Unknown      Total         BnaA         BnaC          Unknown      Total
Gene number                   41,020       49,826       45,758       40,256       39,414       380          80,050        44,452       56,055        533          101,040
Gene length (bp)              82,756,981   95,089,874   80,521,861   86,175,681   82,169,664   516,981      168,862,326   90,932,969   105,697,258   613,454      197,243,681
Average gene length (bp)      2,017.48     1,908.44     1,759.73     2,140.69     2,084.78     1,360.48     2,109.46      2,045.64     1,885.60      1,150.95     1,952.13
Exon number                   206,584      225,634      208,039      188,179      176,679      1,193        366,051       227,265      266,792       1,662        495,719
Exon length (bp)              47,916,296   55,849,862   47,242,763   45,642,107   42,974,822   324,206      88,941,135    47,174,160   53,174,968    312,555      100,661,683
Average exon length (bp)      231.95       247.52       227.09       242.55       243.24       271.76       242.97        207.57       199.31        188.06       203.06
CDS length per gene (bp)      1,168.12     1,120.90     1,032.45     1,133.80     1,090.34     853.17       1,111.07      1,061.24     948.62        586.41       996.26
Exon number per gene          5.04         4.53         4.55         4.67         4.48         3.14         4.57          5.11         4.76          3.12         4.91
Intron number                 165,564      175,808      162,281      147,923      137,265      813          286,001       182,813      210,737       1,129        394,679
Intron length (bp)            34,651,585   39,196,751   33,217,700   41,671,111   40,192,063   237,626      82,100,800    34,635,213   42,663,705    248,973      77,547,891
Average intron length (bp)    209.29       222.95       204.69       281.71       292.81       292.28       287.06        189.46       202.45        220.53       196.48
Intron number per gene        4.04         3.53         3.55         3.67         3.48         2.14         3.57          4.11         3.76          2.12         3.91
Intron length per gene (bp)   844.75       786.67       725.94       1,035.15     1,019.74     625.33       1,025.62      779.16       761.10        467.12       767.50
Table 19. Statistics of genes annotated by different databases
Database       B. juncea total (80,050)   BjuA (40,256)           BjuB (39,414)           BjuO (380)              BniB (49,826)
               Numberᵃ    Percentage      Numberᵃ    Percentage   Numberᵃ    Percentage   Numberᵃ    Percentage   Numberᵃ    Percentage
NR             77,496     96.81%          39,311     97.65%       37,810     95.93%       359        94.47%       47,056     94.44%
SwissProt      57,422     71.73%          29,447     73.15%       27,730     70.36%       232        61.05%       38,135     76.54%
COG            25,522     31.88%          20,588     51.14%       18,905     47.97%       130        34.21%       17,031     34.18%
GO             51,906     64.84%          26,746     66.44%       24,968     63.35%       184        48.42%       39,275     78.82%
KEGG           16,016     20.01%          8,243      20.48%       7,724      19.60%       43         11.32%       9,948      19.97%
Gene numberᵇ   78,290     97.80%          39,669     98.54%       38,244     97.03%       361        95.00%       47,186     94.70%
a. Number of genes that have been annotated in a corresponding database
b. Number of genes that have been annotated in a corresponding genome/subgenome
Table 20a. Summary of unigenes from B. juncea and B. nigra transcriptomes
Unigenes length (bp)   B. juncea                B. nigra
                       Number     Percentage    Number     Percentage
200-300                18,894     34.33%        10,071     29.25%
300-500                13,145     23.88%        6,020      17.48%
500-1000               9,850      17.90%        7,615      22.11%
1000-2000              9,093      16.52%        7,597      22.06%
>2000                  4,059      7.37%         3,132      9.10%
Total                  55,041     100%          34,435     100.00%
Table 20b. Summary of unigene lengths from B. juncea and B. nigra transcriptomes
Species     Total unigenes length (bp)   N50 (bp)   Average unigenes length (bp)
B. juncea   41,166,472                   1,302      747.92
B. nigra    29,912,959                   1,411      868.68
Table 21. Summary of non-coding RNAs and pseudogenes
Class         B. juncea            B. nigra
              Number    Family     Number    Family
lncRNA        21        2          27        21
sRNA          3725      151        80        35
tRNA          2638      56         723       2
rRNA          511       3          62        5
miRNA         1402      830        719       147
snRNA         15418     612        1,395     132
CD-box        10265     332        1,181     92
HACA-box      3164      248        78        31
scaRNA        1189      18         3         1
splicing      800       14         133       8
Pseudogenes   14,676               4,489
Table 22. Statistic of syntenic orthologs missing in Brassica genome/subgenome comparison
Loss type                                 BraA vs BjuA   BniB vs BjuB   BjuA vs BraA   BjuB vs BniB   BraA vs BnaA   BolC vs BnaC   BnaA vs BraA   BnaC vs BolC
Whole genes missing DNA sequences         183(234)       141(189)       184(303)       208(279)       81(86)         24(66)         157(304)       162(276)
Sequence matches on random                101            56             199            200            152            301            2303           2624
Transposition                             1,495          1,085          3,819          2,617          1893           1785           1465           1187
Sequence matches outside synteny blocks   485            386            865            504            313            356            669            770
Synteny-excluded by block                 1,135          823            764            718            538            656            542            866
Gene is predictable                       159            109            429            534            65             317            77             116
Partial loss                              135(194)       112(169)       349(845)       255(650)       172(279)       370(679)       213(469)       199(411)
Pseudogene                                69(106)        29(57)         167(583)       82(351)        205(420)       364(721)       29(74)         27(73)
Gmap failed                               8              4              25             14             9              27             36             42
Total potential loss                      3,917          2,878          7,832          5,867          3755           4908           5939           6365
Note: (1) Numbers outside brackets are validated by sequencing reads. Numbers inside brackets are predicted. (2) We calculated gene loss number including whole genes missing DNA sequences, partial loss and pseudogene. (3) Gmap failed: Genes lost predicted by Gmap.
Table 23. Statistics of syntenic region among A-subgenomes of Brassica
Intra-chromosome Inter-chromosome
Length (Mb) Percent (%) Length (Mb) Percent (%)
BjuA : BraA 34.18 8.50 43.44 10.80
BnaA : BraA 8.64 2.74 1.3 0.41
Note: (1) Intra-chromosome: disordered syntenic regions between homologous chromosomes. (2) Inter-chromosome: syntenic regions between nonhomologous chromosomes.
Table 24. Summary of resequencing of the 17 B. juncea accessions
Accession Sample collection Variety type Reads number Read length (bp) Base Number (bp) Mapped ratio (%) Average Depth (X) Coverage ratio (%)
CN04 China (Zhejiang) vegetable 345,104,000 101 31,059,360,000 94.75 24 90.2
CN18 China (Zhejiang) vegetable 107,030,406 101 10,810,071,006 96.72 8 85.64
CN40 China (Zhejiang) vegetable 114,726,912 101 11,587,418,112 97.67 9 85.1
CN46 China (Sichuan) vegetable 110,960,828 101 11,207,043,628 97.01 8 84.75
CN48 China (Zhejiang) vegetable 106,201,702 101 10,726,371,902 96.81 8 86.62
CN53 China (Zhejiang) vegetable 109,173,560 101 11,026,529,560 97 9 85.05
CN58 China (Sichuan) vegetable 107,415,060 101 10,848,921,060 97.18 8 84.95
CN59 China (Hebei) vegetable 109,780,920 101 11,087,872,920 94.21 8 82.64
CN78 China (Zhejiang) vegetable 123,577,972 101 12,357,797,200 97.8 11 92.27
CN79 China (Zhejiang) vegetable 107,560,118 101 10,863,571,918 95.69 9 83.71
AU213 Australia oilseed 108,148,800 101 10,923,028,800 96.87 8 83.9
CN02 China (Ningxia) oilseed 102,722,874 101 10,375,010,274 97.4 8 82.18
CN74 China (Tibet) oilseed 105,557,312 101 10,661,288,512 96.71 8 83.99
CN77 China (Tibet) oilseed 173,714,344 101 17,545,148,744 96.41 12 85.77
EU07 France oilseed 100,322,358 101 10,132,558,158 95.99 7 82.22
EU11 Ukraine oilseed 103,255,648 101 10,428,820,448 95.24 7 82.38
IN30 India oilseed 93,927,978 101 9,486,725,778 96.99 7 82.2
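As a consistency check on the table above, the Base Number column can be compared with Reads number multiplied by Read length. The sketch below uses four rows copied from Table 24. That Base Number is the product of the other two columns is our assumption, not something the source states, and `implied_read_length` is a hypothetical helper name.

```python
from fractions import Fraction

# Rows copied from Table 24: accession -> (reads, stated read length, bases).
ROWS = {
    "CN04": (345_104_000, 101, 31_059_360_000),
    "CN18": (107_030_406, 101, 10_810_071_006),
    "CN78": (123_577_972, 101, 12_357_797_200),
    "IN30": (93_927_978, 101, 9_486_725_778),
}


def implied_read_length(reads: int, bases: int) -> Fraction:
    """Read length implied by the table, assuming bases = reads * length."""
    return Fraction(bases, reads)


for accession, (reads, stated, bases) in ROWS.items():
    implied = implied_read_length(reads, bases)
    # Flag rows where the implied length is not the stated one.
    flag = "" if implied == stated else "  <- differs from stated length"
    print(f"{accession}: stated {stated} bp, implied {float(implied):.2f} bp{flag}")
```

For these rows, CN18 and IN30 agree with the stated 101 bp, whereas the CN04 and CN78 entries imply 90 bp and 100 bp. The source gives no explanation for the difference, so it is noted here only as an observation.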
Table 25. Summary of SNP variations in 17 B. juncea accessions, 4 B. napus accessions and 26 B. rapa accessions
Species Samples Main_usage Hetero_ratio Integrity Total_SNPs Synonymous_SNPs Non_synonymous_SNPs
B. juncea
BjuA vegetable 7.48% 86.92% 1,518,243 209,363 115,164
CN59 vegetable 36.84% 82.68% 1,925,241 226,224 121,531
CN40 vegetable 33.02% 85.98% 662,198 69,891 38,632
CN46 vegetable 25.49% 85.38% 997,120 111,394 60,297
CN48 vegetable 41.91% 87.75% 445,939 35,286 21,666
CN53 vegetable 28.45% 85.54% 1,121,548 119,990 65,760
CN58 vegetable 27.02% 85.36% 949,864 102,278 56,783
CN04 vegetable 26.54% 91.77% 1,336,000 122,785 67,077
CN18 vegetable 25.55% 85.71% 1,084,829 114,249 62,565
CN02 oilseed 22.07% 81.83% 1,518,012 170,440 93,243
EU07 oilseed 20.64% 81.97% 1,568,160 174,909 95,530
AU213 oilseed 21.73% 84.15% 1,560,846 174,249 94,803
CN74 oilseed 45.43% 83.69% 1,881,218 210,599 115,801
EU11 oilseed 28.13% 81.98% 1,638,861 185,494 101,561
IN30 oilseed 31.10% 81.62% 1,213,381 136,492 75,111
CN77 oilseed 47.78% 86.76% 2,097,376 216,523 119,518
CN79 vegetable 19.30% 84.27% 1,845,092 198,589 106,901
CN78 vegetable 40.53% 94.60% 667,413 42,459 25,453
B. napus
Darmor-bzh oilseed 12.42% 70.98% 1,114,029 139,660 80,137
Yudal oilseed 16.33% 92.79% 1,493,504 59,834 44,084
Bristol oilseed 15.49% 89.90% 1,478,902 38,425 29,727
Aburamasari oilseed 16.41% 91.20% 1,448,464 56,653 42,536
Aviso oilseed 14.54% 91.44% 1,484,373 19,957 16,498
B. rapa
caizi-1 ssp.oleifera 3.94% 30.42% 32,045 16,509 14,263
caizi-2 ssp.oleifera 3.33% 32.59% 33,716 17,332 15,248
caizi-3 ssp.oleifera 4.23% 30.50% 32,113 16,526 14,217
dabaicai-2 ssp.pekinensis 1.08% 88.05% 61,209 32,366 28,134
dabaicai-3 ssp.pekinensis 1.59% 85.16% 70,622 37,301 32,146
dabaicai-4 ssp.pekinensis 1.49% 91.58% 76,857 40,556 35,104
dabaicai-5 ssp.pekinensis 1.71% 89.95% 78,357 41,440 35,528
ouzhouwujing-1 ssp.rapa(European) 1.76% 33.65% 34,980 18,451 15,895
ouzhouwujing-2 ssp.rapa(European) 1.04% 28.76% 29,929 15,821 13,775
ouzhouwujing-3 ssp.rapa(European) 1.31% 25.38% 26,453 13,944 12,150
ouzhouwujing-4 ssp.rapa(European) 2.69% 22.82% 23,988 12,454 10,871
ouzhouwujing-5 ssp.rapa(European) 1.30% 28.35% 29,141 15,461 13,284
xiaobaicai-1 ssp.chinensis 1.20% 88.34% 93,831 49,387 43,266
xiaobaicai-2 ssp.chinensis 1.45% 93.65% 101,323 53,548 46,246
xiaobaicai-3 ssp.chinensis 30.49% 77.46% 86,728 32,124 28,147
xiaobaicai-4 ssp.chinensis 0.67% 87.68% 91,055 48,517 41,871
xiaobaicai-5 ssp.chinensis 0.70% 79.28% 83,444 44,577 38,233
yazhouwujing-1 ssp.rapa(China) 0.88% 27.35% 27,655 14,835 12,568
yazhouwujing-2 ssp.rapa(China) 2.32% 33.41% 34,217 17,969 15,440
yazhouwujing-3 ssp.rapa(China) 4.66% 40.91% 43,260 22,094 19,132
yazhouwujing-4 ssp.rapa(China) 3.23% 32.88% 32,984 17,061 14,828
yazhouwujing-5 ssp.rapa(China) 3.48% 33.07% 33,190 17,182 14,840
youcai-sarson-1 ssp.tricolaris 1.75% 33.94% 36,041 18,979 16,416
youcai-sarson-2 ssp.tricolaris 0.64% 33.70% 35,531 18,810 16,469
youcai-sarson-3 ssp.tricolaris 0.63% 93.27% 101,565 53,675 47,199
youcai-sarson-4 ssp.tricolaris 1.30% 96.01% 107,567 56,871 49,244
Table 26a. Selected instances used to validate the assumption that the Ks method cannot be used to estimate the divergence time
Range of Ks Gene1 Gene2 Ks (PAML) Ks (KaKs_calculator) Gene length (bp) Synonymous sites Substitutions Synonymous substitutions
0.0045-0.0060
Bra012510 BjuA027390 0.0058 0.0052 720 192.178 1 1
Bra000412 BjuA011847 0.0053 0.0040 849 251.397 16 1
Bra012311 BjuA027207 0.005 0.0046 915 216.978 5 1
Bra000416 BjuA011843 0.0046 0.0043 912 231.189 4 1
0.0030-0.0045
Bra011481 BjuA003650 0.0042 0.0040 1086 252.333 2 1
Bra028624 BjuA004368 0.0037 0.0035 1074 285.792 1 1
Bra011334 BjuA003498 0.0037 0.0035 1119 290.041 3 1
Bra033402 BjuA029144 0.0037 0.0034 1167 293.969 3 1
0.0015-0.0030
Bra000878 BjuA013088 0.0027 0.0025 1665 397.661 2 1
Bra006602 BjuA022956 0.0022 0.0020 1986 500.723 2 1
Bra013510 BjuA002918 0.0022 0.0020 1989 491.097 2 1
Bra031704 BjuA031993 0.0016 0.0016 2583 631.875 1 1
0.0000-0.0015
Bra005954 BjuA022654 0.0013 0.0014 3366 726.106 4 1
Bra024449 BjuA029097 0.0011 0.0011 3681 919.882 2 1
Bra028276 BjuA038149 0.0007 0.0003 2826 738.721 3 2.34E-01
Bra033356 BjuA013654 0.0003 NA 3345 803.881 4 1.29E-05
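One way to read Table 26a: when a gene pair differs by a single synonymous substitution, Ks is close to 1 divided by the number of synonymous sites, so the Ks bin mostly tracks gene length. The sketch below illustrates this with four rows copied from the table. It assumes a Jukes-Cantor correction of the raw proportion, which is our choice for illustration; the source does not state which model KaKs_calculator used here, and `ks_from_counts` is a hypothetical helper.

```python
import math


def ks_from_counts(syn_substitutions: float, syn_sites: float) -> float:
    """Jukes-Cantor corrected synonymous distance from raw counts (assumed model)."""
    p = syn_substitutions / syn_sites  # proportion of synonymous differences
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# (gene pair, synonymous sites, synonymous substitutions, reported KaKs_calculator Ks)
EXAMPLES = [
    ("Bra012510/BjuA027390", 192.178, 1, 0.0052),
    ("Bra011481/BjuA003650", 252.333, 1, 0.0040),
    ("Bra031704/BjuA031993", 631.875, 1, 0.0016),
    ("Bra005954/BjuA022654", 726.106, 1, 0.0014),
]

for pair, sites, subs, reported in EXAMPLES:
    estimate = ks_from_counts(subs, sites)
    print(f"{pair}: estimated {estimate:.4f}, reported {reported:.4f}")
```

With one substitution per pair, the estimate depends only on the number of synonymous sites, which is consistent with longer genes falling into the lower Ks ranges in the table.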
Table 26b. Selected instances between BraA and BnaA used to validate the assumption that the Ks method cannot be used to estimate the divergence time
Range of Ks Gene1 Gene2 PAML method PAML Ks KaKs_calculator method KaKs_calculator Ks Length (bp) Synonymous sites Substitutions Synonymous substitutions
0.0045-0.0060
Bra008081 BnaA02g16400D NG 0.0050 YN 0.0050 930 200.00 5 1
Bra020309 BnaA02g06990D NG 0.0053 YN 0.0053 831 190.45 3 1
Bra009044 BnaA10g21970D NG 0.0049 YN 0.0053 825 187.64 2 1
Bra020161 BnaA02g05380D NG 0.0046 YN 0.0045 912 223.44 2 1
0.0030-0.0045
Bra036942 BnaA10g08820D NG 0.0042 YN 0.0033 1,131 304.75 1 1
Bra009533 BnaA10g26610D NG 0.0044 YN 0.0046 1,080 220.36 4 1
Bra009341 BnaA10g23000D NG 0.0036 YN 0.0033 1,191 301.15 1 1
Bra035336 BnaA02g23350D NG 0.0037 YN 0.0038 1,038 261.05 3 1
0.0015-0.0030
Bra009058 BnaA10g22100D NG 0.0026 YN 0.0024 1,680 420.86 2 1
Bra009091 BnaA10g25210D NG 0.0022 YN 0.0020 1,872 495.44 1 1
Bra035396 BnaA04g07390D NG 0.0023 YN 0.0019 2,217 517.96 4 1
Bra040547 BnaA05g33710D NG 0.0021 YN 0.0021 1,917 469.05 2 1
0.0000-0.0015
Bra022504 BnaA02g11830D NG 0.0013 YN 0.0012 3,189 812.40 1 1
Bra011666 BnaA01g01450D NG 0.0013 YN 0.0004 9,051 2504.36 5 1
Bra011665 BnaA01g01460D NG 0.0010 YN 0.0012 3,402 847.45 3 1
Bra011860 BnaA01g05510D NG 0.0005 YN 0.0011 4,287 904.14 9 1
Table 27a. Detailed sample information of homoeolog expression dominance in B. juncea
Run SRA number Strain PE/SE Data (MB) Tissue Stage Treatment
RR1822192 SRS859889 B. juncea (Czern) L. PAIRED 7,145 MB seedlings two weeks control
SRR1822193 SRS859888 B. juncea (Czern) L. PAIRED 6,788 MB seedlings two weeks salinity
SRR1718914 SRS794662 B. juncea (Czern) L. PAIRED 3,706 MB whole seedlings 7-days control
SRR1718916 SRS794664 B. juncea (Czern) L. PAIRED 2,582 MB whole seedlings 7-days high temperature
SRR1718918 SRS794666 B. juncea (Czern) L. PAIRED 1,918 MB whole seedlings 7-days drought
SRR2953668 SRS1173974 resynthesized SINGLE 208 MB silique walls after 21 days pollination -
SRR2953675 SRS1173967 resynthesized SINGLE 735 MB leaves after 3 days of flowering -
SRR1269499 SRX530145 B. juncea var. varuna PAIRED 20,295 MB mix(a) young flower buds stage -
SRR807368 SRS406672 B. juncea PAIRED 3,815 MB seed coat unknown -
SRR380274 SRX108496 B. juncea var. tumida SINGLE 357 MB stems 22 weeks after seeding -
SRR380273 SRX108497 B. juncea var. tumida PAIRED (PCR) 3,023 MB mix(b) - -
SRR380275 SRX108498 B. juncea var. tumida SINGLE 340 MB stems of Yong’an1 18 weeks after seeding -
SRR380276 SRX108499 B. juncea var. tumida SINGLE 323 MB stems of Yong’an2 20 weeks after seeding -
SRR380277 SRX108500 B. juncea var. tumida SINGLE 351 MB stems of Yong’an3 c 22 weeks after seeding -
SRR380278 SRX108501 B. juncea var. tumida SINGLE 338 MB stems of Yong’an4 25 weeks after seeding -
T84-66 PRJNA285130 B. juncea var. tumida PAIRED 5,384 MB mix(c) - -
Note:
mix(a): a pooled sample from inflorescence, leaf, pod and seed
mix(b): a pooled sample from Yong’an 1-4
mix(c): a pooled sample from root, inflorescence, stem, seed and leaf
Table 27b. Homoeolog expression dominance in B. juncea varieties
No. Sample name Gene number with 2 FC expression (BjuA, BjuB, Non-dominance) Non-expression genes Binomial test p-value Percentage of dominance genes in all genes (80,050)
1 Indian Control 2,813 3,353 6,921 1,575 6.49E-12 15.41%
2 Indian High Temperature 3,224 3,810 5,967 1,661 2.97E-12 17.57%
3 Indian Drought 3,030 3,614 6,400 1,618 8.25E-13 16.60%
4 Indian Variant Salinity 2,880 3,403 5,905 2,474 4.42E-11 15.70%
5 Indian Variant Control 2,810 3,358 6,128 2,366 3.18E-12 15.41%
6 Sichuan Yellow 3,163 3,733 5,212 2,554 7.09E-12 17.23%
7 Varuna 2,892 3,432 7,712 626 1.19E-11 15.80%
8 T84-66 2,606 3,026 8,226 804 2.33E-08 14.07%
9 S.AABB 3,602 3,558 4,136 3,366 6.11E-01 17.89%
10 L.AABB 3,754 3,382 4,879 2,647 1.12E-05 17.83%
11 Daye3bianzhong 3,024 3,574 6,035 2,029 1.36E-11 16.48%
12 yongan 3,032 3,496 5,932 2,202 9.88E-09 16.31%
13 yongan1 2,967 3,445 5,985 2,265 2.53E-09 16.02%
14 yongan2 3,011 3,513 5,842 2,296 5.45E-10 16.30%
15 yongan3 2,853 3,382 6,036 2,391 2.22E-11 15.58%
16 yongan4 2,788 3,305 6141 2428 3.73E-11 15.22%
Note: FC, fold change. A binomial test was applied to assess whether the number of dominant genes differs significantly between subgenomes.
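The binomial test mentioned in the note can be sketched as follows. This is a minimal illustration, assuming an exact two-sided test of the BjuA-dominant count against an equal split between the two subgenomes; the source does not say which implementation or sidedness was used, and `binomial_two_sided_p` is a hypothetical helper. The counts are copied from Table 27b.

```python
def binomial_two_sided_p(k: int, n: int) -> float:
    """Exact two-sided binomial test of k successes in n trials against p = 0.5."""
    lo = min(k, n - k)
    term = 1  # C(n, 0)
    tail = 1  # running sum of C(n, i) for i = 0..lo
    for i in range(1, lo + 1):
        term = term * (n - i + 1) // i  # exact integer update: C(n, i) from C(n, i-1)
        tail += term
    # With p = 0.5 the distribution is symmetric, so double the smaller tail.
    return min(1.0, 2 * tail / 2**n)


# (sample, BjuA-dominant genes, BjuB-dominant genes, reported p-value)
SAMPLES = [
    ("Indian Control", 2813, 3353, 6.49e-12),
    ("T84-66", 2606, 3026, 2.33e-08),
    ("S.AABB", 3602, 3558, 6.11e-01),
]

for name, bju_a, bju_b, reported in SAMPLES:
    p = binomial_two_sided_p(bju_a, bju_a + bju_b)
    print(f"{name}: computed p = {p:.3g}, reported p = {reported:.3g}")
```

Integer arithmetic keeps the tail sum exact for totals in the thousands, where floating-point binomial coefficients would overflow. Under these assumptions, the subgenomes differ strongly in the natural accessions but not in S.AABB.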
Table 31. Selective sweep analysis for B. juncea
#Chr Chr_len (bp) Fst πvegetable πoilseed θπ
J01 45,320,972 0.25049 0.00127 0.00121 1.04845
J02 25,468,329 0.24086 0.00166 0.00145 1.14438
J03 43,957,779 0.22047 0.00144 0.00131 1.09452
J04 33,175,529 0.29815 0.00142 0.00131 1.08362
J05 38,841,303 0.28376 0.00138 0.00099 1.38773
J06 38,054,634 0.25178 0.00151 0.00121 1.25646
J07 28,547,970 0.19727 0.00121 0.00115 1.04697
J08 29,345,952 0.26835 0.00166 0.00170 0.97189
J09 62,778,564 0.18350 0.00156 0.00146 1.06902
J10 22,354,451 0.19295 0.00166 0.00145 1.14481
J11 32,467,241 0.25257 0.00099 0.00102 0.96451
J12 60,369,535 0.20894 0.00118 0.00126 0.93017
J13 83,567,156 0.22871 0.00088 0.00116 0.76154
J14 28,343,526 0.23934 0.00087 0.00117 0.74025
J15 50,573,455 0.16598 0.00112 0.00121 0.91916
J16 18,736,220 0.26912 0.00133 0.00180 0.74141
J17 44,223,734 0.21805 0.00113 0.00127 0.89606
J18 77,673,196 0.22192 0.00079 0.00097 0.81239
Average 42,433,308 0.23290 0.00128 0.00128 1.00074
Table 34. Primers used for gene loss validation
Gene ID (Non-loss) Primers Gene ID (Loss) Primers
Bra039553/BjuA005100 F: ATGAAGCAAATATTTGGGAAATTA R: ACTAGCAGAAGTTTTCCCAGT Bra006842 F: ATGGACAACAACAAAAGGAAAG R: TCATGCTCCGCTCTTTGGTC
Bra001721/BjuA012618 F: ATGTCTTCTGATTACTCACCTT R: GCTGTTGACTCTTTCTTCAGG Bra007575 F: ATGGACATGATCAGTCAATTGT R: CTAAACCCAATAACACCACCC
Bra010005/BjuA023812 F: ATGGAGGAAGTTGAAGCTGC R: GGTGTGAGCTGACTGGGAG Bra007462 F: ATGGCGAAGAGTTTGTGCATC R: TCAGAGAAGAAGGCGTAGAC
Bra005653/BjuA001706 F: ATGGAGTTTGTGAAATCGTTGG R: GTGACCATCTGTTCTCATCAGAA Bra001370 F: GGGTCTATGAAGTCGGAGGA R: GGATCGGTCGAAATTATGCT
Bra025928/BjuA022182 F: ATGAGATCCTCATCGACTTCT R: CAAGGAGATCAAAGCATCGTC Bra024553 F: GTAGGGAGGTGATCCATTCG R: TTAACAGGATCCACGAGCAC
Bra035936/BjuA032524 F: ATGGGTGGAGGGTCAAAGAG R: GGCAAGACCGTTTCTAGTCT Bra014930 F: AAGATGCTTGGATGAGAGTGG R: CACTTCGAATTAACCTTTCTTGG
Bra008833/BjuA039310 F: ATGTTACCAAAGTTCGATCCG R: TCGACTGGAATGCAGCTCAT Bra022306 F: TGCTGGTGCTCTTACCAATC R: CCGAGTTGCTTCCATCAGTA
Bra024822/BjuA024592 F: ATGGCGAAAAATCACGGCGT R: ATCATCATCATCCATTTCAAAGC Bra016859 F: AGGAGAATACGAGGAAGGCA R: TAGCTGGCAAACATGTCCTC
Bra041032/BjuA015249 F: ATGATTTCCTTTGTTGGTCGAG R: TTAATTTGCTTTAGCCTTTGGAG Bra018473 F: AACTGTGCAGCAGGTTTGAC R: CCCAACCATATTTCAACCAA
Bra000047/BjuA010390 F: ATGGCCGCAATCAGTTTCTC R: CTACTTAAACATATCGGCAAGT Bra019268 F: TCACAGGGATGGCAACTTTA R: CTGTTGATGCTCCGATGAAT
BniB019213/BjuB024015 F: ATGGGCTCCCCTGTCTCGT R: CTCATCATCTCATTTTCAATCC BniB002837 F: GATGCTTCTTGCCTCATCAA R: CCCAACACAGCAACGTTATC
BniB048496/BjuB021004 F: ATGCCTGTGTCCGTACATTC R: GACACGTCGGTACTCGTCT BniB011344 F: ATGGCTCATGATGATTATGTAAA R: TCATGCTTTCGTCCTGCGC
BniB006554/BjuB010456 F: ATGAGAAACGTAGGGAGTTCG R: GAACCTTGTGTTTGATGGTCG BniB013229 F: TGAACGTCTCTCCGTATTGG R: TTGGCTGAGAAGATGACGAG
BniB002930/BjuB003659 F: AAACCTTCCGCAGATTCTAGC R: TTAAACCATCTTTGTCACCGC BniB016126 F: ATCACAGATGGAGCAGCTTG R: TGGGAAACGATGGATGACTA
BniB014100/BjuB007950 F: ATGAAACTAGAGCTAATCCTCG R: AGAGATATTACGAGTAACGTCT BniB025882 F: AAACATGATTTCCGGAGGAG R: TTGCGGCTAGAATTTGGATA
BniB011005/BjuB028236 F: ATGGCACAAAAACTGGAAGCCA R: AGTGAAGGGCAGAACATGGA BniB020222 F: GGTTCTCCTGGTTCCTGTGT R: ACCCTTTGGTTCAAGCTCAC
BniB010483/BjuB035357 F: ATGTTTCCCAGATTAGGTCGA R: TGCCTTTGCTGCTTTTAACTC BniB025866 F: GAGAGAGCTCAGGCCAAGTT R: GCCAAGACTCTTCCTACGCT
BniB000067/BjuB037529 F: ATGTCGTCGTCTTCTCCGAG R: TCATGAGTCTGTAGCAGTAATAG BniB025956 F: CAGGAAGAGTTGCTGTGGAA R: CACTACCTCCGAAGCTGTCA
BniB000071/BjuB037537 F: ATGGTGGCAGAAGCCATGAG R: TTAATACAGATAGATTTTGGTTTCC BniB019567 F: TGACGATTGATCTTGATGCAG R: CCTTTGTTCTCAAAGTTCGGA
BniB000074/BjuB037542 F: ATGAAGTCATTAGAGAGAGTGG R: CTACCAGAACCGGTCTTTATTG BniB024798 F: CAAACTCGGCAGAAATGAGA R: CCATCGTTCGATTCCTCTTT