supplementary materials and methods for · 2018-09-24 · (supplementary table 2a and supplementary...

SUPPLEMENTARY NOTE

Genome size estimation by K-mer and flow cytometry analysis

A K-mer is a sequence of K nucleotides. K-mer statistics give the discrete

probability distribution of the number of possible K-mer combinations [1]. We

counted the copy number of each K-mer (17-mer) present in the sequence reads and

plotted the distribution of copy numbers. The K-mer distribution can be used to infer

the genome size, because the peak of the frequency curve represents the overall

sequencing depth. The estimate is given by G = (N × (L - K + 1) - B) / D, where N is the

total number of sequence reads, L is the average read length and K is the K-mer length.

To minimize the influence of sequencing errors, K-mers with low frequency (< 3) were

discarded. B is the total number of low-frequency 17-mers, D is the overall depth

estimated from the K-mer distribution and G denotes the genome size. The 17-mer

frequency peaks at a depth of about 50X for B. juncea (Supplementary Fig. 1) and 45X

for B. nigra (Supplementary Fig. 5).
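The genome size formula above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name and the input values are hypothetical and are not taken from the study.

```python
def estimate_genome_size(n_reads, mean_read_len, k, low_freq_kmers, peak_depth):
    """Return G = (N * (L - K + 1) - B) / D, in bases."""
    total_kmers = n_reads * (mean_read_len - k + 1)   # N * (L - K + 1)
    return (total_kmers - low_freq_kmers) / peak_depth


# Hypothetical inputs chosen only to exercise the formula (not values from the study).
g = estimate_genome_size(n_reads=600_000_000, mean_read_len=100, k=17,
                         low_freq_kmers=400_000_000, peak_depth=50)
print(f"{g / 1e6:.0f} Mb")
```

Each read of length L contributes L - K + 1 K-mers, so dividing the error-corrected K-mer total by the peak depth gives the number of distinct genomic positions.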

In addition, we used flow cytometry [2] to estimate the genome size of B. juncea.

The genome of O. sativa (Nipponbare) was used as the control [3] (Supplementary Table 1).

The estimated genome size of B. juncea is slightly smaller than previously published

estimates (984 to 1006 Mb) obtained by flow cytometry without a control [4,5].

High-throughput sequencing

Whole genome sequencing for B. juncea and B. nigra

A B. juncea var. tumida inbred line (T84-66) with excellent agronomic traits and widely

Nature Genetics: doi:10.1038/ng.3657

used as a parent in hybrid breeding and a B. nigra double haploid line (YZ12151) were

used for the reference genome sequencing. The genomic DNAs were extracted from

leaves with a standard CTAB extraction method. Genomic sequences were generated

using Illumina HiSeq™ 2000 & 2500 sequencing platforms with PE and MP libraries

(Supplementary Table 2a and Supplementary Table 10). Additionally, approximately 10 X

coverage of genome sequences from 17 B. juncea cultivars consisting of 10 vegetable-

and 7 oil-use sub-varieties were generated for crop usage selection analysis

(Supplementary Table 24). Low-depth (<1 X) genome sequences of 27 representative B.

rapa accessions were generated for analysis of the A-subgenome of B. juncea (Supplementary

Table 25).

Single-molecule sequencing of B. juncea based on PacBio platform

The total DNA was extracted from the leaves. Nanodrop, Qubit2.0 and gel electrophoresis

were used to assess DNA purity, concentration and integrity, respectively. Seven μg of

total DNA was used to construct a 20 kb DNA library for sequencing on the PacBio RS II platform (PacBio,

USA) according to the standard protocol. A total of 11.09 Gb of SMRT data

were generated, covering about 12 X of the B. juncea genome after QC (Supplementary

Figure 2).

Genome (Optical) maps of B. juncea based on IRYS system

Young leaves were preprocessed according to the IrysPrep™ Plant Tissue-Nuclei protocol

and the DNA was extracted following the IrysPrep™ Plus Long DNA Isolation protocol. DNA

at a concentration of 30 to 200 ng/μl and a total amount of 300 ng was used for the

subsequent experiments. The nicking, labelling, repair and staining processes

were performed in strict accordance with the IrysPrep™ Labeling-NLRS (300 ng) protocol.

A total of 996,648 BioNano molecules were obtained, with a total length of 205 Gb,

covering around 222 X of the B. juncea genome. The optical maps were assembled using

Irys-scaffolding with default parameters, and 922 optical maps were obtained with an average

length of 1.19 Mb (Supplementary Table 4).

RNA-seq of B. juncea and B. nigra

Total RNA from each tissue (root, stem, leaf, flower and silique) was extracted from B.

juncea and B. nigra according to the instruction manual of the TRIzol Reagent (Life

Technologies, California, USA). Equal amounts of high-quality RNA from each

tissue were then pooled for cDNA library construction for B. juncea and B. nigra,

respectively. Approximately 11.56 Gb and 4 Gb of transcriptomic data were generated for B.

juncea and B. nigra, respectively, using the Illumina HiSeq™ 2000 sequencing platform

(Illumina, USA) with the standard pipeline. The usable reads (after removing low-quality reads)

obtained from all samples were assembled de novo using Trinity [6] (Supplementary Table

20a and Table 20b).

Genome assembly and annotation for B. juncea and B. nigra

Raw data preprocessing

In order to facilitate the assembly, a series of checking and filtering measures

corresponding to different platforms were performed.

The following criteria were used to filter Illumina low-quality reads:

1. Filter reads in which unknown nucleotides ('N') exceed 5%.

2. Filter low-quality reads whose average PHRED-like score is < 20 (corresponding to a

sequencing error rate > 1%).

3. Clip bases with a PHRED score < 20 at the end of reads. For long-insert library reads, reads

shorter than 30 bp after clipping of low-quality bases were discarded.

4. Filter reads with adapter contamination. Reads with more than 10 bp aligned to the

adapter sequence (allowing up to 3 bp of mismatch) were removed.
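The four Illumina filtering criteria above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the function names are hypothetical, and the adapter overlap is assumed to have been computed beforehand by an aligner.

```python
def trim_low_quality_tail(seq, quals, min_q=20):
    """Clip bases with PHRED score < min_q from the end of the read (criterion 3)."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


def passes_filters(seq, quals, adapter_overlap=0, max_n_frac=0.05,
                   min_mean_q=20, min_len=30, max_adapter_overlap=10):
    """Return True if a read survives criteria 1-4.

    adapter_overlap is the number of bases aligned to the adapter (assumed
    to come from a separate alignment step that tolerates <= 3 mismatches).
    """
    if not seq:
        return False
    if seq.upper().count("N") / len(seq) > max_n_frac:        # criterion 1
        return False
    if sum(quals) / len(quals) < min_mean_q:                  # criterion 2
        return False
    trimmed, _ = trim_low_quality_tail(seq, quals, min_mean_q)
    if len(trimmed) < min_len:                                # criterion 3
        return False
    return adapter_overlap <= max_adapter_overlap             # criterion 4
```

The length check is applied after clipping, so a read with a long low-quality tail is discarded even if its mean quality passes.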

In all, 1,653,212 raw reads were produced from 11 PacBio cells. Reads with

quality < 0.75 or length < 500 bp were filtered out, and a total of 626,640 reads were retained.

Next, the SMRT reads were corrected with ECTools [7] using the ALLPATHS-LG assembled

contigs with default parameters, and corrected reads longer than 3 kb were retained.

Basic processing of the BioNano raw data was performed using the IrysView package.

Molecules with length > 100 kb, label SNR >= 3.0 and average molecule intensity < 0.6

were retained for further genome assembly.

Genome assembly for B. juncea

First, all the Illumina reads remaining after the above filtering and correction steps were used for de

novo assembly with ALLPATHS-LG [8] using default parameters. Then all the corrected

PacBio RS II reads were used to fill the gaps with PBJelly v15.2.20 [9] with parameters:

--minGap 1 -minMatch 8 -minPctIdentity 70 -bestn 1 -nCandidates 20 -maxScore -500

-nproc 5 -noSplitSubreads, which resulted in a genome with a total size of 784 Mb. Next, the

RefAligner utility in IrysView was used to align the Irys molecules to the

draft assemblies to correct chimeric errors in the scaffolds. We aimed to break

the scaffolds at the gaps nearest to the candidate enzyme sites (Supplementary Figure 3).

Candidate break sites were identified using the following criteria:

1. scaffolds are longer than 100 kb with more than 20 enzyme sites;

2. candidate sites are not located at the edges (< 10 sites or < 50 kb from the start or end base) of a scaffold;

3. candidate sites are “covered across” by fewer than 3 molecules (“cover across” means more than 2 sites matched on both sides of the candidate break site);

4. there are more than 3 enzyme sites between two candidate break sites.

Altogether, 180 scaffolds were disconnected at 233 candidate break sites. Finally, the corrected scaffolds were

anchored to the optical maps.
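The screening rules for candidate break sites can be expressed as a single predicate. The sketch below is one possible reading of the four criteria; the record layout and field names are hypothetical, and in particular the way edge distance is measured is an assumption.

```python
from dataclasses import dataclass


@dataclass
class CandidateSite:
    scaffold_len: int            # scaffold length in bp
    scaffold_sites: int          # number of enzyme sites on the scaffold
    site_index: int              # 0-based rank of the candidate among those sites
    position: int                # 0-based coordinate of the candidate (bp)
    spanning_molecules: int      # molecules that "cover across" the candidate
    sites_since_last_break: int  # enzyme sites separating it from the previous candidate


def is_break_site(c: CandidateSite) -> bool:
    # Criterion 1: long scaffold with enough enzyme sites.
    long_enough = c.scaffold_len > 100_000 and c.scaffold_sites > 20
    # Criterion 2: not within 10 sites or 50 kb of either scaffold end (assumed reading).
    edge_sites = min(c.site_index, c.scaffold_sites - 1 - c.site_index)
    edge_bp = min(c.position, c.scaffold_len - 1 - c.position)
    interior = edge_sites >= 10 and edge_bp >= 50_000
    # Criterion 3: fewer than 3 molecules span the candidate.
    weak_support = c.spanning_molecules < 3
    # Criterion 4: more than 3 enzyme sites between neighbouring candidates.
    spaced = c.sites_since_last_break > 3
    return long_enough and interior and weak_support and spaced
```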

Genome assembly for B. nigra

For the B. nigra genome assembly, the high-quality Illumina reads were used for de novo

assembly with ALLPATHS-LG [8] using default parameters. The software

GapCloser (GapCloser v1.12 for SOAPdenovo [10]) was used to fill gaps and improve the

quality of the scaffolds by comparison with short paired-end libraries (insert size < 1 kb).

Genome quality assessment

We used CEGMA v2.3 [11], which includes 458 conserved core eukaryotic

genes (CGE database [12]), to assess the completeness of the final genome assemblies of B.

juncea and B. nigra (Supplementary Table 14a).

The assembled genomes of B. juncea and B. nigra were also validated by mapping

23,002 and 18,344 ESTs (length >= 500 bp), respectively, downloaded from NCBI (GenBank) to

the corresponding genome (Supplementary Table 14b).

To assess the accuracy of the B. juncea and B. nigra genomes assembled from HiSeq

sequencing data, we randomly selected 10 sub-reads longer than 40 kb from the PacBio data

for B. juncea and downloaded 15 BAC sequences from GenBank for B. nigra. First, the 10

sub-reads were mapped to the B. juncea genome using BLASR [53] and the 15 BAC sequences were

anchored to the B. nigra assembly using blastn. Then, the blastn results were

chained into larger syntenic regions to identify the corresponding scaffolds for each BAC. Finally,

the formulas (coverage = alignment length / BAC or sub-read length; identity = matched

length / BAC or sub-read length without gaps) were used to calculate the coverage and

identity for each sub-read and BAC sequence (Supplementary Table 12 and 13).

Furthermore, to inspect the paired-end relationships for B. juncea and B. nigra, the

mate-pair reads (3/5/10/15 kb for B. juncea, 3/5/10 kb for B. nigra) were mapped to the whole

genome assemblies using SOAP [13] (Supplementary Figure 6).
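The coverage and identity formulas used for the sub-read and BAC comparison amount to two ratios. The sketch below assumes that "length without gap" means the query length minus the total gap length; the function and parameter names are illustrative.

```python
def coverage(alignment_len, query_len):
    """coverage = alignment length / BAC or sub-read length."""
    return alignment_len / query_len


def identity(matched_len, query_len, gap_len=0):
    """identity = matched length / (BAC or sub-read length excluding gaps)."""
    return matched_len / (query_len - gap_len)


# Hypothetical 40 kb sub-read: 39 kb aligned, 38 kb matched, 2 kb of gaps.
print(coverage(39_000, 40_000), identity(38_000, 40_000, gap_len=2_000))
```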

Genetic maps and pseudo-chromosome construction of B. juncea and B. nigra

Genetic map of B. juncea

We constructed a reference genetic map of B. juncea based on genotyping by

whole-genome resequencing of an F2 population [14,15]. Two homozygous parental inbred lines,

near-isogenic T84-63 (paternal line and the reference cultivar in our genome

sequencing project) and the B. juncea var. napiformis line ‘03A0106’ (maternal

line), were chosen to develop an F2 mapping population. In total, 100 individuals were

randomly selected from the F2 population for segregation analysis and genetic mapping

(Supplementary Table 8). PE reads generated by resequencing the two parental lines and 100 F2 lines

on the Illumina HiSeq™ 2000 platform were aligned to the T84-63 draft

genome using BWA [16] with default parameters. Potential SNPs were identified with GATK

v3.4 [17]. Before genotyping, the following criteria were applied to reduce the false discovery

rate of SNPs: 1) remove SNPs with effective depth lower than 10 for the paternal line and 6

for the maternal line; 2) remove SNPs with MAF < 0.05; 3) remove SNPs with copy number > 1.5. Because of

the low-coverage sequencing data of the F2 lines, the genotypes of the offspring were imputed

using the LB-Impute software [18] to improve data integrity, with the Markov trellis

window set to a length of 5. After imputation, SNPs with integrity lower than 0.7 were

filtered out, and a marker set of 62,580 SNPs was obtained. Pair-wise recombination rates for this

marker set were calculated on each scaffold, and adjacent SNPs with a pair-wise recombination

rate less than 0.001 were lumped into a genetic bin. After excluding bins showing

significantly distorted segregation (Chi-square test, P-value < 0.01), a final set of 5,333 bin

markers was grouped into 18 linkage groups (Supplementary Table 9) using the HighMap

software [19].
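The segregation distortion filter can be illustrated with a chi-square goodness-of-fit test. The sketch below assumes the expected 1:2:1 genotype ratio of a codominant marker in an F2 population, which the text does not state explicitly; the function names are hypothetical.

```python
import math


def f2_segregation_test(n_aa, n_ab, n_bb):
    """Chi-square test of observed F2 genotype counts against 1:2:1 (2 d.f.)."""
    n = n_aa + n_ab + n_bb
    expected = (0.25 * n, 0.5 * n, 0.25 * n)
    chi2 = sum((o - e) ** 2 / e for o, e in zip((n_aa, n_ab, n_bb), expected))
    # With 2 degrees of freedom the chi-square survival function is exp(-x / 2).
    return chi2, math.exp(-chi2 / 2)


def is_distorted(n_aa, n_ab, n_bb, alpha=0.01):
    """True if the bin would be excluded at the P < 0.01 threshold."""
    return f2_segregation_test(n_aa, n_ab, n_bb)[1] < alpha
```

Using the closed-form survival function for 2 degrees of freedom keeps the example free of external dependencies.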

Assignment of subgenomes and pseudochromosome construction

We assigned the BjuA and BjuB subgenomes of B. juncea with reference to the genetic map

(T84/DTC) constructed in this study and the published SY/PM map [20]. The genome assembly was

assigned to the corresponding subgenomes of B. juncea according to the integrated

information from these two genetic maps. The ALLMAPS software [21] was used to construct the

initial pseudo-chromosomes of B. juncea from scaffolds using the genetic maps T84/DTC and

SY/PM. For scaffolds that could not be anchored genetically, syntenic relationships between B.

juncea and its ancestral genomes B. rapa and B. nigra were investigated after

assignment of subgenomes. The final pseudo-chromosomes were constructed by combining

the information from the genetic maps T84/DTC and SY/PM with the syntenic map of the genetically

un-anchored scaffolds (Supplementary Fig. 5).

Genetic map of B. nigra

The ALLMAPS software [21] was also used to construct the initial pseudo-chromosomes of B.

nigra from scaffolds using the linkage groups of T84/DTC.

Repeats annotation

Repeat sequences in the B. juncea and B. nigra genomes were identified with a

combination of de novo and homology-based strategies. The results from four de novo programs,

RepeatScout [22], LTR-FINDER [23], MITE [24] and PILER [25], were merged as the initial

repeat library. The initial repeat library was classified into classes, subclasses,

superfamilies and families by the PASTEClassifier.py script included with REPET [26]. We

then merged the TE sequences of Brassica species (B. juncea, B. nigra, B. rapa, B. oleracea

and B. napus) with the known Repbase database [27] to construct a new repeat

database. Finally, this new repeat database was used to identify repeat sequences in the genome

assemblies with RepeatMasker [28] (Supplementary Table 15).

Gene model prediction and evaluation

Genes were annotated iteratively using three main approaches: homology-based (H), de

novo (D) and EST/unigene-based (C). The results of these three methods were integrated

with GLEAN [29] to obtain high-confidence gene models by combining all evidence.

Homology-based method (H): Protein sequences from 2 sequenced eudicot species, A.

thaliana and B. rapa, from public databases were used to perform the prediction. We used

GeneWise (v2.2.0) [30] to determine the accurate gene structure. For de novo prediction,

we used Augustus with parameters trained on unigenes from the transcriptome data, and

Genscan [31] and GlimmerHMM [32] with Arabidopsis parameters, to obtain de novo gene

models. In the third approach, unigenes were aligned to the genome assembly using

BLAT (identity >= 0.95, coverage >= 0.90) and then filtered using PASA.

After combining all evidence to generate gene models with GLEAN [29], an RNA-seq-based

method, mapping the transcriptome data to the reference genome using TopHat [33] and

assembling transcripts with Cufflinks [33], was adopted to refine the gene structures and identify new

genes. We filtered out short gene models (< 150 bp) and single-exon gene models to generate the

final gene set for further analysis (Supplementary Table 18).

Gene model evaluation

The resultant gene set contains 80,050 protein-coding gene models, with a mean CDS

size of 1,111.07 bp and an average of 4.57 exons per gene. We used the RNA-seq data to

evaluate the gene model prediction (Supplementary Table 20).

Gene function annotation

Gene functions were assigned according to the best match of the alignments against

various protein databases, including the non-redundant protein (Nr) database and the

Swiss-Prot database, using BLASTP (E-value = 1e-5). Furthermore, unigenes were searched

against the NCBI non-redundant nucleotide sequence (Nt) database using BLASTN with a

cut-off E-value of 1e-5. Genes were retrieved based on the best BLAST hit (highest score)

along with their protein functional annotation. InterProScan was run on the gene models to

provide a list of InterPro domains [34,35] and GO terms for each B. juncea gene. To

predict the most probable function of the genes, all genes were aligned (E-value = 1e-5)

with KEGG proteins, and the pathways were considered present for B. juncea as long as

there were matches to B. juncea genes. The gene sequences were also aligned to the

Clusters of Orthologous Group (COG) database to predict and classify functions. Kyoto

Encyclopedia of Genes and Genomes (KEGG) pathways were assigned to the assembled

sequences using the online KEGG Automatic Annotation Server (Supplementary Table

19).

Non-coding RNA annotation

tRNAscan-SE (version 1.23) was applied to detect reliable tRNA positions, and other

non-coding RNAs (ncRNAs) were predicted with the Infernal software using default

parameters [36,37]. By comparing secondary structure similarity between the B. juncea and B.

nigra genomes and the Rfam database (v12.0) [38], the ncRNAs were classified into different

families (Supplementary Table 21).

TE content comparison between allopolyploid subgenomes and their diploid parents

To further increase the accuracy and precision of the comparison of TEs between the

subgenomes and their ancestors, only TEs located in corresponding syntenic regions

without gaps (BjuA-BraA, BjuA-BnaA, BjuB-BniB or BolC-BnaC) were considered. This

stringent rule effectively reduces the influence of assembly quality. For the detailed

method of syntenic block identification, see the section “Gene losses in the reference

genome”.

Newly formed TE identification after divergence from its ancestors

The following criteria were used to identify new TEs in the BjuA subgenome (Supplementary

Fig. 8).

1. Filter out simple sequence repeats and short sequences (<200 bp) from BjuA repeat

annotation result.

2. For each TE instance, we selected a pair of marker sequences, each 200 bp long

(purple block in Supplementary Fig. 8), located 1 kb upstream of the start

site of the TE and 1 kb downstream of the end site of the TE (M: blue block). The paired

marker sequences were then searched against the BraA genome using BLASTN (E-value < 1E-5).

This strategy helps ensure that the marker sequences are located in non-repeat regions.

3. To obtain highly confident results, we only retained paired markers

satisfying the following criteria: 1) both markers have a highly conserved match

in the BraA genome (identity > 90% and matched length > 180 bp); 2) the

paired markers are located on the same chromosome; 3) the distance between the

paired markers is shorter than 2*M+TE (purple block in BraA).

4. According to the distance between the paired markers mapped to the BraA genome

(purple block in Supplementary Fig. 8), the TE can be classified into one of four

cases, described below.

A) If the distance between the paired markers in BraA is similar to that of the BjuA counterpart

and the TEs in both BjuA and BraA belong to the same TE category, the TE is regarded as a

common TE.

B) If the distance between the paired markers is shorter than 20 bp, the TE is

regarded as a high-confidence new TE in BjuA because the TE is absent in BraA.

C) If the distance between the paired markers in BraA is approximately equal to the

distance between the paired markers in BjuA (within the range 2*(M-L)-30 to 2*(M-L)

+ 30, L: length of the TE missing from the annotation), the TE in BjuA is regarded as an Annotation less TE.

D) If the distance between the paired markers in BraA is approximately equal to the

distance between the paired markers in BjuA (within the range 2*(M+L)-30 to

2*(M+L) + 30, L: length of genome sequence missing from the assembly), the TE in BjuA is regarded

as an Assembly less TE.
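The marker-pair retention rules in step 3 can be written as a small filter. This is a sketch only: the record layout is hypothetical, and the distance between markers is assumed to be measured from the end of the upstream hit to the start of the downstream hit.

```python
from dataclasses import dataclass


@dataclass
class MarkerHit:
    chrom: str        # BraA chromosome of the BLASTN hit
    start: int        # hit start coordinate (bp)
    end: int          # hit end coordinate (bp)
    identity: float   # percent identity of the hit
    match_len: int    # matched length (bp)


def keep_marker_pair(up: MarkerHit, down: MarkerHit, flank_len: int, te_len: int) -> bool:
    """Apply criteria 1-3 of step 3; flank_len is M and te_len is the TE length."""
    conserved = all(h.identity > 90 and h.match_len > 180 for h in (up, down))
    same_chrom = up.chrom == down.chrom
    distance = abs(down.start - up.end)   # assumed definition of marker distance
    return conserved and same_chrom and distance < 2 * flank_len + te_len
```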

The same strategy was applied to identify new TEs in the subgenomes of B. juncea and B.

napus compared with their corresponding ancestral genomes after divergence from the common

ancestor (Supplementary Table 16a and 16b).

Newly formed TEs were verified by PCR amplification using degenerate primers located

upstream and downstream of the TEs in B. juncea (Supplementary Fig. 9).

Newly formed TE model in allopolyploid genome (AABB and AACC)

Newly formed BjuA TEs come from two sources: intra-subgenome

transposition, or inter-subgenome transposition from BjuB. Each new TE

was used as a query to search the A and B ancestor genomes of B. juncea by BLASTN. The

sequence homology indicates whether the new TE came from the A ancestor by intra-subgenome

transposition or from the B ancestor by inter-subgenome transposition. Alignments

in which less than 50% of the new TE length shared sequence identity were

filtered out. All TE categories were identified according to the following criteria.

1. Whether the difference in sequence identity between the subgenomes of B. juncea is less

than the threshold (5%).

2. The origins of TEs were separated according to the identity between the subgenomes of

B. juncea. If the identity difference is below 5%, the new TE is assigned to the common

category. Using the same approach, we re-annotated new TEs in the B. juncea genome

and identified cross-transposition TEs (Supplementary Table 17).
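The origin assignment can be sketched as a comparison of the best identities obtained against the A and B ancestor genomes. The function below is an illustrative reading of the criteria: the 50% filter is simplified to a minimum on the best supported fraction, and all names are hypothetical.

```python
def classify_te_origin(identity_a, identity_b, min_support=50.0, threshold=5.0):
    """Assign a new TE to an origin from its best hits (in percent) against A and B.

    min_support approximates the 50% filter described above (an assumption);
    threshold is the 5% identity difference that defines the common category.
    """
    if max(identity_a, identity_b) < min_support:
        return "filtered"
    if abs(identity_a - identity_b) < threshold:
        return "common"
    if identity_a > identity_b:
        return "intra-subgenome (A origin)"
    return "inter-subgenome (B origin)"
```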

Gene losses in the reference genome

To call synteny blocks, we performed all-against-all BLASTP (E-value = 1e-5) [39] and chained

the BLASTP hits with QUOTA-ALIGN [40] (cscore = 0.5) using the “1:1 synteny screen”. At least 4

gene pairs were required for a synteny block, and two adjacent synteny blocks were merged

if the distance between them was less than 20 gene pairs. The “1:3 synteny

screen” model in QUOTA-ALIGN (cscore = 0.5) was used to identify synteny blocks between A. thaliana and

Brassica because of the whole-genome triplication [41] in the evolutionary history of Brassica.
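The block size and merging rules can be illustrated with a simple interval merge. This sketch is not QUOTA-ALIGN itself: it assumes blocks are given as (start, end, n_gene_pairs) tuples in gene-order coordinates along one reference chromosome, and it only considers adjacency in that one genome.

```python
def merge_synteny_blocks(blocks, min_pairs=4, max_gap=20):
    """Drop blocks with fewer than min_pairs gene pairs, then merge neighbours
    separated by fewer than max_gap genes."""
    kept = sorted(b for b in blocks if b[2] >= min_pairs)
    merged = []
    for start, end, pairs in kept:
        if merged and start - merged[-1][1] < max_gap:
            prev_start, prev_end, prev_pairs = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), prev_pairs + pairs)
        else:
            merged.append((start, end, pairs))
    return merged


# Hypothetical blocks: the first two are 15 genes apart and are merged,
# the third is too far away, and the last has too few gene pairs.
print(merge_synteny_blocks([(0, 10, 8), (25, 40, 6), (100, 110, 5), (200, 202, 3)]))
```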

To identify the ancestral common gene sets of Brassica (Supplemental Table 29), we

performed pairwise synteny comparisons among all genomes (BraA, BniB, BjuA, BjuB, BolC,

BnaA, BnaC, Ath), collecting a set of syntenic matches for each species. All lost-gene

identifications were based on the Brassica ancestral common gene sets of each

species. We focused on gene sets located within the syntenic blocks identified

between BraA and BjuA. If no annotated gene was found within a syntenic

block, we searched the gene CDS sequence against the entire BjuA genome using

BLASTN (E-value = 0.01, identity = 90%). A gene without an ortholog in BjuA was

regarded as an ancestral BraA gene lost in the BjuA subgenome. This procedure

allowed confident filtering of candidate lost genes, where one BjuA homeologous gene

copy or one parental gene copy was missing at the DNA sequence level from the genome

assemblies. The best BLASTN DNA sequence match found elsewhere in the genome

was taken as the corresponding homeolog (if in the BjuA genome) or ortholog (if in the BraA genome).

We further studied cases in which the sequence match fell outside the syntenic blocks. Cases

matching other syntenic blocks were identified as follows. We used the

splice-aware aligner GMAP [42] to align the diploid coding sequences to the

syntenic region and checked whether the aligned ancestral gene model retained a complete

open reading frame. If BLASTN DNA sequence matches occurred at orthologous

positions with no annotation, or a gene could be predicted by the geneid software [43], the

sequences were aligned to the lost gene with BLAT. When the matched sequence covered

more than 70% of the lost gene length, the gene was considered predictable. To find real gene losses,

genes for which less than 20% of their own length shared sequence identity were

regarded as whole-gene losses with missing DNA sequence. Genes were

labeled ‘partial loss’ if the mapped gene model lacked a start or stop codon, or

‘pseudogenes’ if there were internal stop codons. Following this stringent analysis, we

found an initial set of 303 candidate lost genes (where the DNA sequence was missing)

and 845 candidate part-loss genes and 583 candidate pseudogenes in the BjuA assembly

as compared to the corresponding parental genome. Similarly, the other subgenomes (BniB,

BjuB, BolC, BnaA, BnaC) from Brassica species were compared with the ancestral

gene sets to identify gene loss (Supplementary Table 22).
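The labelling of mapped gene models as ‘partial loss’ or ‘pseudogene’ can be sketched by inspecting the codons of the aligned coding sequence. This is an illustration under stated assumptions: the CDS is assumed to be in frame, the standard stop codons are used, and internal stops are checked before the start and stop codons (the text does not specify the precedence).

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def classify_gene_model(cds):
    """Return 'pseudogene', 'partial loss' or 'intact' for an in-frame CDS string."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3          # ignore trailing bases outside a full codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    if not codons:
        return "partial loss"
    if any(c in STOP_CODONS for c in codons[1:-1]):
        return "pseudogene"                   # internal stop codon
    if codons[0] != "ATG" or codons[-1] not in STOP_CODONS:
        return "partial loss"                 # missing start or stop codon
    return "intact"
```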

Validation of gene loss

To exclude false-positive lost genes, we uniquely mapped raw Illumina reads (~26X) from

BjuA to the ancestral genome BraA. Each of the BjuA “missing syntenic genes” was

classified as:

(a) Deleted: all 303 BjuA missing genes identified above (no DNA sequence found) were

carefully checked based on raw sequence read coverage (less than 5% of the

genome sequencing depth) (Supplementary Table 22). This confirmed the high-confidence

deletion of 156 genes. These were detected because the average depth after

mapping the BjuA raw sequence reads was lower than expected.

(b) Not deleted: normal sequence read coverage similar to the genome

average was observed, as in the truncation and pseudogenization of genes.
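The depth-based confirmation in (a) reduces to comparing the mean read depth over a gene with 5% of the genome-wide depth. A minimal sketch, with hypothetical names and per-base depths assumed to be available from the read mapping:

```python
def confirm_deletion(gene_depths, genome_depth, max_fraction=0.05):
    """True if the mean per-base depth over the gene is below 5% of the genome depth."""
    mean_depth = sum(gene_depths) / len(gene_depths)
    return mean_depth < max_fraction * genome_depth


# Hypothetical per-base depths for two genes at ~26X genome depth.
print(confirm_deletion([0, 1, 0, 2], 26), confirm_deletion([20, 25, 30], 26))
```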

We uniquely mapped raw Illumina RNA-seq reads from BjuA to the BjuA genome.

Each truncated or pseudogenized BjuA gene was confirmed as partially

deleted if it had no RNA-seq read coverage. All 845

BjuA partially missing genes identified above (no RNA sequence found) were carefully checked

based on raw RNA-seq read coverage, which identified 349 high-confidence

part-loss genes (Supplementary Table 18). Sequence changes resulted in disruption of

open reading frames, and therefore the corresponding gene model was considered “partially

lost”, although remnants of the genes still retain some sequence similarity to the ancestral

genes. Similarly, the other subgenomes (BniB, BjuB, BolC, BnaA, BnaC) from Brassica

species were analyzed to count the number of gene losses (Supplementary Table 22).

We then randomly selected 20 gene loss events (and 20 non-loss events) and validated them using

PCR amplification; most of the gene loss events were confirmed.

We think that the unconfirmed events might be caused by missing assembly in the genome or non-specific

amplification of target genes because of possible homologous genes in the polyploid. The

primers used in this validation are listed in Supplementary Table 34.

Gene expression calculation and homoeolog expression dominance identification

Gene expression calculation

The clean reads filtered from the raw reads were mapped onto the B. juncea

genome using TopHat2 [44]. When reads mapped to multiple locations, the top 200

alignments were reported by TopHat2. Expression levels of individual genes

were quantified as FPKM values (fragments per kilobase of exon per million fragments

mapped) with Cufflinks [45].
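The unit "fragments per kilobase of exon per million fragments mapped" corresponds to the simple normalization sketched below. This shows only the basic definition; Cufflinks itself estimates abundances with a statistical model, and the function name and inputs here are hypothetical.

```python
def fpkm(fragments, exon_length_bp, total_mapped_fragments):
    """Fragments per kilobase of exon per million mapped fragments."""
    return fragments * 1e9 / (exon_length_bp * total_mapped_fragments)


# Hypothetical gene: 500 fragments on 2 kb of exon, 10 million mapped fragments.
print(fpkm(500, 2_000, 10_000_000))
```

The factor 1e9 combines the per-kilobase (1e3) and per-million (1e6) scalings.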

Homoeolog expression dominance between subgenomes of B. juncea

Homoeolog expression bias was analyzed within syntenic gene pairs. Differentially

expressed gene pairs that pass the 2-fold change threshold are regarded as dominant

gene pairs, showing either A dominance or B dominance. The dominant genes are the genes that

are expressed at the relatively higher level in dominant gene pairs, and the lower ones are subordinate

genes. The remaining syntenic gene pairs, which show no dominance, are classified as

neutral genes. The numbers of A-dominant gene pairs, B-dominant gene pairs and

non-dominant gene pairs are shown in Supplementary Table 27b. To test whether

A-dominant and B-dominant gene pairs occur equally often,

we performed two-sided binomial tests on the dominant gene pairs for all samples [54,55]

(Supplementary Table 27b).
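The dominance call and the binomial test can be sketched as follows. The example simplifies the procedure: it applies only the 2-fold threshold (the differential expression test itself is not modelled), treats "pass" as greater than or equal to 2-fold, and computes an exact two-sided binomial P-value under the null hypothesis that A and B dominance are equally likely. All names are illustrative.

```python
from math import comb


def classify_pair(expr_a, expr_b, fold=2.0):
    """Label a syntenic homoeolog pair from its two expression values."""
    if expr_a > 0 and expr_a >= fold * expr_b:
        return "A dominant"
    if expr_b > 0 and expr_b >= fold * expr_a:
        return "B dominant"
    return "neutral"


def binomial_two_sided_p(n_a, n_b):
    """Exact two-sided binomial test of n_a versus n_b dominant pairs with p = 0.5.

    Sums the probabilities of all outcomes no more likely than the observed one;
    integer arithmetic keeps the result exact until the final division.
    """
    n = n_a + n_b
    observed = comb(n, n_a)
    tail = sum(c for c in (comb(n, i) for i in range(n + 1)) if c <= observed)
    return tail / 2 ** n
```

Working with binomial coefficients as integers avoids floating-point overflow when thousands of gene pairs are tested.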

Selective pressure on dominantly expressed genes and subgenomes

SNPs were called with GATK¹⁷ for 17 B. juncea accessions using default parameters, and SNPs with depth < 3X were filtered out (Supplementary Table 25). A CDS sequence set was then reconstructed for each sample from the high-quality SNPs. To detect selective pressure acting on each coding gene, the rates of nonsynonymous (dN) and synonymous (dS) substitutions and their ratio (ω = dN/dS) were estimated site-by-site using the YN00 program from the PAML 4.2b package⁴⁶ with default parameters. The estimation was repeated for each paired gene set of the 17 samples. The Ka/Ks values of gene pairs were classified into three categories (dominant, subordinate and neutral genes) and were also separated by subgenome (BjuA or BjuB). To test the statistical significance of differences between data sets, we performed a permutation test with 1,000 permutations. Boxplots were used to compare selective pressure among the three gene categories and between subgenomes.
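A permutation test of this kind can be sketched as below. The test statistic (absolute difference in group means), the fixed random seed and the toy Ka/Ks values are assumptions made for illustration; the text specifies only that 1,000 permutations were used.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=1000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical Ka/Ks values for two gene categories.
dominant = [0.10, 0.12, 0.11, 0.09, 0.13]
subordinate = [0.30, 0.28, 0.33, 0.31, 0.29]
p_value = permutation_test(dominant, subordinate)
print(p_value)
```

With 1,000 permutations the smallest attainable p-value under this convention is 1/1001, so a reported P < 0.001 corresponds to no permuted difference reaching the observed one.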

Diversification of A-subgenome of B. juncea and B. napus

Phylogenetic reconstruction for A-subgenomes in Brassica

We called variants from resequencing data for the A-subgenomes of 18 B. juncea accessions (including the B. juncea reference A-subgenome), 5 B. napus accessions (including the B. napus reference A-subgenome) and 27 B. rapa accessions (including the B. rapa reference sequence) that cover most subspecies of B. rapa. The B. rapa genome was used as the reference genome for all resequenced accessions. BWA¹⁶ and GATK¹⁷ were used with default parameters to call SNPs from the resequencing data of the 18 B. juncea and 5 B. napus accessions (Supplementary Table 25), and SNPs with depth < 3X were filtered out. BWA¹⁶ and SAMtools⁴⁷ were used with default parameters to call variants from the B. rapa resequencing data. Ungenotyped SNP loci were imputed using the KNN algorithm⁴⁸. SNPs with MAF > 0.05 were selected for further analysis. At this step, a total of 198,497 SNPs were initially retained, and only non-heterozygous SNPs with integrity > 0.6 were kept for tree construction. To build the tree, all SNPs from the resequenced samples, which represent most B. rapa subvarieties, were concatenated into an alignment with reference to the B. rapa genome. The neighbor-joining tree for the A-subgenomes in the Brassica population was then constructed with MEGA v6.0 using the Kimura 2-parameter model, 1,000 bootstrap replicates and otherwise default parameters.
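For orientation, the Kimura 2-parameter distance that underlies such a neighbor-joining tree can be computed for a pair of aligned sequences as sketched below. This is a stand-alone illustration of the standard formula d = -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q)), where P and Q are the proportions of transitions and transversions; it is not MEGA's implementation, and the function name and the skipping of non-ACGT positions are assumptions.

```python
from math import log, sqrt

TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance between two aligned DNA sequences."""
    pairs = [
        (a, b)
        for a, b in zip(seq1.upper(), seq2.upper())
        if a in "ACGT" and b in "ACGT"  # skip gaps and ambiguous bases
    ]
    if not pairs:
        raise ValueError("no comparable sites")
    n = len(pairs)
    p = sum(1 for pair in pairs if pair in TRANSITIONS) / n
    q = sum(1 for a, b in pairs if a != b and (a, b) not in TRANSITIONS) / n
    return -0.5 * log((1 - 2 * p - q) * sqrt(1 - 2 * q))


# Ten concatenated SNP sites with a single transition difference (A <-> G).
d = k2p_distance("AAAAAAAAAA", "GAAAAAAAAA")
print(round(d, 6))
```

A matrix of such pairwise distances between all accessions is the input a neighbor-joining algorithm needs; bootstrapping resamples alignment columns and repeats the procedure.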

Principal component analysis

A total of 198,497 SNPs were initially identified from the A-subgenomes using the same method and resequenced accessions described above. Only 51,116 high-quality SNPs with integrity >= 0.8 and MAF >= 0.05 were selected for principal component analysis. The EIGENSOFT package⁵⁶ combines functionality from population genetics methods and the EIGENSTRAT stratification correction method. We used the STRATPCA software from the EIGENSOFT package to perform principal component analysis with the 51,116 genetic markers. The analysis showed that the A-subgenomes of vegetable- and oil-use subvarieties of B. juncea clustered near the B. rapa ssp. tricolaris group and far from the other subspecies of B. rapa, supporting the view that their ancestor evolved from one variety of B. rapa, namely B. rapa ssp. tricolaris (Supplementary Figure 13).
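The marker filter applied before the principal component analysis can be sketched as follows. The sketch assumes that "integrity" means the fraction of accessions with a called genotype at a locus and that genotypes are coded as alternative-allele counts (0, 1 or 2) with `None` for missing calls; both the coding and the function names are illustrative assumptions.

```python
def integrity(genotypes):
    """Fraction of accessions with a called (non-missing) genotype."""
    return sum(g is not None for g in genotypes) / len(genotypes)


def minor_allele_frequency(genotypes):
    """Minor allele frequency from alternative-allele counts (0, 1 or 2)."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq)


def keep_snp(genotypes, min_integrity=0.8, min_maf=0.05):
    """Apply the integrity and MAF thresholds to one SNP locus."""
    return (
        integrity(genotypes) >= min_integrity
        and minor_allele_frequency(genotypes) >= min_maf
    )


# Hypothetical loci across five accessions.
snps = {
    "snp1": [0, 0, 2, 2, None],        # integrity 0.8, MAF 0.5
    "snp2": [0, 0, 0, None, None],     # integrity 0.6, monomorphic
    "snp3": [0, 0, 0, 0, 0],           # fully genotyped, monomorphic
}
kept = [name for name, g in snps.items() if keep_snp(g)]
print(kept)
```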

Characteristics of SNP variations from A-subgenomes of B. juncea and B. napus, and vegetable- and oil-use B. juncea

The B. rapa genome was used as the reference for SNP calling. A total of 4,589,419 SNPs were identified simultaneously from 18 B. juncea samples (11 vegetable-use and 7 oil-use) and 5 B. napus samples using the same method described above. To compare the characteristics of SNPs from B. juncea and B. napus, 6 B. juncea samples, comprising 3 vegetable-use and 3 oil-use samples (CN53, CN58, CN04 and CN02, EU07, AU213), and the 5 B. napus samples were set as the B. juncea and B. napus groups, respectively. To characterize vegetable- and oil-use B. juncea, the 11 vegetable-use and 7 oil-use B. juncea samples were set as the vegetable-use and oil-use groups. We kept only SNP loci with full integrity (integrity = 1) for further analysis. Finally, a total of 992,788 SNPs for the B. juncea and B. napus groups and 1,716,765 SNPs for the vegetable- and oil-use groups were considered. A group-dominant SNP (polymorphic SNP) for the B. juncea or B. napus group was defined as a locus at which an allele with a frequency >= 60% in that group differed from the reference. A B. juncea-specific SNP (fixed SNP) was defined as a locus at which an allele had a frequency >= 60% in the B. juncea group, the genotype differed between the B. juncea and B. napus groups, and the allele differed from the reference. Four allele-frequency thresholds (60%, 70%, 80% and 90%) were applied to the SNPs. The same strategy was used for the analysis of dominant and specific SNPs in the vegetable- and oil-use groups (Supplementary Figure 14).
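One way to read the dominant and specific SNP definitions is sketched below. The interpretation is an assumption: each group is summarized by its most frequent allele, "different between groups" is taken to mean that the two groups have different most frequent alleles, and the function names and example alleles are hypothetical.

```python
from collections import Counter


def major_allele(alleles):
    """Return the most frequent allele in a group and its frequency."""
    allele, count = Counter(alleles).most_common(1)[0]
    return allele, count / len(alleles)


def is_dominant(group, ref, threshold=0.6):
    """Group-dominant SNP: major allele differs from the reference and
    reaches the frequency threshold within the group."""
    allele, freq = major_allele(group)
    return allele != ref and freq >= threshold


def is_specific(group, other, ref, threshold=0.6):
    """Group-specific SNP: dominant in `group` and the major allele differs
    from the major allele of the other group."""
    allele, _ = major_allele(group)
    other_allele, _ = major_allele(other)
    return is_dominant(group, ref, threshold) and allele != other_allele


# Hypothetical locus: reference allele C, six B. juncea and five B. napus calls.
juncea = ["T", "T", "T", "T", "T", "C"]
napus = ["C", "C", "C", "C", "C"]
results = {
    t: (is_dominant(juncea, "C", t), is_specific(juncea, napus, "C", t))
    for t in (0.6, 0.7, 0.8, 0.9)
}
print(results)
```

Evaluating the same locus at the four thresholds shows how the counts of dominant and specific SNPs shrink as the required allele frequency rises.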

Formation time estimation for B. juncea and B. napus

The average CDS length of B. juncea, B. napus and their progenitors is around 1,000 bp (Supplementary Table 29). A single mutation in a CDS therefore corresponds to a Ks value of about 0.003, or approximately 0.1 Mya. In the Ks distributions of BraA vs BjuA, BniB vs BjuB, BraA vs BnaA and BolC vs BnaC, artificial peaks can therefore appear in the Ks-distribution plot, which may mislead the divergence time calculation (Supplementary Fig. 12). To validate this assumption, we selected all syntenic gene pairs that had one synonymous substitution site and calculated their Ks using PAML and KaKs_Calculator. The result demonstrated that the Ks method is not appropriate for estimating divergence times over short periods.
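The arithmetic behind the resolution limit can be reproduced with a short calculation. Two inputs are assumptions that do not appear in this section: roughly one third of CDS sites are treated as synonymous, and a synonymous substitution rate of 1.5e-8 per site per year is used because it reproduces the stated correspondence between Ks = 0.003 and 0.1 Mya under T = Ks / (2r).

```python
def divergence_time_mya(ks, rate=1.5e-8):
    """Divergence time in million years from Ks, using T = Ks / (2 * rate)."""
    return ks / (2 * rate) / 1e6


cds_length = 1000                 # average CDS length in bp (from the text)
synonymous_sites = cds_length / 3  # assumed fraction of synonymous sites
ks_one_mutation = 1 / synonymous_sites
time_one_mutation = divergence_time_mya(ks_one_mutation)
print(round(ks_one_mutation, 4), round(time_one_mutation, 3))
```

Because Ks for an average-length gene can only change in steps of about 0.003, divergence times younger than a few hundred thousand years fall onto a handful of discrete values, which is consistent with the artificial peaks described above.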

The Ks method has an inherent flaw when used to calculate divergence times for newly formed species, such as the subgenomes of B. juncea and their parents. To estimate the formation time of B. juncea, we first selected BjuA together with its closest relative genome, and BjuA together with the earliest-diverging B. juncea accessions, according to the phylogenetic tree of the B. rapa population (Figure 2b). The same strategy was applied to B. napus. We then reconstructed CDS sequences for the selected samples from the resequencing data. After multiple sequence alignment with MUSCLE v3.3⁴⁹, a phylogenetic tree was constructed and divergence times were estimated by Bayesian MCMC analysis in BEAST v1.8⁵⁰ with the JIT nucleotide substitution model, a relaxed lognormal clock model, 1 million MCMC generations with parameters sampled every 1,000 generations, and otherwise default parameters. A calibration time for B. oleracea from a previous publication⁵¹ (4.6 ± 0.5 MYA) was adopted as the outgroup calibration point to estimate divergence times for B. juncea and B. napus. We took the divergence time between BjuA and its closest relative genome (tricolaris, red bold line in Figure 2b) as the upper limit of the formation time, and the divergence time between BjuA and the earliest-diverging B. juncea accessions (B. juncea, red bold line in Figure 2c) as the lower limit. Accordingly, we took the divergence between BnaA and its closest relative genome (European rapa, blue bold line in Figure 3a) as the upper limit of the formation time, and that between BnaA and the earliest-diverging B. napus accessions (B. napus, blue bold line in Figure 2b) as the lower limit (Figure 2c).

Detection of selective sweep signals

Average pairwise diversity (π) and the population differentiation statistic (Fst) were calculated from the reference SNPs using the Genome Analysis Toolkit (GATK) v3.4⁵² with default parameters. Selective sweep regions were identified in the 10 vegetable-use and 7 oil-use B. juncea sub-varieties by combining Fst outliers and π ratio outliers (θπ, vegetable-use/oil-use). π ratios and Fst were calculated in 100-kb sliding windows with 10-kb steps. Genomic windows in which the average Fst fell in the top 5% of the empirical Fst distribution were defined as Fst outliers. Similarly, genomic windows in which the π ratio fell in the top 5% of the empirical π ratio distribution were defined as π ratio outliers. Adjacent windows extending to 10 kb likely represent the effect of a single divergent region and were therefore linked to define a ‘candidate gene region’ (Supplementary Tables 29, 30 and 31).
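The window-based outlier scan can be sketched as follows. The per-window Fst and π ratio values are taken as given, the function names are hypothetical, and merging is interpreted here as joining outlier windows that overlap or lie within 10 kb of each other; that reading of the linking rule is an assumption.

```python
def sliding_windows(chrom_length, size=100_000, step=10_000):
    """Yield (start, end) coordinates of sliding windows along a chromosome."""
    start = 0
    while start < chrom_length:
        yield start, min(start + size, chrom_length)
        start += step


def top_fraction_cutoff(values, fraction=0.05):
    """Smallest value that still belongs to the top `fraction` of values."""
    ranked = sorted(values, reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return ranked[k - 1]


def candidate_regions(windows, fst, pi_ratio, fraction=0.05, max_gap=10_000):
    """Merge windows that are outliers for both Fst and the pi ratio."""
    fst_cut = top_fraction_cutoff(fst, fraction)
    pi_cut = top_fraction_cutoff(pi_ratio, fraction)
    hits = [
        w for w, f, r in zip(windows, fst, pi_ratio)
        if f >= fst_cut and r >= pi_cut
    ]
    regions = []
    for start, end in hits:
        if regions and start - regions[-1][1] <= max_gap:
            regions[-1][1] = max(regions[-1][1], end)  # extend previous region
        else:
            regions.append([start, end])
    return [tuple(r) for r in regions]


# Toy chromosome of 400 kb with two neighbouring outlier windows.
windows = list(sliding_windows(400_000))
fst = [0.1] * len(windows)
pi_ratio = [1.0] * len(windows)
fst[5], fst[6] = 0.9, 0.8
pi_ratio[5], pi_ratio[6] = 5.0, 4.0
regions = candidate_regions(windows, fst, pi_ratio)
print(regions)
```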

SUPPLEMENTARY REFERENCES

1. Li, R. et al. The sequence and de novo assembly of the giant panda genome. Nature 463, 311–317 (2010).

2. Ohmido, N. et al. Quantification of total genomic DNA and selected repetitive sequences reveals concurrent changes in different DNA families in indica and japonica rice. Mol. Gen. Genet. 263, 388–394 (2000).

3. International Rice Genome Sequencing Project. The map-based sequence of the rice genome. Nature 436, 793–800 (2005).

4. Arumuganathan, K. & Earle, E.D. Nuclear DNA content of some important plant species. Plant Mol. Biol. Rep. 9, 208–218 (1991).

5. Johnston, J.S. et al. Evolution of genome size in Brassicaceae. Ann. Bot. 95, 229–235 (2005).

6. Haas, B.J. et al. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nat. Protoc. 8, 1494–1512 (2013).

7. Lee, H. et al. Error correction and assembly complexity of single molecule sequencing reads. bioRxiv doi:10.1101/006395 (2014).

8. Maccallum, I. et al. ALLPATHS 2, small genomes assembled accurately and with high continuity from short paired reads. Genome Biol. 10, R103 (2009).

9. English, A.C. et al. Mind the gap, upgrading genomes with Pacific Biosciences RS long-read sequencing technology. PLoS One 7, e47768 (2012).

10. Luo, R. et al. SOAPdenovo2, an empirically improved memory-efficient short-read de novo assembler. Giga Sci. 1, 18 (2012).

11. Parra, G., Bradnam, K. & Korf, I. CEGMA, a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics 23, 1061–1067 (2007).

12. Ye, Y.N., Hua, Z.G., Huang, J., Rao, N. & Guo, F.B. CEG, a database of essential gene clusters. BMC Genomics 14, 769 (2013).

13. Gu, S., Fang, L. & Xu, X. Using SOAPaligner for short reads alignment. Curr. Protoc. Bioinformatics doi:10.1002/0471250953 (2013).

14. Huang, X. et al. High-throughput genotyping by whole-genome resequencing. Genome Res. 19, 1068–1076 (2009).

15. Mun, J. et al. Construction of a reference genetic map of Raphanus sativus based on genotyping by whole-genome resequencing. Theor. Appl. Genet. 128, 259–272 (2015).

16. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009).

17. DePristo, M.A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 43, 491–498 (2011).

18. Fragoso, C.A., Heffelfinger, C., Zhao, H. & Dellaporta, S.L. Imputing genotypes in biallelic populations from low-coverage sequence data. Genetics 202, 487–495 (2016).

19. Liu, D.Y. et al. Construction and analysis of high-density linkage map using high-throughput sequencing data. PLoS One 9, e98855 (2014).

20. Zou, J. et al. Co-linearity and divergence of the A subgenome of Brassica juncea compared with other Brassica species carrying different A subgenomes. BMC Genomics 17, 18 (2016).

21. Tang, H. et al. ALLMAPS, robust scaffold ordering based on multiple maps. Genome Biol. 16, 3 (2015).

22. Price, A.L., Jones, N.C. & Pevzner, P.A. De novo identification of repeat families in large genomes. Bioinformatics 21 S1, i351–i358 (2005).

23. Xu, Z. & Wang, H. LTR_FINDER, an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res. 35, W265–268 (2007).

24. Han, Y. & Wessler, S.R. MITE-Hunter, a program for discovering miniature inverted-repeat transposable elements from genomic sequences. Nucleic Acids Res. 38, e199 (2010).

25. Edgar, R.C. & Myers, E.W. PILER, identification and classification of genomic repeats. Bioinformatics 21 S1, i152–i158 (2005).

26. Wicker, T., Matthews, D.E. & Keller, B. TREP, a database for Triticeae repetitive elements. Trends Plant Sci. 7, 561–562 (2002).

27. Bao, W., Kojima, K.K. & Kohany, O. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mobile DNA 6, 11 (2015).

28. Chen, N. in Current Protocols in Bioinformatics Version 1 (ed. Andreas D. Baxevanis) 1–14 (John Wiley & Sons, Inc., 2004).

29. Elsik, C.G. et al. Creating a honey bee consensus gene set. Genome Biol. 8, R13 (2007).

30. Birney, E., Clamp, M. & Durbin, R. GeneWise and Genomewise. Genome Res. 14, 988–995 (2004).

31. Burge, C. & Karlin, S. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268, 78–94 (1997).

32. Allen, J.E., Majoros, W.H., Pertea, M. & Salzberg, S.L. JIGSAW, GeneZilla, and GlimmerHMM, puzzling out the features of human genes in the ENCODE regions. Genome Biol. 7 S1, S9 1–13 (2006).

33. Trapnell, C. et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat. Protoc. 7, 562–578 (2012).

34. Hunter, S. et al. InterPro in 2011, new developments in the family and domain prediction database. Nucleic Acids Res. 40, D306–312 (2012).

35. Hunter, S. et al. InterPro, the integrative protein signature database. Nucleic Acids Res. 37, D211–215 (2009).

36. Lowe, T.M. & Eddy, S.R. tRNAscan-SE, a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25, 955–964 (1997).

37. Nawrocki, E.P. & Eddy, S.R. Infernal 1.1, 100-fold faster RNA homology searches. Bioinformatics 29, 2933–2935 (2013).

38. Nawrocki, E.P. et al. Rfam 12.0, updates to the RNA families database. Nucleic Acids Res. 43, D130–D137 (2015).

39. Kielbasa, S.M., Wan, R., Sato, K., Horton, P. & Frith, M.C. Adaptive seeds tame genomic sequence comparison. Genome Res. 21, 487–493 (2011).

40. Tang, H. et al. Screening synteny blocks in pairwise genome comparisons through integer programming. BMC Bioinformatics 12, 102 (2011).

41. Wang, X. et al. The genome of the mesopolyploid crop species Brassica rapa. Nat. Genet. 43, 1035–1039 (2011).

42. Wu, T.D. & Watanabe, C.K. GMAP, a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics 21, 1859–1875 (2005).

43. Blanco, E., Parra, G. & Guigo, R. in Current Protocols in Bioinformatics Version 2 (editorial board, Andreas D. Baxevanis) 1–28 (John Wiley & Sons, Inc., 2007).

44. Kim, D. et al. TopHat2, accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 14, R36 (2013).

45. Trapnell, C. et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511–515 (2010).

46. Yang, Z. PAML 4, phylogenetic analysis by maximum likelihood. Mol. Biol. Evol. 24, 1586–1591 (2007).

47. Li, H. et al. The sequence alignment/map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).

48. Chen, W. et al. Genome-wide association analyses provide genetic and biochemical insights into natural variation in rice metabolism. Nat. Genet. 46, 714–721 (2014).

49. Edgar, R.C. MUSCLE, multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32, 1792–1797 (2004).

50. Drummond, A.J., Suchard, M.A., Xie, D. & Rambaut, A. Bayesian phylogenetics with BEAUti and the BEAST 1.7. Mol. Biol. Evol. 29, 1969–1973 (2012).

51. Liu, S. et al. The Brassica oleracea genome reveals the asymmetrical evolution of polyploid genomes. Nat. Commun. 5, 3930 (2014).

52. McKenna, A. et al. The Genome Analysis Toolkit, a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).

53. Chaisson, M.J. & Tesler, G. Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR), theory and application. BMC Bioinformatics 13, 238 (2012).

54. Cheng, F. et al. Biased gene fractionation and dominant gene expression among the subgenomes of Brassica rapa. PLoS One 7, e36442 (2012).

55. Schnable, J.C., Springer, N.M. & Freeling, M. Differentiation of the maize subgenomes by genome dominance and both ancient and ongoing gene loss. Proc. Natl. Acad. Sci. USA 108, 4069–4074 (2011).

56. Price, A.L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38, 904–909 (2006).

SUPPLEMENTARY FIGURES

Figure 1. K-mer 17 distribution of B. juncea

Figure 2. Plot of sub-reads length distribution of PacBio sequencing data.

Figure 3. Scaffold chimera error sketch map during genome assembly

Figure 4. Pseudo-chromosomes of the B. juncea genome. Pseudo-chromosomes were constructed from two genetic maps, T84-DTC and SY-PM, using ALLMAPS with equal weights. Green lines connect the B. juncea genome scaffolds to the linkage groups of T84-DTC, and yellow lines connect the B. juncea genome scaffolds to the linkage groups of SY-PM.

Figure 5. K-mer 17 distribution of B. nigra.

Figure 6. Genome assembly assessment using PacBio sub-reads, BAC sequences and paired-end reads. a, Alignment of PacBio sub-reads and paired-end reads (insert sizes: 4 kb, 5 kb, 10 kb, 15 kb) to the B. juncea genome sequence. b, Alignment of BACs and paired-end reads (insert sizes: 4 kb, 5 kb, 10 kb, 15 kb) to the B. nigra genome sequence.

Figure 7a. Comparison of the main transposable element (TE) types in syntenic regions of B. juncea subgenomes and their ancestors (B. rapa and B. nigra).

Figure 7b. Comparison of the main transposable element types in syntenic regions of B. napus subgenomes and their ancestors (B. rapa and B. oleracea).

Figure 8. Procedure for identifying newly formed transposable elements (TEs).

Figure 9. PCR amplification of newly identified transposable elements (TEs) in B. rapa, B. nigra and B. juncea.

Figure 10a. Comparison of newly formed transposable elements (TEs) in B. juncea subgenomes and their ancestors (B. rapa and B. nigra).

Figure 10b. Comparison of newly formed transposable elements (TEs) in B. napus subgenomes and their ancestors (B. rapa and B. oleracea).

Figure 11. Gene loss validation using PCR amplification.

Figure 12a. Gene ontology of lost genes from the BjuA of B. juncea.

Figure 12b. Gene ontology of lost genes from the BjuB of B. juncea.

Figure 13. a

Figure 14. Characteristics of SNP variations from A-subgenomes of B. juncea and B. napus, and vegetable- and oil-use B. juncea.

Figure 15. Estimate of molecular divergence between two B. juncea subgenomes, two B. napus subgenomes and their progenitors (B. rapa, B. nigra, B. oleracea).

Figure 16. Venn diagram of homoeolog expression dominance at four developmental stages of B. juncea. Seeds were sown on Oct. 5th. The stem of yongan1 was collected 18 weeks after seeding (blue), the stem of yongan2 20 weeks after seeding (orchid), the stem of yongan3 22 weeks after seeding (green) and the stem of yongan4 24 weeks after seeding (yellow). The total number of dominant genes in each area is indicated in black. Red numbers indicate BjuA genes and blue numbers indicate BjuB genes.

Figure 17. Venn diagram of homoeolog expression dominance in different tissues of B. juncea. Seed coat, stems and stems from mutants are represented by blue, green and yellow, respectively. The stems of both daye3bianzhong and yongan3 were sampled 22 weeks after seeding, which is one week before the start of inflation. The total numbers of BjuA and BjuB genes are shown in red and blue, respectively. The total number in each area is indicated in black.

Figure 18. KEGG analysis of genes exhibiting homoeolog expression dominance in B. juncea.

Figure 19. Boxplot of the distribution of Ka values between subgenomes (BjuA and BjuB) and among dominant, subordinate and neutral (non-dominant) genes defined by homoeolog expression dominance in B. juncea. We performed a permutation test with 1,000 permutations to assess the statistical significance of differences (P < 0.001).

Figure 20. Genome-wide homoeolog expression dominance in B. juncea. All numbers represent the IDs of specific genes located in regions identified in the selective sweep analysis between vegetable- and oil-use sub-varieties. Genes involved in glucosinolate and lipid processes are marked with triangles of different colors. Genes expressed dominantly in either subgenome are marked with a black arrow, and gene losses are marked with blue rectangles.

Figure 21. Heat map for genes involved in auxin and ethylene signal pathways in vegetable-use (highlighted in green) and oil-use (highlighted in orange) sub-varieties of B. juncea from the RNA-Seq data.

SUPPLEMENTARY TABLES

Table 1. Genome size estimation by flow cytometry for B. juncea

Sample                              Peak value     CV (%)       Genome size (Mb)
O. sativa (Nipponbare)/B. juncea    30.82/71.55    6.45/6.52    903.1
O. sativa (Nipponbare)/B. juncea    27.63/65.85    7.81/7.76    927.1
O. sativa (Nipponbare)/B. juncea    29.8/71.71     5.33/5.58    936.1
Mean                                                            922.1

Table 2a. Summary of genome sequencing strategy for B. juncea

Paired-end library    Insert size    Total data (Gb)    Depth (×)*    Q20 (%)
Illumina reads        180bp          22.56              26.2          92.53
                      250bp          39.55              46.0          95.4
                      500bp          14.09              16.4          92.46
                      3Kbp           15.24              17.7          95.54
                      3Kbp           16.43              19.1          95.25
                      5Kbp           17.18              20.0          78.52
                      8Kbp           7.81               9.1           98.14
                      8Kbp           3.17               3.7           78.38
                      10Kbp          2.71               3.2           79.84
                      10Kbp          4.97               5.8           81.69
                      15Kbp          1.83               2.1           79.1
                      15Kbp          2.35               2.7           81.11
                      17Kbp          3.29               3.8           80.74
Total                                151.19             175.8         91.09

Note: The estimated genome size was 0.86 Gb.

Table 2b. Statistics of PacBio sub-read length distribution

Total read number          1,053,835
Base number (bp)           11,088,237,501
Depth (X)                  12.03
Sub-read N50 (bp)          13,981
Average sub-read (bp)      10,522
Longest sub-read (bp)      74,870

Table 3. Summary of B. juncea genome assembly

Strategy                               Statistic            Contig         Scaffold
HiSeq sequencing                       Total length (bp)    640,594,512    701,290,321
                                       Total number         48,985         11,891
                                       Max length (bp)      326,883        6,107,082
                                       N50 size (bp)        28,225         710,138
                                       N90 size (bp)        6,024          83,000
HiSeq sequencing + PacBio              Total length (bp)    760,709,244    784,227,516
                                       Total number         32,581         10,784
                                       Max length (bp)      569,668        4,561,631
                                       N50 size (bp)        61,273         855,041
                                       N90 size (bp)        12,728         94,898
HiSeq sequencing + PacBio + BioNano    Total length (bp)    -              955,000,958 (gap 194,291,714)
                                       Total number         -              10,581
                                       Max length (bp)      -              7,842,264
                                       N50 size (bp)        -              1,523,604
                                       N90 size (bp)        -              124,389

Note: Scaffolds shorter than 1,000 bp are excluded.

Table 4. Summary of BioNano data collection and assembly statistics

Statistic                      Single molecules (> 150 kb)    Map assembly
No. molecules/genome maps      996,648                        922
Total length                   205 Gb                         1,101 Mb
Coverage                       222 X                          -
Molecule/map N50               217 Kb                         1.84 Mb
Longest molecule/map (bp)      3,174,225                      11,038,396
Average molecule/scaffold      206 Kb                         1.19 Mb

Table 6. Summary of genetic map of B. juncea from resequencing of F2 population

Linkage group    Chromosome    Marker number    Total genetic distance (cM)    Average genetic distance (cM)
A01              J01           369              330.59                         0.9
A02              J02           259              232.13                         0.9
A03              J03           412              293.09                         0.71
A04              J04           290              195.34                         0.67
A05              J05           299              237.3                          0.79
A06              J06           231              222.9                          0.96
A07              J07           231              218.12                         0.94
A08              J08           243              215.23                         0.89
A09              J09           497              431.52                         0.87
A10              J10           156              103.02                         0.66
B01              J11           164              166.91                         1.02
B02              J12           455              406.05                         0.89
B03              J13           399              414.27                         1.04
B04              J14           170              173.91                         1.02
B05              J15           358              346.55                         0.97
B06              J16           182              137.28                         0.75
B07              J17           226              190.53                         0.84
B08              J18           392              395.51                         1.01
total            total         5,333            4,710.25                       0.88

Table 7. Summary of a published genetic map of B. juncea (a)

Linkage group    Chromosome    Marker number    Total genetic distance (cM)    Average genetic distance (cM)
A01              J01           96               89.7                           0.93
A02              J02           102              69.24                          0.68
A03              J03           103              94.93                          0.92
A04              J04           61               63.3                           1.04
A05              J05           82               93.53                          1.14
A06              J06           98               98.02                          1
A07              J07           99               71.5                           0.72
A08              J08           88               64.11                          0.73
A09              J09           110              107.97                         0.98
A10              J10           71               68.42                          0.96
B01              J11           62               56.42                          0.91
B02              J12           120              103.59                         0.86
B03              J13           80               59.83                          0.75
B04              J14           86               115.44                         1.34
B05              J15           66               102.17                         1.55
B06              J16           54               75.79                          1.4
B07              J17           139              102.24                         0.74
B08              J18           114              123.84                         1.09
total                          1,631            1,560.04                       0.96

(a) Data from a published genetic map of B. juncea (Zou et al., BMC Genomics, 2016, 17: 18).


Table 8a. Statistics of B. juncea pseudo-chromosomes

Chromosome ID | Chromosome size (Mbp) | Gaps (%) | Anchored percentage* (%) | Pearson correlation coefficient (T84-DTC) | Pearson correlation coefficient (SY-PM)

Subgenome BjuA (402.12 Mb)
J01 | 45.32 | 26.62 | 11.27 | 0.96 | 0.99
J02 | 25.47 | 11.80 | 6.33 | 0.96 | 0.90
J03 | 43.96 | 8.48 | 10.93 | 0.93 | 0.99
J04 | 33.18 | 11.31 | 8.25 | 0.94 | 0.97
J05 | 38.84 | 23.74 | 9.66 | 0.85 | 0.99
J06 | 38.05 | 12.63 | 9.46 | 0.90 | 0.99
J07 | 28.55 | 18.69 | 7.10 | 0.93 | 0.97
J08 | 29.35 | 9.15 | 7.30 | 0.93 | 0.99
J09 | 62.78 | 17.48 | 15.61 | 0.97 | 0.98
J10 | 22.35 | 6.53 | 5.56 | 0.94 | 0.97
NA | 34.27 | 7.38 | - | - | -
total | 367.85 | 15.50 | 91.48 | - | -

Subgenome BjuB (547.53 Mb)
J11 | 32.47 | 26.12 | 5.93 | 0.89 | 0.99
J12 | 60.37 | 22.17 | 11.03 | 0.70 | 0.97
J13 | 83.57 | 36.79 | 15.26 | 0.60 | 0.99
J14 | 28.34 | 4.80 | 5.18 | 0.96 | 0.97
J15 | 50.57 | 20.32 | 9.24 | 0.79 | 0.99
J16 | 18.74 | 1.20 | 3.42 | 0.84 | 0.99
J17 | 44.22 | 28.69 | 8.08 | 0.72 | 0.97
J18 | 77.67 | 36.23 | 14.19 | 0.66 | 0.99
NA | 151.58 | 19.38 | - | - | -
total | 395.95 | 26.59 | 72.32 | 0.96 | 0.99

Unknown (5.25 Mbp)

B. juncea (954.90 Mb)
total | 763.80 | - | 79.99 | - | -

* Percentage of scaffold length anchored to pseudochromosomes. NA: non-anchored


Table 8b. Statistics of B. nigra pseudochromosomes

Chromosome ID | Chromosome size (Mb) | Gaps (%) | Anchored percentage* (%) | Pearson correlation coefficient (T84-DTC) | Pearson correlation coefficient (SY-PM)

B. nigra (402.05 Mbp)
B01 | 24.69 | 8.12 | 6.14 | 0.90 | 0.8
B02 | 37.16 | 9.26 | 9.24 | 0.77 | 0.99
B03 | 40.13 | 10.11 | 9.98 | 0.75 | 0.96
B04 | 27.51 | 6.57 | 6.84 | 0.85 | 0.99
B05 | 33.55 | 11.10 | 8.34 | 0.80 | 0.95
B06 | 24.10 | 5.67 | 5.99 | 0.90 | 0.99
B07 | 32.34 | 5.88 | 8.04 | 0.97 | 0.97
B08 | 47.31 | 8.09 | 11.77 | 0.86 | 0.98
NA | 130.07 | 14.95 | - | - | -
total | 266.79 | 8.30 | 66.36 | - | -

* Percentage of scaffold length anchored to pseudochromosomes. NA: non-anchored


Table 9. Summary of subgenome size in B. juncea, B. nigra and B. rapa (a)

Category | B. juncea BjuA | B. juncea BjuB | B. juncea Unknown | B. rapa BraA | B. nigra BniB
genome size (Mb) | 402.12 | 547.53 | 5.25 | 283.82 | 396.86
gene number | 40,256 | 39,414 | 380 | 41,020 | 47,974
TE content (bp) | 115,995,819 | 197,386,909 | 2,737,168 | 82,607,905 | 149,046,980

a. B. rapa genome was from Wang et al.


Table 10. Summary of genome sequencing strategy for B. nigra

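Paired-end library | Insert size | Total data (Gb) | Depth (×)* | Q20 (%)
Illumina reads | 180 bp | 11.17 | 18.6 | 95.98
Illumina reads | 200 bp | 11.55 | 19.3 | 96.23
Illumina reads | 300 bp | 4.59 | 7.6 | 94.28
Illumina reads | 400 bp | 5.87 | 9.8 | 92.02
Illumina reads | 3 Kbp | 7.19 | 12.0 | 79.1
Illumina reads | 4 Kbp | 4.76 | 7.9 | 79.15
Illumina reads | 5 Kbp | 4.77 | 8.0 | 80.41
Illumina reads | 8 Kbp | 1.37 | 2.3 | 81.07
Illumina reads | 10 Kbp | 4.03 | 6.7 | 80.56
Illumina reads | 15 Kbp | 2.28 | 3.8 | 81.56
Total | - | 57.59 | 95.99 | 88.75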

Note: The estimated genome size was 0.591 Gb.


Table 11. Summary of B. nigra genome assembly

Category | BniB contig | BniB scaffold
Total length (bp) | 354,586,867 | 396,857,455
Total number | 25,103 | 5,120
Max length (bp) | 332,462 | 5,330,927
N50 size (bp) | 31,119 | 557,272
N90 size (bp) | 6,778 | 66,961

Note: Scaffolds shorter than 1,000 bp were excluded.
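The N50 and N90 sizes in Table 11 follow the usual assembly convention: sort the sequences from longest to shortest and report the length at which the running total first reaches 50% (or 90%) of the assembly. A minimal sketch of that convention is given below; the function name and the toy lengths are illustrative and are not taken from the B. nigra assembly.

```python
def n_stat(lengths, fraction=0.5):
    """Return the N-statistic of a set of sequence lengths (N50 by default).

    The lengths are sorted in descending order and the length at which the
    cumulative sum first reaches `fraction` of the total is returned.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    threshold = fraction * sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running >= threshold:
            return length
    return ordered[-1]


# Toy example (total length 300): not real contig or scaffold data.
toy_lengths = [100, 80, 60, 40, 20]
n50 = n_stat(toy_lengths)        # cumulative 100, 180 -> reaches 150 at 80
n90 = n_stat(toy_lengths, 0.9)   # cumulative ..., 240, 280 -> reaches 270 at 40
```

To mirror the note under Table 11, sequences shorter than 1,000 bp would be filtered out of `lengths` before calling the function.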


Table 12. PacBio sub-reads validation for B. juncea genome assembly

Sub-read ID | Sub-read length (bp) | Alignment length (bp) | Coverage (a) (%) | Identity (b) (%)
1 | 46984 | 46984 | 100.00 | 96.39
2 | 45676 | 45676 | 100.00 | 96.13
3 | 44076 | 42926 | 97.39 | 99.47
4 | 42564 | 42490 | 99.83 | 95.68
5 | 42545 | 42545 | 100.00 | 99.14
6 | 42331 | 42331 | 100.00 | 99.39
7 | 41857 | 41857 | 100.00 | 98.69
8 | 41658 | 41658 | 100.00 | 99.92
9 | 41653 | 41653 | 100.00 | 99.97
10 | 41529 | 41529 | 100.00 | 99.89

a. Coverage = Alignment length / Sub-read length;

b. Identity = Identical length / Alignment length.


Table 13. BAC validation for B. nigra genome assembly

Accession (a) | BAC length (bp) | Number of N (b) | Status | Alignment length (bp) | Coverage (c) (%) | Identity (d) (%)
KC795992.1 | 51,800 | 0 | random | 51,800 | 100.00 | 99.97
KC795993.1 | 73,424 | 500 | ordered | 70,562 | 96.76 | 99.80
KC795994.1 | 47,925 | 0 | random | 47,925 | 100.00 | 99.95
KC795995.1 | 102,525 | 0 | random | 102,098 | 99.58 | 99.40
KC795996.1 | 163,490 | 200 | ordered | 163,250 | 99.98 | 99.99
KC795997.1 | 51,796 | 0 | random | 51,796 | 100.00 | 99.97
KC795998.1 | 73,803 | 0 | random | 72,848 | 98.71 | 99.99
KC795999.1 | 42,114 | 0 | random | 42,114 | 100.00 | 100.00
KC796000.1 | 53,921 | 0 | random | 53,059 | 98.40 | 99.93
KC796001.1 | 85,295 | 0 | random | 84,340 | 98.88 | 99.99
KC796002.1 | 73,752 | 0 | random | 72,797 | 98.71 | 99.99
KC796003.1 | 70,444 | 0 | random | 70,444 | 100.00 | 99.99
KC796004.1 | 118,270 | 900 | ordered | 107,447 | 91.55 | 98.16
KC796005.1 | 71,086 | 100 | ordered | 70,986 | 100.00 | 99.98
KC796006.1 | 56,764 | 0 | random | 53,933 | 95.01 | 99.98

a. All BACs were from Sharma et al., PloS One, 2014, 9(4): e93260.

b. BACs containing N gaps are 'working draft' sequences. They consist of ordered contigs concatenated with a constant number of Ns.

c. Coverage = Alignment length / (BAC length - N number).

d. Identity = Identical length / Alignment length.
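The coverage definition in the Table 13 footnotes can be checked directly against the tabulated values. The sketch below implements Coverage = Alignment length / (BAC length - N number) and applies it to the rows that contain N gaps; the function name and the dictionary layout are illustrative only.

```python
def bac_coverage(alignment_len: int, bac_len: int, n_count: int = 0) -> float:
    """Coverage (%) = alignment length / (BAC length - number of N) * 100."""
    effective_len = bac_len - n_count
    if effective_len <= 0:
        raise ValueError("BAC length must exceed the number of N")
    return 100.0 * alignment_len / effective_len


# Rows copied from Table 13: accession -> (BAC length, number of N, alignment length).
rows = {
    "KC795993.1": (73424, 500, 70562),
    "KC796004.1": (118270, 900, 107447),
    "KC796005.1": (71086, 100, 70986),
}
coverage = {
    acc: round(bac_coverage(aln, length, n), 2)
    for acc, (length, n, aln) in rows.items()
}
```

Subtracting the N count matters for the 'working draft' BACs: KC796005.1 reaches 100% coverage only because its 100 Ns are excluded from the denominator.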


Table 14a. Completeness inspection based on CEG database for B. juncea and B. nigra

Species | Number of CEGs (a) | Percent of CEGs | Number of conserved CEGs (b) | Percent of conserved CEGs
B. juncea | 453 | 98.9 % | 245 | 98.8 %
B. nigra | 458 | 100 % | 248 | 100 %

a. Number of genes among the 458 CEGs present in the assembly
b. Number of genes among the 248 highly conserved CEGs present in the assembly


Table 14b. Genome completeness assessment for B. juncea and B. nigra based on EST dataset

Species | Dataset | Number of EST | Total length of EST (bp) | Sequences covered by one scaffold (>50%): Number | Sequences covered by one scaffold (>50%): Percent
B. juncea | >=500 bp | 23,002 | 31,628,607 | 22,665 | 98.53%
B. juncea | >=1000 bp | 13,152 | 24,615,370 | 13,053 | 99.25%
B. nigra | >=500 bp | 18,344 | 25,187,114 | 17,878 | 97.46%
B. nigra | >=1000 bp | 10,729 | 19,680,640 | 10,619 | 98.97%


Table 16a. Statistics of confident newly formed TEs and common TEs

Categories | B. juncea: BjuA | B. juncea: BraA (a) | B. juncea: BjuB | B. juncea: BniB | B. napus: BnaA | B. napus: BraA (b) | B. napus: BnaC | B. napus: BolC
Confident | 1,108 | 805 | 1,063 | 1,089 | 977 | 978 | 1,267 | 1,357
Common (c) | 43,843 | 38,900 | 57,980 | 58,624 | 37,452 | 39,682 | 106,216 | 111,932

Note: a. Ancestral genome of B. rapa and subgenome of B. juncea. b. Ancestral genome of B. rapa and subgenome of B. napus. c. The same transposable elements (TEs) found at syntenic positions of the genomes.


Table 16b. Classification of confident newly identified TEs

Categories | B. juncea: BraA (a) | B. juncea: BjuA | B. juncea: BniB | B. juncea: BjuB | B. napus: BraA (b) | B. napus: BnaA | B. napus: BolC | B. napus: BnaC
ClassI/DIRS | 9 | 17 | 11 | 14 | 12 | 13 | 21 | 25
ClassI/LINE | 37 | 64 | 45 | 34 | 65 | 70 | 81 | 77
ClassI/LTR | 8 | 18 | 29 | 24 | 9 | 8 | 18 | 5
ClassI/LTR/Copia | 76 | 82 | 107 | 92 | 111 | 61 | 151 | 104
ClassI/LTR/Gypsy | 53 | 78 | 108 | 105 | 78 | 72 | 145 | 158
ClassI/PLE|LARD | 39 | 71 | 61 | 58 | 60 | 67 | 95 | 94
ClassI/SINE | 24 | 20 | 21 | 18 | 22 | 22 | 45 | 37
ClassI/SINE|TRIM | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
ClassI/TRIM | 7 | 12 | 10 | 4 | 5 | 7 | 6 | 4
ClassI/Unknown | 1 | 1 | 2 | 1 | 3 | 2 | 2 | 3
ClassII/Helitron | 7 | 9 | 3 | 0 | 8 | 5 | 14 | 12
ClassII/MITE | 175 | 223 | 242 | 266 | 200 | 179 | 188 | 211
ClassII/Maverick | 1 | 3 | 1 | 2 | 2 | 1 | 2 | 0
ClassII/TIR | 81 | 115 | 138 | 170 | 114 | 78 | 183 | 149
ClassII/Unknown | 11 | 15 | 10 | 19 | 16 | 15 | 24 | 24
PotentialHostGene | 19 | 7 | 13 | 11 | 10 | 22 | 30 | 35
Unknown | 257 | 373 | 262 | 271 | 263 | 355 | 352 | 329
total | 805 | 1,108 | 1,063 | 1,089 | 978 | 977 | 1,357 | 1,267

Note: a. Ancestral genome of B. rapa and subgenome of B. juncea. b. Ancestral genome of B. rapa and subgenome of B. napus.


Table 17. Statistics of origination of newly identified TEs

Type | B. juncea: BjuA | B. juncea: BjuB | B. napus: BnaA | B. napus: BnaC
Intra-subgenome | 738 | 755 | 339 | 651
Inter-subgenome | 163 | 147 | 23 | 18
Unknown | 207 | 161 | 599 | 570

Note: Three origins of new transposable elements (TEs) were distinguished (Intra-subgenome = internal TEs; Inter-subgenome = TEs deriving from the opposite subgenome; Unknown = origin could not be identified).


Table 18. Summary of predicted genes in B. juncea, B. napus and their ancestors (B. rapa, B. nigra and B. oleracea)

Category | BraA | BniB | BolC | BjuA | BjuB | B. juncea unknown | B. juncea total | BnaA | BnaC | B. napus unknown | B. napus total
Gene number | 41,020 | 49,826 | 45,758 | 40,256 | 39,414 | 380 | 80,050 | 44,452 | 56,055 | 533 | 101,040
Gene length (bp) | 82,756,981 | 95,089,874 | 80,521,861 | 86,175,681 | 82,169,664 | 516,981 | 168,862,326 | 90,932,969 | 105,697,258 | 613,454 | 197,243,681
Average gene length (bp) | 2,017.48 | 1,908.44 | 1,759.73 | 2,140.69 | 2,084.78 | 1,360.48 | 2,109.46 | 2,045.64 | 1,885.60 | 1,150.95 | 1,952.13
Exon number | 206,584 | 225,634 | 208,039 | 188,179 | 176,679 | 1,193 | 366,051 | 227,265 | 266,792 | 1,662 | 495,719
Exon length (bp) | 47,916,296 | 55,849,862 | 47,242,763 | 45,642,107 | 42,974,822 | 324,206 | 88,941,135 | 47,174,160 | 53,174,968 | 312,555 | 100,661,683
Average exon length (bp) | 231.95 | 247.52 | 227.09 | 242.55 | 243.24 | 271.76 | 242.97 | 207.57 | 199.31 | 188.06 | 203.06
CDS length per gene (bp) | 1,168.12 | 1,120.90 | 1,032.45 | 1,133.80 | 1,090.34 | 853.17 | 1,111.07 | 1,061.24 | 948.62 | 586.41 | 996.26
Exon num per gene | 5.04 | 4.53 | 4.55 | 4.67 | 4.48 | 3.14 | 4.57 | 5.11 | 4.76 | 3.12 | 4.91
Intron number | 165,564 | 175,808 | 162,281 | 147,923 | 137,265 | 813 | 286,001 | 182,813 | 210,737 | 1,129 | 394,679
Intron length (bp) | 34,651,585 | 39,196,751 | 33,217,700 | 41,671,111 | 40,192,063 | 237,626 | 82,100,800 | 34,635,213 | 42,663,705 | 248,973 | 77,547,891
Average intron length (bp) | 209.29 | 222.95 | 204.69 | 281.71 | 292.81 | 292.28 | 287.06 | 189.46 | 202.45 | 220.53 | 196.48
Intron number per gene | 4.04 | 3.53 | 3.55 | 3.67 | 3.48 | 2.14 | 3.57 | 4.11 | 3.76 | 2.12 | 3.91
Intron length per gene (bp) | 844.75 | 786.67 | 725.94 | 1,035.15 | 1,019.74 | 625.33 | 1,025.62 | 779.16 | 761.10 | 467.12 | 767.50


Table 19. Statistics of genes annotated by different databases

Database | B. juncea total gene number (a) (80,050) | Percentage | BjuA gene number (a) (40,256) | Percentage | BjuB gene number (a) (39,414) | Percentage | BjuO gene number (a) (380) | Percentage | BniB gene number (a) (49,826) | Percentage
NR | 77,496 | 96.81% | 39,311 | 97.65% | 37,810 | 95.93% | 359 | 94.47% | 47,056 | 94.44%
SwissProt | 57,422 | 71.73% | 29,447 | 73.15% | 27,730 | 70.36% | 232 | 61.05% | 38,135 | 76.54%
COG | 25,522 | 31.88% | 20,588 | 51.14% | 18,905 | 47.97% | 130 | 34.21% | 17,031 | 34.18%
GO | 51,906 | 64.84% | 26,746 | 66.44% | 24,968 | 63.35% | 184 | 48.42% | 39,275 | 78.82%
KEGG | 16,016 | 20.01% | 8,243 | 20.48% | 7,724 | 19.60% | 43 | 11.32% | 9,948 | 19.97%
Gene number (b) | 78,290 | 97.80% | 39,669 | 98.54% | 38,244 | 97.03% | 361 | 95.00% | 47,186 | 94.70%

a. Number of genes that have been annotated in a corresponding database

b. Number of genes that have been annotated in a corresponding genome/subgenome


Table 20a. Summary of unigenes from B. juncea and B. nigra transcriptomes

Unigene length (bp) | B. juncea: Number | B. juncea: Percentage | B. nigra: Number | B. nigra: Percentage
200-300 | 18,894 | 34.33% | 10,071 | 29.25%
300-500 | 13,145 | 23.88% | 6,020 | 17.48%
500-1000 | 9,850 | 17.90% | 7,615 | 22.11%
1000-2000 | 9,093 | 16.52% | 7,597 | 22.06%
>2000 | 4,059 | 7.37% | 3,132 | 9.10%
Total | 55,041 | 100% | 34,435 | 100.00%

Table 20b. Summary of unigene lengths from B. juncea and B. nigra transcriptomes

Species | Total unigenes length (bp) | N50 (bp) | Average unigenes length (bp)
B. juncea | 41,166,472 | 1,302 | 747.92
B. nigra | 29,912,959 | 1,411 | 868.68


Table 21. Summary of non-coding RNAs and pseudogenes

Class | B. juncea: Number | B. juncea: Family | B. nigra: Number | B. nigra: Family
lncRNA | 21 | 2 | 27 | 21
sRNA | 3,725 | 151 | 80 | 35
tRNA | 2,638 | 56 | 723 | 2
rRNA | 511 | 3 | 62 | 5
miRNA | 1,402 | 830 | 719 | 147
snRNA | 15,418 | 612 | 1,395 | 132
CD-box | 10,265 | 332 | 1,181 | 92
HACA-box | 3,164 | 248 | 78 | 31
scaRNA | 1,189 | 18 | 3 | 1
splicing | 800 | 14 | 133 | 8
Pseudogenes | 14,676 | - | 4,489 | -


Table 22. Statistics of syntenic orthologs missing in Brassica genome/subgenome comparisons

Loss type | BraA vs BjuA | BniB vs BjuB | BjuA vs BraA | BjuB vs BniB | BraA vs BnaA | BolC vs BnaC | BnaA vs BraA | BnaC vs BolC
Whole genes missing DNA sequences | 183 (234) | 141 (189) | 184 (303) | 208 (279) | 81 (86) | 24 (66) | 157 (304) | 162 (276)
Sequence matches on random | 101 | 56 | 199 | 200 | 152 | 301 | 2,303 | 2,624
Transposition | 1,495 | 1,085 | 3,819 | 2,617 | 1,893 | 1,785 | 1,465 | 1,187
Sequence matches outside synteny blocks | 485 | 386 | 865 | 504 | 313 | 356 | 669 | 770
Synteny-excluded by block | 1,135 | 823 | 764 | 718 | 538 | 656 | 542 | 866
Gene is predictable | 159 | 109 | 429 | 534 | 65 | 317 | 77 | 116
Partial loss | 135 (194) | 112 (169) | 349 (845) | 255 (650) | 172 (279) | 370 (679) | 213 (469) | 199 (411)
Pseudogene | 69 (106) | 29 (57) | 167 (583) | 82 (351) | 205 (420) | 364 (721) | 29 (74) | 27 (73)
Gmap failed | 8 | 4 | 25 | 14 | 9 | 27 | 36 | 42
Total potential loss | 3,917 | 2,878 | 7,832 | 5,867 | 3,755 | 4,908 | 5,939 | 6,365

Note: (1) Numbers outside brackets are validated by sequencing reads. Numbers inside brackets are predicted. (2) The gene loss number was calculated from whole genes missing DNA sequences, partial loss and pseudogene. (3) Gmap failed: gene losses predicted by Gmap.


Table 23. Statistics of syntenic regions among A-subgenomes of Brassica

Comparison | Intra-chromosome: Length (Mb) | Intra-chromosome: Percent (%) | Inter-chromosome: Length (Mb) | Inter-chromosome: Percent (%)
BjuA : BraA | 34.18 | 8.50 | 43.44 | 10.80
BnaA : BraA | 8.64 | 2.74 | 1.3 | 0.41

Note: (1) Intra-chromosome: disordered syntenic regions between homologous chromosomes. (2) Inter-chromosome: syntenic regions between nonhomologous chromosomes.


Table 24. Summary of resequencing of the 17 B. juncea accessions

Accession | Sample collection | Variety type | Reads number | Read length (bp) | Base number (bp) | Mapped ratio (%) | Average depth (X) | Coverage ratio (%)
CN04 | China (Zhejiang) | vegetable | 345,104,000 | 101 | 31,059,360,000 | 94.75 | 24 | 90.2
CN18 | China (Zhejiang) | vegetable | 107,030,406 | 101 | 10,810,071,006 | 96.72 | 8 | 85.64
CN40 | China (Zhejiang) | vegetable | 114,726,912 | 101 | 11,587,418,112 | 97.67 | 9 | 85.1
CN46 | China (Sichuan) | vegetable | 110,960,828 | 101 | 11,207,043,628 | 97.01 | 8 | 84.75
CN48 | China (Zhejiang) | vegetable | 106,201,702 | 101 | 10,726,371,902 | 96.81 | 8 | 86.62
CN53 | China (Zhejiang) | vegetable | 109,173,560 | 101 | 11,026,529,560 | 97 | 9 | 85.05
CN58 | China (Sichuan) | vegetable | 107,415,060 | 101 | 10,848,921,060 | 97.18 | 8 | 84.95
CN59 | China (Hebei) | vegetable | 109,780,920 | 101 | 11,087,872,920 | 94.21 | 8 | 82.64
CN78 | China (Zhejiang) | vegetable | 123,577,972 | 101 | 12,357,797,200 | 97.8 | 11 | 92.27
CN79 | China (Zhejiang) | vegetable | 107,560,118 | 101 | 10,863,571,918 | 95.69 | 9 | 83.71
AU213 | Australia | oilseed | 108,148,800 | 101 | 10,923,028,800 | 96.87 | 8 | 83.9
CN02 | China (Ningxia) | oilseed | 102,722,874 | 101 | 10,375,010,274 | 97.4 | 8 | 82.18
CN74 | China (Tibet) | oilseed | 105,557,312 | 101 | 10,661,288,512 | 96.71 | 8 | 83.99
CN77 | China (Tibet) | oilseed | 173,714,344 | 101 | 17,545,148,744 | 96.41 | 12 | 85.77
EU07 | France | oilseed | 100,322,358 | 101 | 10,132,558,158 | 95.99 | 7 | 82.22
EU11 | Ukraine | oilseed | 103,255,648 | 101 | 10,428,820,448 | 95.24 | 7 | 82.38
IN30 | India | oilseed | 93,927,978 | 101 | 9,486,725,778 | 96.99 | 7 | 82.2


Table 25. Summary of SNP variations in 17 B. juncea accessions, 4 B. napus accessions and 26 B. rapa accessions

Species | Samples | Main_usage | Hetero_ratio | Integrity | Total_SNPs | Synonymous_SNPs | Non_synonymous_SNPs
B. juncea | BjuA | vegetable | 7.48% | 86.92% | 1,518,243 | 209,363 | 115,164
B. juncea | CN59 | vegetable | 36.84% | 82.68% | 1,925,241 | 226,224 | 121,531
B. juncea | CN40 | vegetable | 33.02% | 85.98% | 662,198 | 69,891 | 38,632
B. juncea | CN46 | vegetable | 25.49% | 85.38% | 997,120 | 111,394 | 60,297
B. juncea | CN48 | vegetable | 41.91% | 87.75% | 445,939 | 35,286 | 21,666
B. juncea | CN53 | vegetable | 28.45% | 85.54% | 1,121,548 | 119,990 | 65,760
B. juncea | CN58 | vegetable | 27.02% | 85.36% | 949,864 | 102,278 | 56,783
B. juncea | CN04 | vegetable | 26.54% | 91.77% | 1,336,000 | 122,785 | 67,077
B. juncea | CN18 | vegetable | 25.55% | 85.71% | 1,084,829 | 114,249 | 62,565
B. juncea | CN02 | oilseed | 22.07% | 81.83% | 1,518,012 | 170,440 | 93,243
B. juncea | EU07 | oilseed | 20.64% | 81.97% | 1,568,160 | 174,909 | 95,530
B. juncea | AU213 | oilseed | 21.73% | 84.15% | 1,560,846 | 174,249 | 94,803
B. juncea | CN74 | oilseed | 45.43% | 83.69% | 1,881,218 | 210,599 | 115,801
B. juncea | EU11 | oilseed | 28.13% | 81.98% | 1,638,861 | 185,494 | 101,561
B. juncea | IN30 | oilseed | 31.10% | 81.62% | 1,213,381 | 136,492 | 75,111
B. juncea | CN77 | oilseed | 47.78% | 86.76% | 2,097,376 | 216,523 | 119,518
B. juncea | CN79 | vegetable | 19.30% | 84.27% | 1,845,092 | 198,589 | 106,901
B. juncea | CN78 | vegetable | 40.53% | 94.60% | 667,413 | 42,459 | 25,453
B. napus | Darmor-bzh | oilseed | 12.42% | 70.98% | 1,114,029 | 139,660 | 80,137
B. napus | Yudal | oilseed | 16.33% | 92.79% | 1,493,504 | 59,834 | 44,084
B. napus | Bristol | oilseed | 15.49% | 89.90% | 1,478,902 | 38,425 | 29,727
B. napus | Aburamasari | oilseed | 16.41% | 91.20% | 1,448,464 | 56,653 | 42,536
B. napus | Aviso | oilseed | 14.54% | 91.44% | 1,484,373 | 19,957 | 16,498
B. rapa | caizi-1 | ssp.oleifera | 3.94% | 30.42% | 32,045 | 16,509 | 14,263
B. rapa | caizi-2 | ssp.oleifera | 3.33% | 32.59% | 33,716 | 17,332 | 15,248
B. rapa | caizi-3 | ssp.oleifera | 4.23% | 30.50% | 32,113 | 16,526 | 14,217
B. rapa | dabaicai-2 | ssp.pekinensis | 1.08% | 88.05% | 61,209 | 32,366 | 28,134
B. rapa | dabaicai-3 | ssp.pekinensis | 1.59% | 85.16% | 70,622 | 37,301 | 32,146
B. rapa | dabaicai-4 | ssp.pekinensis | 1.49% | 91.58% | 76,857 | 40,556 | 35,104
B. rapa | dabaicai-5 | ssp.pekinensis | 1.71% | 89.95% | 78,357 | 41,440 | 35,528
B. rapa | ouzhouwujing-1 | ssp.rapa(European) | 1.76% | 33.65% | 34,980 | 18,451 | 15,895
B. rapa | ouzhouwujing-2 | ssp.rapa(European) | 1.04% | 28.76% | 29,929 | 15,821 | 13,775
B. rapa | ouzhouwujing-3 | ssp.rapa(European) | 1.31% | 25.38% | 26,453 | 13,944 | 12,150
B. rapa | ouzhouwujing-4 | ssp.rapa(European) | 2.69% | 22.82% | 23,988 | 12,454 | 10,871
B. rapa | ouzhouwujing-5 | ssp.rapa(European) | 1.30% | 28.35% | 29,141 | 15,461 | 13,284
B. rapa | xiaobaicai-1 | ssp.chinensis | 1.20% | 88.34% | 93,831 | 49,387 | 43,266
B. rapa | xiaobaicai-2 | ssp.chinensis | 1.45% | 93.65% | 101,323 | 53,548 | 46,246
B. rapa | xiaobaicai-3 | ssp.chinensis | 30.49% | 77.46% | 86,728 | 32,124 | 28,147
B. rapa | xiaobaicai-4 | ssp.chinensis | 0.67% | 87.68% | 91,055 | 48,517 | 41,871
B. rapa | xiaobaicai-5 | ssp.chinensis | 0.70% | 79.28% | 83,444 | 44,577 | 38,233
B. rapa | yazhouwujing-1 | ssp.rapa(China) | 0.88% | 27.35% | 27,655 | 14,835 | 12,568
B. rapa | yazhouwujing-2 | ssp.rapa(China) | 2.32% | 33.41% | 34,217 | 17,969 | 15,440
B. rapa | yazhouwujing-3 | ssp.rapa(China) | 4.66% | 40.91% | 43,260 | 22,094 | 19,132
B. rapa | yazhouwujing-4 | ssp.rapa(China) | 3.23% | 32.88% | 32,984 | 17,061 | 14,828
B. rapa | yazhouwujing-5 | ssp.rapa(China) | 3.48% | 33.07% | 33,190 | 17,182 | 14,840
B. rapa | youcai-sarson-1 | ssp.tricolaris | 1.75% | 33.94% | 36,041 | 18,979 | 16,416
B. rapa | youcai-sarson-2 | ssp.tricolaris | 0.64% | 33.70% | 35,531 | 18,810 | 16,469
B. rapa | youcai-sarson-3 | ssp.tricolaris | 0.63% | 93.27% | 101,565 | 53,675 | 47,199
B. rapa | youcai-sarson-4 | ssp.tricolaris | 1.30% | 96.01% | 107,567 | 56,871 | 49,244


Table 26a. Selected instances used to validate the assumption that the Ks method cannot be used to estimate the divergence time

Range of Ks | Gene1 | Gene2 | Ks (PAML) | Ks (KaKs_calculator) | Gene length (bp) | Synonymous sites | Substitutions | Synonymous substitutions
0.0045-0.0060 | Bra012510 | BjuA027390 | 0.0058 | 0.0052 | 720 | 192.178 | 1 | 1
0.0045-0.0060 | Bra000412 | BjuA011847 | 0.0053 | 0.0040 | 849 | 251.397 | 16 | 1
0.0045-0.0060 | Bra012311 | BjuA027207 | 0.005 | 0.0046 | 915 | 216.978 | 5 | 1
0.0045-0.0060 | Bra000416 | BjuA011843 | 0.0046 | 0.0043 | 912 | 231.189 | 4 | 1
0.0030-0.0045 | Bra011481 | BjuA003650 | 0.0042 | 0.0040 | 1086 | 252.333 | 2 | 1
0.0030-0.0045 | Bra028624 | BjuA004368 | 0.0037 | 0.0035 | 1074 | 285.792 | 1 | 1
0.0030-0.0045 | Bra011334 | BjuA003498 | 0.0037 | 0.0035 | 1119 | 290.041 | 3 | 1
0.0030-0.0045 | Bra033402 | BjuA029144 | 0.0037 | 0.0034 | 1167 | 293.969 | 3 | 1
0.0015-0.0030 | Bra000878 | BjuA013088 | 0.0027 | 0.0025 | 1665 | 397.661 | 2 | 1
0.0015-0.0030 | Bra006602 | BjuA022956 | 0.0022 | 0.0020 | 1986 | 500.723 | 2 | 1
0.0015-0.0030 | Bra013510 | BjuA002918 | 0.0022 | 0.0020 | 1989 | 491.097 | 2 | 1
0.0015-0.0030 | Bra031704 | BjuA031993 | 0.0016 | 0.0016 | 2583 | 631.875 | 1 | 1
0.0000-0.0015 | Bra005954 | BjuA022654 | 0.0013 | 0.0014 | 3366 | 726.106 | 4 | 1
0.0000-0.0015 | Bra024449 | BjuA029097 | 0.0011 | 0.0011 | 3681 | 919.882 | 2 | 1
0.0000-0.0015 | Bra028276 | BjuA038149 | 0.0007 | 0.0003 | 2826 | 738.721 | 3 | 2.34E-01
0.0000-0.0015 | Bra033356 | BjuA013654 | 0.0003 | NA | 3345 | 803.881 | 4 | 1.29E-05


Table 26b. Selected instances between BraA and BnaA used to validate the assumption that the Ks method cannot be used to estimate the divergence time

Range of Ks | Gene1 | Gene2 | PAML method | PAML Ks | KaKs_calculator method | KaKs_calculator Ks | Length (bp) | Synonymous sites | Substitutions | Synonymous substitutions
0.0045-0.0060 | Bra008081 | BnaA02g16400D | NG | 0.0050 | YN | 0.0050 | 930 | 200.00 | 5 | 1
0.0045-0.0060 | Bra020309 | BnaA02g06990D | NG | 0.0053 | YN | 0.0053 | 831 | 190.45 | 3 | 1
0.0045-0.0060 | Bra009044 | BnaA10g21970D | NG | 0.0049 | YN | 0.0053 | 825 | 187.64 | 2 | 1
0.0045-0.0060 | Bra020161 | BnaA02g05380D | NG | 0.0046 | YN | 0.0045 | 912 | 223.44 | 2 | 1
0.0030-0.0045 | Bra036942 | BnaA10g08820D | NG | 0.0042 | YN | 0.0033 | 1,131 | 304.75 | 1 | 1
0.0030-0.0045 | Bra009533 | BnaA10g26610D | NG | 0.0044 | YN | 0.0046 | 1,080 | 220.36 | 4 | 1
0.0030-0.0045 | Bra009341 | BnaA10g23000D | NG | 0.0036 | YN | 0.0033 | 1,191 | 301.15 | 1 | 1
0.0030-0.0045 | Bra035336 | BnaA02g23350D | NG | 0.0037 | YN | 0.0038 | 1,038 | 261.05 | 3 | 1
0.0015-0.0030 | Bra009058 | BnaA10g22100D | NG | 0.0026 | YN | 0.0024 | 1,680 | 420.86 | 2 | 1
0.0015-0.0030 | Bra009091 | BnaA10g25210D | NG | 0.0022 | YN | 0.0020 | 1,872 | 495.44 | 1 | 1
0.0015-0.0030 | Bra035396 | BnaA04g07390D | NG | 0.0023 | YN | 0.0019 | 2,217 | 517.96 | 4 | 1
0.0015-0.0030 | Bra040547 | BnaA05g33710D | NG | 0.0021 | YN | 0.0021 | 1,917 | 469.05 | 2 | 1
0.0000-0.0015 | Bra022504 | BnaA02g11830D | NG | 0.0013 | YN | 0.0012 | 3,189 | 812.40 | 1 | 1
0.0000-0.0015 | Bra011666 | BnaA01g01450D | NG | 0.0013 | YN | 0.0004 | 9,051 | 2504.36 | 5 | 1
0.0000-0.0015 | Bra011665 | BnaA01g01460D | NG | 0.0010 | YN | 0.0012 | 3,402 | 847.45 | 3 | 1
0.0000-0.0015 | Bra011860 | BnaA01g05510D | NG | 0.0005 | YN | 0.0011 | 4,287 | 904.14 | 9 | 1


Table 27a. Detailed sample information of homoeolog expression dominance in B. juncea

Run | SRA number | Strain | PE/SE | Data (MB) | Tissue | Stage | Treatment
RR1822192 | SRS859889 | B. juncea (Czern) L. | PAIRED | 7,145 | seedlings | two weeks | control
SRR1822193 | SRS859888 | B. juncea (Czern) L. | PAIRED | 6,788 | seedlings | two weeks | salinity
SRR1718914 | SRS794662 | B. juncea (Czern) L. | PAIRED | 3,706 | whole seedlings | 7-days | control
SRR1718916 | SRS794664 | B. juncea (Czern) L. | PAIRED | 2,582 | whole seedlings | 7-days | high temperature
SRR1718918 | SRS794666 | B. juncea (Czern) L. | PAIRED | 1,918 | whole seedlings | 7-days | drought
SRR2953668 | SRS1173974 | resynthesized | SINGLE | 208 | silique walls | after 21 days pollination | -
SRR2953675 | SRS1173967 | resynthesized | SINGLE | 735 | leaves | after 3 days of flowering | -
SRR1269499 | SRX530145 | B. juncea var varuna | PAIRED | 20,295 | mix (a) | young flower buds stage | -
SRR807368 | SRS406672 | B. juncea | PAIRED | 3,815 | seed coat | unknown | -
SRR380274 | SRX108496 | B. juncea var. tumida | SINGLE | 357 | stems | 22 weeks after seeding | -
SRR380273 | SRX108497 | B. juncea var. tumida | PAIRED (PCR) | 3,023 | mix (b) | - | -
SRR380275 | SRX108498 | B. juncea var. tumida | SINGLE | 340 | stems of Yong’an1 | 18 weeks after seeding | -
SRR380276 | SRX108499 | B. juncea var. tumida | SINGLE | 323 | stems of Yong’an2 | 20 weeks after seeding | -
SRR380277 | SRX108500 | B. juncea var. tumida | SINGLE | 351 | stems of Yong’an3 | 22 weeks after seeding | -
SRR380278 | SRX108501 | B. juncea var. tumida | SINGLE | 338 | stems of Yong’an4 | 25 weeks after seeding | -
T84-66 | PRJNA285130 | B. juncea var. tumida | PAIRED | 5,384 | mix (c) | - | -

Note:
mix (a): a pooled sample from inflorescence, leaf, pod and seed
mix (b): a pooled sample from yongan 1-4
mix (c): a pooled sample from root, inflorescence, stem, seed and leaf


Table 27b. Homoeolog expression dominance in B. juncea varieties

No. | Sample name | Gene number with 2 FC expression: BjuA | Gene number with 2 FC expression: BjuB | Gene number with 2 FC expression: Non-dominance | Non-expression genes | Binomial test p-value | Percentage of dominance genes in all genes (80,050)
1 | Indian Control | 2,813 | 3,353 | 6,921 | 1,575 | 6.49E-12 | 15.41%
2 | Indian High Temperature | 3,224 | 3,810 | 5,967 | 1,661 | 2.97E-12 | 17.57%
3 | Indian Drought | 3,030 | 3,614 | 6,400 | 1,618 | 8.25E-13 | 16.60%
4 | Indian Variant Salinity | 2,880 | 3,403 | 5,905 | 2,474 | 4.42E-11 | 15.70%
5 | Indian Variant Control | 2,810 | 3,358 | 6,128 | 2,366 | 3.18E-12 | 15.41%
6 | Sichuan Yellow | 3,163 | 3,733 | 5,212 | 2,554 | 7.09E-12 | 17.23%
7 | Varuna | 2,892 | 3,432 | 7,712 | 626 | 1.19E-11 | 15.80%
8 | T84-66 | 2,606 | 3,026 | 8,226 | 804 | 2.33E-08 | 14.07%
9 | S.AABB | 3,602 | 3,558 | 4,136 | 3,366 | 6.11E-01 | 17.89%
10 | L.AABB | 3,754 | 3,382 | 4,879 | 2,647 | 1.12E-05 | 17.83%
11 | Daye3bianzhong | 3,024 | 3,574 | 6,035 | 2,029 | 1.36E-11 | 16.48%
12 | yongan | 3,032 | 3,496 | 5,932 | 2,202 | 9.88E-09 | 16.31%
13 | yongan1 | 2,967 | 3,445 | 5,985 | 2,265 | 2.53E-09 | 16.02%
14 | yongan2 | 3,011 | 3,513 | 5,842 | 2,296 | 5.45E-10 | 16.30%
15 | yongan3 | 2,853 | 3,382 | 6,036 | 2,391 | 2.22E-11 | 15.58%
16 | yongan4 | 2,788 | 3,305 | 6,141 | 2,428 | 3.73E-11 | 15.22%

Note: FC, fold change. A binomial test was applied to assess the significance of the difference in the number of dominant genes between subgenomes.
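The binomial test named in the note to Table 27b asks whether the BjuA-dominant and BjuB-dominant counts differ from an even 50:50 split. The sketch below is not the authors' implementation: it assumes a two-sided test with success probability 0.5 and uses the normal approximation to the binomial (via `math.erfc`) rather than an exact test, so its p-values should only match the tabulated ones in order of magnitude. The function and variable names are illustrative.

```python
import math


def binomial_two_sided_p(k: int, n: int, p: float = 0.5) -> float:
    """Two-sided p-value for k successes in n trials (normal approximation).

    Assumption: under the null hypothesis each dominant gene pair is equally
    likely to be BjuA- or BjuB-dominant (p = 0.5).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    mean = n * p
    sd = math.sqrt(n * p * (1.0 - p))
    z = abs(k - mean) / sd
    return math.erfc(z / math.sqrt(2.0))


# Counts copied from Table 27b (BjuA-dominant, BjuB-dominant).
p_indian_control = binomial_two_sided_p(2813, 2813 + 3353)  # row 1
p_s_aabb = binomial_two_sided_p(3602, 3602 + 3558)          # row 9
```

Under these assumptions, row 1 (Indian Control) gives a p-value on the order of 1E-12 and row 9 (S.AABB) about 0.6, in line with the values reported in the table.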


Table 31. Selective sweep analysis for B. juncea

#Chr | Chr_len (bp) | Fst | π vegetable | π oilseed | θπ
J01 | 45,320,972 | 0.25049 | 0.00127 | 0.00121 | 1.04845
J02 | 25,468,329 | 0.24086 | 0.00166 | 0.00145 | 1.14438
J03 | 43,957,779 | 0.22047 | 0.00144 | 0.00131 | 1.09452
J04 | 33,175,529 | 0.29815 | 0.00142 | 0.00131 | 1.08362
J05 | 38,841,303 | 0.28376 | 0.00138 | 0.00099 | 1.38773
J06 | 38,054,634 | 0.25178 | 0.00151 | 0.00121 | 1.25646
J07 | 28,547,970 | 0.19727 | 0.00121 | 0.00115 | 1.04697
J08 | 29,345,952 | 0.26835 | 0.00166 | 0.00170 | 0.97189
J09 | 62,778,564 | 0.18350 | 0.00156 | 0.00146 | 1.06902
J10 | 22,354,451 | 0.19295 | 0.00166 | 0.00145 | 1.14481
J11 | 32,467,241 | 0.25257 | 0.00099 | 0.00102 | 0.96451
J12 | 60,369,535 | 0.20894 | 0.00118 | 0.00126 | 0.93017
J13 | 83,567,156 | 0.22871 | 0.00088 | 0.00116 | 0.76154
J14 | 28,343,526 | 0.23934 | 0.00087 | 0.00117 | 0.74025
J15 | 50,573,455 | 0.16598 | 0.00112 | 0.00121 | 0.91916
J16 | 18,736,220 | 0.26912 | 0.00133 | 0.00180 | 0.74141
J17 | 44,223,734 | 0.21805 | 0.00113 | 0.00127 | 0.89606
J18 | 77,673,196 | 0.22192 | 0.00079 | 0.00097 | 0.81239
Average | 42,433,308 | 0.23290 | 0.00128 | 0.00128 | 1.00074
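The θπ column of Table 31 is not defined in the table itself. Its values are consistent with the ratio of nucleotide diversity in the vegetable group to that in the oilseed group (π vegetable / π oilseed), presumably computed from unrounded π values. The sketch below works under that assumption only; the function name is illustrative, and small deviations from the tabulated θπ are expected because the inputs are the rounded five-decimal π values.

```python
def pi_ratio(pi_vegetable: float, pi_oilseed: float) -> float:
    """Ratio of nucleotide diversity between groups (assumed meaning of θπ)."""
    if pi_oilseed <= 0:
        raise ValueError("pi_oilseed must be positive")
    return pi_vegetable / pi_oilseed


# Rounded π values copied from Table 31: chromosome -> (π vegetable, π oilseed).
table31 = {"J05": (0.00138, 0.00099), "J14": (0.00087, 0.00117)}
ratios = {chrom: pi_ratio(veg, oil) for chrom, (veg, oil) in table31.items()}
```

Under this reading, a ratio above 1 (for example J05) indicates lower diversity in the oilseed group on that chromosome, and a ratio below 1 (for example J14) indicates lower diversity in the vegetable group.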


Table 34. Primers used for gene loss validation

Gene ID (Non-loss) | Forward primer | Reverse primer | Gene ID (Loss) | Forward primer | Reverse primer
Bra039553/BjuA005100 | ATGAAGCAAATATTTGGGAAATTA | ACTAGCAGAAGTTTTCCCAGT | Bra006842 | ATGGACAACAACAAAAGGAAAG | TCATGCTCCGCTCTTTGGTC
Bra001721/BjuA012618 | ATGTCTTCTGATTACTCACCTT | GCTGTTGACTCTTTCTTCAGG | Bra007575 | ATGGACATGATCAGTCAATTGT | CTAAACCCAATAACACCACCC
Bra010005/BjuA023812 | ATGGAGGAAGTTGAAGCTGC | GGTGTGAGCTGACTGGGAG | Bra007462 | ATGGCGAAGAGTTTGTGCATC | TCAGAGAAGAAGGCGTAGAC
Bra005653/BjuA001706 | ATGGAGTTTGTGAAATCGTTGG | GTGACCATCTGTTCTCATCAGAA | Bra001370 | GGGTCTATGAAGTCGGAGGA | GGATCGGTCGAAATTATGCT
Bra025928/BjuA022182 | ATGAGATCCTCATCGACTTCT | CAAGGAGATCAAAGCATCGTC | Bra024553 | GTAGGGAGGTGATCCATTCG | TTAACAGGATCCACGAGCAC
Bra035936/BjuA032524 | ATGGGTGGAGGGTCAAAGAG | GGCAAGACCGTTTCTAGTCT | Bra014930 | AAGATGCTTGGATGAGAGTGG | CACTTCGAATTAACCTTTCTTGG
Bra008833/BjuA039310 | ATGTTACCAAAGTTCGATCCG | TCGACTGGAATGCAGCTCAT | Bra022306 | TGCTGGTGCTCTTACCAATC | CCGAGTTGCTTCCATCAGTA
Bra024822/BjuA024592 | ATGGCGAAAAATCACGGCGT | ATCATCATCATCCATTTCAAAGC | Bra016859 | AGGAGAATACGAGGAAGGCA | TAGCTGGCAAACATGTCCTC
Bra041032/BjuA015249 | ATGATTTCCTTTGTTGGTCGAG | TTAATTTGCTTTAGCCTTTGGAG | Bra018473 | AACTGTGCAGCAGGTTTGAC | CCCAACCATATTTCAACCAA
Bra000047/BjuA010390 | ATGGCCGCAATCAGTTTCTC | CTACTTAAACATATCGGCAAGT | Bra019268 | TCACAGGGATGGCAACTTTA | CTGTTGATGCTCCGATGAAT
BniB019213/BjuB024015 | ATGGGCTCCCCTGTCTCGT | CTCATCATCTCATTTTCAATCC | BniB002837 | GATGCTTCTTGCCTCATCAA | CCCAACACAGCAACGTTATC
BniB048496/BjuB021004 | ATGCCTGTGTCCGTACATTC | GACACGTCGGTACTCGTCT | BniB011344 | ATGGCTCATGATGATTATGTAAA | TCATGCTTTCGTCCTGCGC
BniB006554/BjuB010456 | ATGAGAAACGTAGGGAGTTCG | GAACCTTGTGTTTGATGGTCG | BniB013229 | TGAACGTCTCTCCGTATTGG | TTGGCTGAGAAGATGACGAG
BniB002930/BjuB003659 | AAACCTTCCGCAGATTCTAGC | TTAAACCATCTTTGTCACCGC | BniB016126 | ATCACAGATGGAGCAGCTTG | TGGGAAACGATGGATGACTA
BniB014100/BjuB007950 | ATGAAACTAGAGCTAATCCTCG | AGAGATATTACGAGTAACGTCT | BniB025882 | AAACATGATTTCCGGAGGAG | TTGCGGCTAGAATTTGGATA
BniB011005/BjuB028236 | ATGGCACAAAAACTGGAAGCCA | AGTGAAGGGCAGAACATGGA | BniB020222 | GGTTCTCCTGGTTCCTGTGT | ACCCTTTGGTTCAAGCTCAC
BniB010483/BjuB035357 | ATGTTTCCCAGATTAGGTCGA | TGCCTTTGCTGCTTTTAACTC | BniB025866 | GAGAGAGCTCAGGCCAAGTT | GCCAAGACTCTTCCTACGCT
BniB000067/BjuB037529 | ATGTCGTCGTCTTCTCCGAG | TCATGAGTCTGTAGCAGTAATAG | BniB025956 | CAGGAAGAGTTGCTGTGGAA | CACTACCTCCGAAGCTGTCA
BniB000071/BjuB037537 | ATGGTGGCAGAAGCCATGAG | TTAATACAGATAGATTTTGGTTTCC | BniB019567 | TGACGATTGATCTTGATGCAG | CCTTTGTTCTCAAAGTTCGGA
BniB000074/BjuB037542 | ATGAAGTCATTAGAGAGAGTGG | CTACCAGAACCGGTCTTTATTG | BniB024798 | CAAACTCGGCAGAAATGAGA | CCATCGTTCGATTCCTCTTT
