

Supplemental Information

The Tartary buckwheat genome provides insights into rutin biosynthesis and abiotic stress tolerance

Lijun Zhang1,2,3‡, Xiuxiu Li4,5‡, Bin Ma4‡, Qiang Gao4‡, Huilong Du4,5‡, Yuanhuai Han2,3,6, Yan Li4, Yinghao Cao4, Ming Qi4, Yaxin Zhu7, Hongwei Lu4,5, Mingchuan Ma1,2,3, Longlong Liu1,2,3, Jianping Zhou1,2,3, Chenghu Nan1,2,3, Yongjun Qin1,2,3, Jun Wang8*, Lin Cui1,2,3*, Huimin Liu1,2,3*, Chengzhi Liang4,5*, and Zhijun Qiao1,2,3*

1Institute of Crop Germplasm Resources Research, Shanxi Academy of Agricultural Sciences, Taiyuan 030031, China
2Key Laboratory of Crop Gene Resources and Germplasm Enhancement on Loess Plateau, Ministry of Agriculture, Taiyuan 030031, China
3Shanxi Key Laboratory of Genetic Resources and Genetic Improvement of Minor Crops, Taiyuan 030031, China
4State Key Laboratory of Plant Genomics, Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, Beijing 100101, China
5College of Life Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
6College of Agronomy, Shanxi Agricultural University, Taiyuan 030801, China
7State Key Laboratory of Microbial Resources, Institute of Microbiology, Chinese Academy of Sciences, Beijing, China
8College of Marine Sciences, South China Agricultural University, Guangzhou 510642, China

‡These authors contributed equally to this work.

*Correspondence to: C.L. ([email protected]), Z.Q. ([email protected]), J.W. ([email protected]), H.L. ([email protected]), or L.C. ([email protected])


Supplemental Figures

Supplemental Figure 1: Flowchart of the Tartary buckwheat (F. tataricum) genome assembly process. Ft, F. tataricum.


Supplemental Figure 2: The k-mer frequency distribution curve (k = 17) of Illumina short reads of the Tartary buckwheat genome. The x-axis indicates the number of times a k-mer occurs in the genome. The y-axis indicates how many different kinds of k-mers have a particular multiplicity. The three peaks located at multiplicities of 138, 276, and 414 indicate that many of the k-mers may come from duplicated regions of high sequence similarity.
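The quantity plotted in this figure can be illustrated with a minimal Python sketch. It is not the tool used by the authors: the function name `kmer_spectrum` and the toy reads are hypothetical, and real k-mer counters usually also merge each k-mer with its reverse complement, which this sketch does not do. Note that the reported peaks fall at 138 and at two and three times that value (276 and 414).

```python
from collections import Counter

def kmer_spectrum(reads, k=17):
    """Map each multiplicity (x-axis) to the number of distinct k-mers
    that occur that many times in the reads (y-axis)."""
    counts = Counter()
    for read in reads:
        # Slide a window of width k along the read.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    # Histogram of the per-k-mer counts.
    return Counter(counts.values())

# Toy input with k = 3; the figure itself uses k = 17 on real reads.
toy_reads = ["ACGTACGT", "CGTACG"]
spectrum = kmer_spectrum(toy_reads, k=3)
```

In the toy input, ACG and CGT each occur three times and GTA and TAC each occur twice, so the spectrum has two distinct k-mers at multiplicity 3 and two at multiplicity 2.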

Supplemental Figure 3: Hi-C linkage density heat map of assembled contigs. The numbers on the x-axis and y-axis indicate the number of contigs that are clustered. (A) The contigs were clustered into 8 chromosomes (2n=16). (B) The contigs were clustered into 16 clusters for testing purposes. Note that many links are observable between the 16 clusters in (B), indicating that they could be connected further into larger clusters.


Supplemental Figure 4: Alignment of hybrid genome maps to the assembled Tartary buckwheat pseudomolecules. The green and blue horizontal bars represent the pseudomolecules and genome maps, respectively. Vertical blue or red lines represent the aligned boundaries between the pseudomolecules and the genome maps. Unaligned regions in the genome maps indicate missing sequences in the pseudomolecules. These alignments show the high quality of the pseudomolecules, apart from the missing sequences that were not anchored to the pseudomolecules because of their small contig size.


Supplemental Figure 5: Shared and unique gene families. Venn diagram representation of shared/unique gene families among F. tataricum, B. vulgaris, O. sativa, and A. thaliana. Only gene families that have at least two members, from either different species or the same species, are counted here.


Supplemental Figure 6: Intra-genome dot plot comparison of Tartary buckwheat showing the collinear blocks derived from ancient whole-genome duplications (WGDs). Each block contains at least 3 collinear gene pairs. The dot plot was generated at the CoGe site (https://genomevolution.org/r/ri2a). Note that blue-colored dots represent possibly ‘older’ syntenic blocks that originated from the WGD shared with Arabidopsis.

Supplemental Figure 7: Examples of collinear blocks within the Tartary buckwheat genome. (a) A block between chromosomes 1 and 3 (https://genomevolution.org/r/rkgu). (b) A block between chromosomes 2 and 8 (https://genomevolution.org/r/rkh4). (c) A block between chromosomes 3 and 7 (https://genomevolution.org/r/rkh2). Each dashed line represents a chromosome segment. The colored widgets represent genes (red, homologous genes; green, non-homologous genes). The red cross lines indicate homologous gene pairs. The images were generated at the CoGe web site.


Supplemental Figure 8: The phylogenetic tree of the R2R3-MYB transcription factor family. The genes in Tartary buckwheat are labelled in orange and the genes in Arabidopsis in blue. The expression values in log(FPKM) of the Tartary buckwheat genes in five tissues are shown on the right of the tree. Several subgroups that are not present in Arabidopsis are designated as ‘–like’ according to their branch vicinity to the closest named subgroup.

Supplemental Figure 9: Phylogenetic trees of ALMT (A) and MATE (B) families in Tartary buckwheat, sugar beet, Arabidopsis, and rice showing the expansion of Tartary buckwheat genes.


Supplemental Tables

Supplemental Table 1: Summary of Tartary buckwheat Pinku1 DNA sequencing data

Sequence type     Platform             LIS      Run/Sample  ARL (bp)  RD (Gb)  Depth*
WGS               Illumina HiSeq 2000  ~500 bp  2           2x100     28.4     ~56.8x
WGS               Illumina HiSeq 2500  ~500 bp  2           2x125     22.3     ~44.6x
WGS               Illumina MiSeq       ~500 bp  3           2x250     11.1     ~22.2x
WGS               Illumina HiSeq 2500  ~550 bp  2           2x259     25.6     ~51.2x
WGS               PacBio SMRT P5-C3    ~20 kb   32          5,930     15.4     ~30.8x
Fosmid mate-pair  Illumina HiSeq 2500  ~36 kb   2           2x125     12.4     ~24.8x
Fosmid GBS tags   Illumina HiSeq 2500  ~350 bp  1728**      2x125     45.3     -
Hi-C              Illumina HiSeq X 10  ~400 bp  1           2x150     97.5     ~195x

LIS: Library insert size; ARL: Average read length; RD: Raw data.
*Based on a genome size of 500 Mb.
**In total, 1,728 pools of fosmid clones were sequenced using the GBS method to generate sequence tags of 2x125 bp. Each pool contained ~1000 fosmid clones, with an average insert size of 36 kb.
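The Depth column follows directly from the footnote: raw data divided by the assumed 500 Mb genome size. A small Python sketch of that arithmetic is given here; the function and variable names are illustrative only.

```python
GENOME_SIZE_GB = 0.5  # 500 Mb, the genome size assumed in the table footnote

def sequencing_depth(raw_data_gb, genome_size_gb=GENOME_SIZE_GB):
    """Fold coverage = raw data (Gb) / genome size (Gb)."""
    return raw_data_gb / genome_size_gb

# Raw data (Gb) of the four Illumina WGS rows of the table.
illumina_wgs_gb = [28.4, 22.3, 11.1, 25.6]
illumina_total_gb = sum(illumina_wgs_gb)
illumina_depth = sequencing_depth(illumina_total_gb)
pacbio_depth = sequencing_depth(15.4)
```

The four Illumina WGS rows sum to 87.4 Gb, or roughly 175x at this genome size, and the PacBio row gives 30.8x, in line with the values reported in the table.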

Supplemental Table 2: Estimate of Tartary buckwheat genome size.

N            L         K   B              D    G (genome size)
503,829,482  126.4796  17  1,537,299,651  111  487,617,641
503,829,482  126.4796  19  2,409,210,983  107  488,280,260
503,829,482  126.4796  21  2,803,126,220  105  484,232,511
503,829,482  126.4796  23  3,045,499,920  102  486,219,422
503,829,482  126.4796  25  3,242,088,249  100  483,901,337
503,829,482  126.4796  27  3,415,530,262  97   486,691,059
503,829,482  126.4796  29  3,572,645,712  95   484,676,403
503,829,482  126.4796  31  3,708,214,479  93   482,806,781

G = (N × (L - K + 1) - B) / D, where N is the total number of reads, L is the average length of reads, K is the k-mer length, B is the number of k-mers of low frequency (prior to the first valley before the first peak) that were discarded, G is the genome size, and D is the overall depth estimated from the k-mer distribution (pkdepth).
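As a worked check of the formula, the following Python sketch evaluates it for the K = 17 and K = 31 rows of the table. The function name is illustrative; because L is reported to only four decimal places, the computed values agree with the tabulated G to within a few hundred base pairs rather than exactly.

```python
def genome_size(n_reads, avg_read_len, k, low_freq_kmers, peak_depth):
    """G = (N * (L - K + 1) - B) / D, using the symbols of the table."""
    return (n_reads * (avg_read_len - k + 1) - low_freq_kmers) / peak_depth

# Rows K = 17 and K = 31 of the table.
g17 = genome_size(503_829_482, 126.4796, 17, 1_537_299_651, 111)
g31 = genome_size(503_829_482, 126.4796, 31, 3_708_214_479, 93)
```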


Supplemental Table 3: Summary of the Tartary buckwheat genome assembly.

        Contig                Scaffold**            Hybrid scaffold***
        Size (bp)    Number   Size (bp)    Number   Size (bp)    Number
N95     12,170       2,487    13,170       1,564    1,121,436    82
N90     41,660       1,233    95,265       578      1,708,292    66
N80     154,621      688      348,740      322      2,965,902    46
N70     276,512      455      592,742      213      4,631,247    34
N60     398,514      307      852,538      143      6,305,765    25
N50     550,707      203      1,178,148    93       7,467,744    19
Min     1,000        -        1,000        -        176,878      -
Max     6,643,492    -        13,795,379   -        18,331,331   -
Total*  489,315,172  8,778    498,723,462  7,788    453,060,345  114

*GC content: 37.8%
**Scaffolds were obtained by linking the contigs using fosmid mate-paired reads.
***Hybrid scaffolds were obtained by connecting the contigs with genome maps; contigs not connected by maps were not counted.
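The N50 to N95 rows use the conventional Nx statistic. The source does not spell out the definition, so the Python sketch below assumes the standard one: sort the sequences from longest to shortest and report the length (Size) and rank (Number) of the sequence at which the running total first reaches x% of the assembly. The function name and the toy lengths are hypothetical.

```python
def nx(lengths, x=50):
    """Return (Nx size, number of sequences needed to reach x% of the total)."""
    ordered = sorted(lengths, reverse=True)
    threshold = sum(ordered) * x / 100
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return length, count
    raise ValueError("lengths must be non-empty")

# Toy assembly of five contigs totalling 30 bp.
toy_lengths = [10, 8, 6, 4, 2]
n50 = nx(toy_lengths, 50)
n90 = nx(toy_lengths, 90)
```

For the toy contigs, half of the total (15 bp) is reached at the second contig, so the N50 size is 8 with a count of 2; 90% (27 bp) is reached at the fourth contig, giving an N90 size of 4 with a count of 4.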

Supplemental Table 4: Summary of Tartary buckwheat pseudomolecules.

Chr    Length (bp)  Gene number  Gap bases (N)  N (%)
Ft1    68,031,765   5,280        1,747,090      2.57
Ft2    61,235,386   3,971        2,561,865      4.18
Ft3    57,706,077   4,320        1,443,627      2.50
Ft4    56,655,744   3,214        2,326,312      4.11
Ft5    53,883,329   3,601        2,218,410      4.12
Ft6    52,287,906   3,417        1,779,172      3.40
Ft7    51,545,819   3,654        1,319,936      2.56
Ft8    49,982,843   3,708        1,571,964      3.15
Total  451,328,869  31,165       14,968,376     3.32

Supplemental Table 5: The assembled BioNano genome maps of Tartary buckwheat.
See additional Excel file.


Supplemental Table 6: Summary of Tartary buckwheat cv. Pinku1 RNA-seq data

Platform            Organ       ARL (bp)  RD (Gb)  Mapping rate (%)
Illumina HiSeq2500  Root        2x144     15.1     89.03
Illumina HiSeq2500  Young seed  2x144     15.2     93.66
Illumina HiSeq2500  Flower      2x144     14.8     92.86
Illumina HiSeq2500  Young stem  2x144     14.2     93.28
Illumina HiSeq2500  Leaf        2x144     16.1     94.44
Total               -           -         75.4     92.65

ARL: Average read length; RD: Raw data. Mapping rate: percentage of reads mapped to the genome.

Supplemental Table 7: Estimate of error rates in the Pinku1 assembly

Sequence assembly  Sample DNA/RNA  SNP     Indel    Covered length (bp)  All length (bp)  Error rate
SMRT assembly      Illumina WGS    28,073  152,325  467,607,623          471,887,773      0.039%*
Illumina assembly  Illumina WGS    1,012   259      18,054,964           18,079,980       0.007%
Whole genome       Root RNA        7,346   13,302   235,548,356          489,315,172      0.009%
Whole genome       Young seed RNA  6,127   12,857   239,134,927          489,315,172      0.008%
Whole genome       Leaf RNA        6,125   13,468   259,662,409          489,315,172      0.008%
Whole genome       Flower RNA      5,898   12,068   209,527,988          489,315,172      0.009%
Whole genome       Young stem RNA  7,733   12,995   224,203,066          489,315,172      0.009%

*The majority of the errors come from small indels in non-genic regions. Many GC-biased regions could not be sequenced by the Illumina platform, which left regions of the reference genome uncovered by Illumina short-read data.
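The table does not state how the error rate was computed. One reading that reproduces every reported value is the number of SNPs plus indels divided by the covered length; the Python sketch below works under that assumption, and the function name is illustrative.

```python
def error_rate_percent(snps, indels, covered_bp):
    """Assumed definition: (SNPs + indels) / covered length, as a percentage."""
    return 100 * (snps + indels) / covered_bp

# SMRT assembly, Illumina assembly, and root RNA rows of the table.
smrt = error_rate_percent(28_073, 152_325, 467_607_623)
illumina = error_rate_percent(1_012, 259, 18_054_964)
root_rna = error_rate_percent(7_346, 13_302, 235_548_356)
```

Rounded to three decimal places, these give 0.039%, 0.007%, and 0.009%, matching the corresponding rows.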

Supplemental Table 8: Sequence coverage of known Tartary buckwheat genes by the assembled Pinku1 genome

GenBank ID  mRNA length  Covered length  Coverage  Identity
GU169468.1  1026         1026            100.00%   100.00%
HM852753.1  1188         1188            100.00%   100.00%
HQ003249.1  599          599             100.00%   99.33%
HQ003253.1  830          830             100.00%   99.04%
HQ829975.1  450          450             100.00%   99.11%
HQ833211.1  210          210             100.00%   100.00%
HQ833212.1  231          231             100.00%   96.10%
JF274262.1  1008         1008            100.00%   99.90%
JF313345.1  711          711             100.00%   100.00%
JF313347.1  759          759             100.00%   99.87%
JF313349.1  624          624             100.00%   99.52%
JN605357.1  843          843             100.00%   99.76%
JN605358.1  745          745             100.00%   99.87%
KC404848.1  1240         1240            100.00%   100.00%
KC404850.1  1524         1524            100.00%   99.61%
KC404851.1  1555         1555            100.00%   100.00%
KC404853.1  1173         1173            100.00%   99.66%
KC417045.1  264          264             100.00%   95.83%
KC571228.1  1093         1093            100.00%   99.18%
KC571230.1  309          309             100.00%   100.00%
KC571231.1  379          379             100.00%   100.00%
KC571232.1  939          939             100.00%   99.89%
KC571234.1  448          448             100.00%   98.66%
KC571235.1  855          855             100.00%   100.00%
KC571237.1  1134         1134            100.00%   100.00%
KJ139980.1  1110         1110            100.00%   100.00%
KJ586579.1  516          516             100.00%   100.00%
KM588379.1  876          876             100.00%   99.66%
KM588380.1  912          912             100.00%   99.45%
KM658320.1  750          750             100.00%   99.73%
KM658321.1  750          750             100.00%   99.60%
KP252134.1  528          528             100.00%   100.00%
KP260662.1  444          444             100.00%   100.00%
KR072701.1  699          699             100.00%   100.00%
KT284884.1  1179         1179            100.00%   100.00%
KT284885.1  1182         1182            100.00%   100.00%
KT285529.1  777          777             100.00%   100.00%
KT285530.1  804          804             100.00%   99.88%
KT285531.1  855          855             100.00%   100.00%
KT285532.1  819          819             100.00%   100.00%
KT285533.1  798          798             100.00%   100.00%
KT285534.1  1116         1116            100.00%   100.00%
KT285535.1  1095         1095            100.00%   100.00%
KT285536.1  804          804             100.00%   100.00%
KT359569.1  1059         1059            100.00%   100.00%
KT737454.1  1761         1761            100.00%   99.38%
KT737455.1  1536         1536            100.00%   99.67%
KU162971.1  951          951             100.00%   100.00%
KX059426.1  1035         1035            100.00%   99.71%
KX216512.1  1341         1341            100.00%   99.85%
KX216513.1  1500         1500            100.00%   99.67%
KX216514.1  1413         1413            100.00%   100.00%
KX262908.1  1434         1434            100.00%   100.00%
KX262909.1  1470         1470            100.00%   100.00%
KY643823.1  237          237             100.00%   100.00%
KJ130961.1  1430         1429            99.93%    100.00%
KU296217.1  957          955             99.79%    100.00%
KC571229.1  840          838             99.76%    100.00%
KM362863.1  2007         2002            99.75%    99.95%
KC571236.1  451          449             99.56%    100.00%
KP303383.1  942          937             99.47%    100.00%
DQ849083.1  1758         1747            99.37%    98.45%
EU715255.1  1463         1453            99.32%    100.00%
KF955601.1  3398         3364            99.00%    99.67%
FJ456858.1  1369         1352            98.76%    99.78%
KF938585.1  1644         1623            98.72%    99.82%
KM271986.1  1770         1745            98.59%    99.77%
JX500528.1  1757         1732            98.58%    99.42%
AY044918.1  771          760             98.57%    98.07%
HQ828144.1  294          289             98.30%    97.92%
GU984045.1  2052         2014            98.15%    99.80%
HQ003250.1  697          682             97.85%    98.68%
HQ840696.1  834          812             97.36%    99.75%
KC404849.1  1581         1538            97.28%    96.94%
GU985519.1  1185         1150            97.05%    99.83%
HM587134.1  669          649             97.01%    99.23%
HQ654088.1  760          736             96.84%    99.86%
JX401285.1  1283         1240            96.65%    96.77%
GU388434.1  594          574             96.63%    99.65%
HQ829976.1  295          285             96.61%    98.95%
HQ844967.1  494          477             96.56%    97.69%
HQ003254.1  517          499             96.52%    99.40%
KC571233.1  599          578             96.49%    99.31%
KC571227.1  747          717             95.98%    99.44%
HQ844968.1  369          352             95.39%    98.01%
HQ828145.1  1135         1080            95.15%    93.76%
HQ003251.1  507          482             95.07%    99.59%
HQ833213.1  218          203             93.12%    98.52%
AY335159.1  328          299             91.16%    100.00%
JF769134.1  365          317             86.85%    99.05%
HQ003252.1  1340         1105            82.46%    99.28%
HQ202162.1  189          150             79.37%    100.00%
HQ844965.1  264          183             69.32%    98.91%
HQ844964.1  326          167             51.23%    96.41%
HQ844966.1  421          198             47.03%    95.96%

Supplemental Table 9: Features of the Tartary buckwheat genome

Feature type                Number (median/average)
Gene number                 33,366
Max gene length (bp)        75,902
Min gene length (bp)        200
Max CDS length (bp)         13,896
Gene length (bp)            1,838/2,582
mRNA length (bp)            1,006/1,187
CDS length (bp)             720/935
Protein length (AA)         239/311
Exon length (bp)            154/244
Intron length (bp)          132/355
5' UTR length (bp)          86/154
3' UTR length (bp)          114/174
Exon number per transcript  3.0/4.9
Transcript number per gene  1.0/1.1

Supplemental Table 10: Functional annotation of the predicted Tartary buckwheat genes.

             Number  Percent (%)
InterPro     25,527  76.51
GO           18,947  56.79
Pathway      4,554   13.65
Pfam         24,301  72.83
Homologous*  21,667  64.94
Annotated    32,775  98.23
Unannotated  651     1.77
Total        33,366  100

*Compared with Arabidopsis, tomato, potato, and sugar beet.

Supplemental Table 11: Mapping summary of RNA-seq data to the Tartary buckwheat genes

Sample      Expressed gene #  Unexpressed gene #  Genome coverage
Root        26,125            7,241               48.07%
Young seed  25,835            7,531               48.81%
Leaf        26,614            6,752               53.00%
Flower      23,492            9,874               42.76%
Young stem  25,552            7,814               45.76%
Total (≥1)  28,681            4,685               71.42%

Supplemental Table 12: Classification of repetitive DNA in the Tartary buckwheat genome

Repeat type                   Length (bp)  Percentage of genome
Class I: Retrotransposons     213,719,047  43.68%
  LTR-Retrotransposons        189,333,482  38.69%
    LTR/Copia                 27,318,301   5.58%
    LTR/Gypsy                 149,363,161  30.52%
    LTR-other                 12,652,020   2.59%
  Non-LTR Retrotransposons    24,385,565   4.98%
    SINE                      2,270,246    0.46%
    LINE                      22,115,319   4.52%
Class II: DNA Transposons     10,460,677   2.14%
  EnSpm/CACTA                 1,266,179    0.26%
  Harbinger                   1,084,318    0.22%
  Helitron                    797,777      0.16%
  MuDR                        3,309,724    0.68%
  Tc1/Mariner                 443,588      0.09%
  hAT                         2,958,292    0.60%
  DNA-other                   600,799      0.12%
Low Complexity                1,367,790    0.28%
Tandem repeats                9,354,311    1.91%
Unclassified                  13,982,749   2.85%
Total content                 249,371,389  50.96%

Supplemental Table 13: Summary of predicted non-coding RNAs in Tartary buckwheat.

Type          Copy number  Average length (bp)  Total length (bp)
miRNA         278          125                  34,858
tRNA          1,395        76                   105,818
rRNA
  18S         212          2,004                424,838
  28S         34           3,102                105,455
  5.8S        164          147                  24,106
  5S          45           122                  5,474
snRNA
  C/D-box     305          111                  33,713
  H/ACA-box   67           113                  8,925
  Splicing    146          141                  20,553


Supplemental Table 14: Statistics of gene families of Tartary buckwheat and 10 other species

Species          Gene number  Genes in families  Unclustered genes  Families  Unique families  Single-copy families  Average genes per family
A. thaliana      27,416       23,466             3,950              13,553    624              9,724                 1.73
B. vulgaris      26,899       19,013             7,886              13,403    563              11,106                1.42
T. cacao         46,143       39,011             7,132              15,861    1,281            12,181                2.46
G. max           54,175       44,973             9,202              15,847    1,444            5,558                 2.84
O. sativa        55,798       42,657             13,141             16,929    1,844            12,096                2.52
P. trichocarpa   41,335       33,428             7,907              15,486    827              8,307                 2.16
S. lycopersicum  34,727       26,193             8,534              17,086    541              13,080                1.53
S. tuberosum     35,119       28,570             6,549              16,146    637              12,476                1.77
V. vinifera      26,346       19,498             6,848              13,287    633              10,478                1.47
Z. mays          36,439       28,323             8,116              15,692    1,318            10,654                1.80
F. tataricum     33,366       24,242             9,124              13,660    1,025            9,283                 1.77

Unclustered gene: species-specific genes that are not assigned to any families.
Unique family: species-specific paralogous gene families.


Supplemental Table 15: GO enrichment of Tartary buckwheat-specific gene families.

GO ID       GO class            Gene frequency*  q value   Family frequency  Description
GO:0015074  biological_process  87/312           2.09E-26  33                DNA integration
GO:0006952  biological_process  22/98            6.36E-05  7                 defense response
GO:0009664  biological_process  8/30             2.16E-02  4                 plant-type cell wall organization
GO:0003676  molecular_function  220/1356         6.40E-36  73                nucleic acid binding
GO:0016758  molecular_function  34/222           1.68E-04  13                transferase activity, transferring hexosyl groups
GO:0043531  molecular_function  19/89            3.92E-04  9                 ADP binding
GO:0016747  molecular_function  26/164           8.04E-04  11                transferase activity, transferring acyl groups other than amino-acyl groups
GO:0045735  molecular_function  14/55            8.04E-04  3                 nutrient reservoir activity
GO:0030145  molecular_function  14/58            1.16E-03  3                 manganese ion binding
GO:0016705  molecular_function  37/304           2.48E-03  12                oxidoreductase activity, acting on paired donors, with incorporation or reduction of molecular oxygen
GO:0016788  molecular_function  27/192           2.48E-03  8                 hydrolase activity, acting on ester bonds
GO:0010333  molecular_function  10/50            3.31E-02  5                 terpene synthase activity
GO:0005506  molecular_function  41/414           3.62E-02  14                iron ion binding
GO:0008171  molecular_function  9/43             3.63E-02  2                 O-methyltransferase activity

*Tartary buckwheat-specific genes / all genes in each GO category.

Supplemental Table 16: Disease resistance genes.

              NBS-encoding
Species       NBS  CC-NBS-LRR  TIR-NBS-LRR  CC-NBS  TIR-NBS  NBS-LRR  Total  RLP  RLK
F. tataricum  48   16          0            6       0        52       122    83   493
A. thaliana   5    39          77           1       13       21       156    75   514
B. vulgaris   22   51          1            12      1        50       137    56   382
O. sativa     49   228         0            40      3        200      520    122  881

Abbreviations: NBS: nucleotide-binding site; CC: coiled-coil; LRR: leucine rich repeat; TIR: Toll/Interleukin-1 receptor; RLK: receptor like kinase; RLP: receptor like protein.
Genes were identified by the RGAugury pipeline (Li et al., 2016).


Supplemental Table 17: Genes in rutin biosynthetic pathway
See additional Excel file.

Supplemental Table 18: Expression value (FPKM) of genes involved in rutin biosynthesis and MYB genes.
See additional Excel file.

Supplemental Table 19: Tartary buckwheat genes homologous to known Al resistance genes
See additional Excel file.

Supplemental Table 20: Differentially expressed genes in root tips after exposure to Al treatment for 6 hours
See additional Excel file.

Supplemental Table 21: Expression value (FPKM) of ALMT genes and MATE genes.
See additional Excel file.


Supplemental Table 22: Summary of MYB transcription factor genes present in the Tartary buckwheat, sugar beet, Arabidopsis, and rice genomes.

Category  Subgroup      Tartary buckwheat  Sugar beet  Arabidopsis  Rice
R2R3      SG1           5                  3           5            7
R2R3      SG2           3                  2           3            5
R2R3      SG3           0                  0           2            0
R2R3      SG4           5                  3           4            5
R2R3      SG4-like      5                  1           0            0
R2R3      SG5           6                  1           1            2
R2R3      SG6           3                  0           4            0
R2R3      SG7           2                  1           3            2
R2R3      SG8           4                  0           2            0
R2R3      SG9           1                  1           2            3
R2R3      SG10          1                  2           2            0
R2R3      SG11          3                  2           3            1
R2R3      SG12          0                  0           3            0
R2R3      SG13          5                  1           4            8
R2R3      SG14          8                  5           6            9
R2R3      SG15          5                  0           4            0
R2R3      SG16          2                  1           3            3
R2R3      SG17          2                  1           3            3
R2R3      SG18          8                  2           6            5
R2R3      SG19          2                  1           2            0
R2R3      SG20          5                  2           6            7
R2R3      SG21          7                  3           7            4
R2R3      SG22          6                  2           4            1
R2R3      SG22-like     5                  0           0            0
R2R3      SG23          3                  1           3            1
R2R3      SG24          3                  2           3            3
R2R3      SG25          6                  2           7            5
R2R3      SG25-like     3                  0           0            0
R2R3      Unclassified  41                 17          35           31
R2R3      Total (R2R3)  149                56          127          105
R1R2R3    -             8                  3           5            5
4R        -             1                  1           1            0
Other     -             12                 7           9            7
Total     -             170                67          142          117


Supplemental Methods

Sequencing

We sequenced the genome of Tartary buckwheat (F. tataricum) cv. Pinku1 on the Illumina platform using the paired-end method (2x100 bp to 2x250 bp), obtaining a total of 87.4 Gb (~175x) of whole-genome shotgun (WGS) sequence from multiple PCR-free libraries. We also sequenced it on the Pacific Biosciences (PacBio) platform using P5-C3 chemistry for single-molecule, real-time (SMRT) sequencing, obtaining a total of 15.4 Gb (~31x) from 32 SMRT cells, with an average subread length of 5.9 kb and an N50 size of 8 kb (Supplemental Table 1). In addition, we obtained 12.4 Gb (24.8x) of fosmid mate-pair sequencing data (2x125 bp) and GBS tags from 1,728 pools of fosmid clones (~1,000 clones per pool). We conducted RNA-seq on five tissue samples, yielding a total of 75.4 Gb of paired-end short reads of 2x125 bp (Supplemental Table 6).

Genome assembly

The short reads and long reads were assembled separately. We used the MaSuRCA pipeline (http://www.genome.umd.edu/masurca.html) to assemble the Illumina short reads into 550.4 Mb of sequence contigs with an N50 size of 42.8 kb. For the PacBio long reads, we obtained 9.1 Gb of self-corrected sequences using the PBcR pipeline (http://wgs-assembler.sourceforge.net/wiki/index.php/PBcR), and then assembled them into 482.2 Mb of contig sequences with an N50 size of 212.2 kb. The PBcR assembly was corrected with Quiver (https://github.com/PacificBiosciences/GenomicConsensus) and compared to the short-read assembly. The comparison showed that >95% of the short-read contigs were covered by a long-read contig at a minimum sequence identity of 95%. This indicates that the long-read assembly is nearly complete, despite the low sequencing depth (~30x) of the SMRT reads. We then used Illumina short reads to polish the assembly by identifying and fixing small indels of 1-2 bp in homopolymer runs. The corrected genome size is 483 Mb.

We aligned each pool of fosmid GBS tags to the corrected SMRT sequences with BWA-mem (http://bio-bwa.sourceforge.net/) to retrieve the best-aligned reads (minimum alignment identity of 96%). Only the Illumina short reads that aligned over their full length to the SMRT reads, or that aligned over at least 50% of their length to the ends of the SMRT reads, were used to compute the coverage of each SMRT read. Approximately 15-20x (500-800 Mb) of corrected SMRT reads with the highest coverage by the GBS tags were selected for assembly with Falcon (https://github.com/PacificBiosciences/falcon, v0.3.0), and contigs shorter than 20 kb were discarded. The assembled fosmid contigs had a total length of 53 Gb, with an N50 size of 38 kb.
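The read-selection rule in this paragraph can be sketched as a simple predicate. This is only an illustration and not the authors' code: the `Alignment` record and its fields are hypothetical, and the sketch assumes that "aligned to the ends of the SMRT reads" means the alignment touches the first or last base of the SMRT read.

```python
from dataclasses import dataclass

@dataclass
class Alignment:
    """Hypothetical record for one GBS tag aligned to one corrected SMRT read."""
    tag_len: int      # length of the short read (GBS tag), in bp
    aligned_len: int  # number of tag bases covered by the alignment
    identity: float   # alignment identity, between 0 and 1
    ref_start: int    # 0-based alignment start on the SMRT read
    ref_end: int      # alignment end on the SMRT read (exclusive)
    ref_len: int      # length of the SMRT read, in bp

def counts_toward_coverage(a, min_identity=0.96, min_end_fraction=0.5):
    """Apply the identity cutoff, then the full-length or end-overlap rule."""
    if a.identity < min_identity:
        return False
    if a.aligned_len == a.tag_len:
        return True  # aligned over its full length
    # Assumption: an end alignment touches the first or last base of the SMRT read.
    touches_end = a.ref_start == 0 or a.ref_end == a.ref_len
    return touches_end and a.aligned_len >= min_end_fraction * a.tag_len

full = Alignment(125, 125, 0.98, 1000, 1125, 6000)
end_overlap = Alignment(125, 70, 0.97, 5930, 6000, 6000)
internal_partial = Alignment(125, 70, 0.97, 1000, 1070, 6000)
low_identity = Alignment(125, 125, 0.90, 1000, 1125, 6000)
```

Under these assumptions, `full` and `end_overlap` would count toward the coverage of the SMRT read, while `internal_partial` and `low_identity` would be ignored.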

The fosmid contigs were used to connect the PBcR contigs to generate super-contigs. First, we used fosmid mate-pair sequences to scaffold the PBcR contigs with SSPACE (https://wiki.gacrc.uga.edu/wiki/SSPACE). We then aligned the fosmid contigs to the scaffolds with BWA-mem. If a gap was covered by a fosmid contig, with a minimum sequence alignment identity of 97% and a minimum overlap length of 1 kb, it was filled using the fosmid contig sequence. Scaffolds with gaps that were not filled by fosmid contigs were cut back into contigs, and the scaffolding and gap filling were repeated iteratively until no more gaps could be filled. After this stringent gap filling, we obtained an assembly of 478.5 Mb with a contig N50 size of 350 kb. Next, we aligned all fosmid contigs to the connected super-contigs and the remaining contigs using BWA-mem. These contigs and super-contigs were further connected iteratively using the fosmid contigs with minimum overlap lengths of 10 kb, 5 kb, and 2 kb and minimum alignment identities of 97%, 98%, and 99%, respectively (at least 3 fosmid contigs supported each connection), with the method described by Du et al. (2017). After connection, we obtained a genome assembly of 480.8 Mb with an N50 size of 586 kb.
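The tiered connection criteria can be illustrated with a short Python sketch. It is not the method of Du et al. (2017); it only encodes one reading of the thresholds given above, in which each overlap length is paired with the identity listed in the same position (10 kb with 97%, 5 kb with 98%, 2 kb with 99%) and a connection needs at least 3 supporting fosmid contigs. All names are hypothetical.

```python
# (minimum overlap in bp, minimum identity), tried in the order given in the text.
TIERS = [(10_000, 0.97), (5_000, 0.98), (2_000, 0.99)]
MIN_SUPPORT = 3  # at least 3 fosmid contigs must support each connection

def is_supported(overlaps, min_overlap, min_identity, min_support=MIN_SUPPORT):
    """overlaps: (overlap_bp, identity) pairs, one per fosmid contig at a junction."""
    passing = [o for o in overlaps if o[0] >= min_overlap and o[1] >= min_identity]
    return len(passing) >= min_support

def first_passing_tier(overlaps):
    """Return the index of the first tier whose thresholds are met, or None."""
    for index, (min_overlap, min_identity) in enumerate(TIERS):
        if is_supported(overlaps, min_overlap, min_identity):
            return index
    return None

long_overlaps = [(12_000, 0.975), (11_000, 0.980), (10_500, 0.990)]
medium_overlaps = [(6_000, 0.985)] * 3
too_few = [(3_000, 0.995)] * 2
```

With these toy junctions, `long_overlaps` passes the first tier, `medium_overlaps` passes only the second, and `too_few` is rejected because just two fosmid contigs support it.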

The connected contigs included a super-contig of 251 kb, consisting of 19 small WGS contigs, which contained a circular chloroplast DNA of 159,268 bp that shared 99.23% sequence identity with a known Tartary buckwheat cpDNA (GenBank accession NC_027161.1), suggesting that the connecting process worked well. Another super-contig of 489,827 bp was found to be a partial mitochondrial DNA (it could not be circularized) that could be aligned with 80% of the predicted mitochondrial proteins in sugar beet using BLAST. The Illumina short reads were aligned to these connected contigs to verify their identity. The mtDNA was found to have a very high sequencing depth in the Illumina short reads (1148x vs. the whole-genome average of 100x).

Finally, we added a total of 8.5 Mb of MaSuRCA contigs, with a sequence identity <99% to themselves and a sequence identity <98% to the connected PBcR contigs, to generate the final assembly. The final genome size was 489.3 Mb without cpDNAs/mtDNAs or contaminants. We used fosmid mate-pair sequences to scaffold the contigs again with SSPACE and used the Hi-C data to cluster the scaffolds onto chromosomes with LACHESIS (http://shendurelab.github.io/LACHESIS/). In total, 451.4 Mb (92.22%) of the assembled sequences were clustered onto 8 chromosomes to form pseudomolecules with a maximum length of 68 Mb and a minimum length of 50 Mb (Supplemental Table 4). We found that 92.65% of the RNA-seq short reads from five developmentally distinct Tartary buckwheat tissues could be aligned to the assembled sequences (Supplemental Table 6). These data suggest that the genome assembly of Pinku1 is nearly complete, especially considering that RNA-seq data usually contain some contaminating sequences. We also used known genes in Tartary buckwheat to confirm these results (Supplemental Table 8). We used hybrid BioNano genome maps to validate the order of the contigs on the pseudomolecules; although there were large sequence gaps in the pseudomolecules, the order of the contigs was almost entirely correct (Supplemental Fig. 4).

References

Li, P., Quan, X., Jia, G., Xiao, J., Cloutier, S., and You, F. M. (2016). RGAugury: a pipeline for genome-wide prediction of resistance gene analogs (RGAs) in plants. BMC Genomics 17:852.