Discovery of Genes for Improved Cellulose and Cellulose-Extractability from Poplar Secondary Xylem
Jill L. Wegrzyn1, Jennifer M. Lee2, Andrew J. Eckert2, Charlyn J. Suarez2, Brian J. Stanton3, Mark F. Davis4, Chung-Jui Tsai5, David B. Neale1
1 Department of Plant Sciences, University of California at Davis, Davis, CA
2 Department of Evolution and Ecology, University of California at Davis, Davis, CA
3 Genetic Resources Conservation Program, Greenwood Resources, Portland, OR
4 National Renewable Energy Lab, Golden, CO
5 School of Forest Resources, Michigan Technological University, Houghton, MI
Project Objectives
• Resequence 40 candidate genes using a discovery panel of 15 unrelated poplar individuals
• Identify SNPs in the 40 genes using an automated alignment and SNP calling bioinformatics pipeline
• SNP genotype 456 poplar clones for 1536 SNPs (Illumina GoldenGate assay)
• Harvest wood increment cores from 2-3 ramets of each of the 456 poplar clones (1100 trees in total)
• Molecular Beam Mass Spectrometry (MBMS) analysis on all 1100 wood cores to develop secondary xylem metabolomic profiles
• Association genetics analyses to identify genes controlling cellulose quantity and quality phenotypic variation in poplar
Poplar Biofuels Genome Project
Project Overview
Selected Candidate Genes
40 genes highly expressed in wood-forming tissues and associated with lignin and cellulose biosynthesis
Gene Family | Gene Names
phenylalanine ammonia-lyase (PAL) | PAL2, PAL4, PAL5
cinnamate 4-hydroxylase (C4H) | C4H1, C4H2
4-coumarate:CoA ligase (4CL) | 4CL1, 4CL3, 4CL5
hydroxycinnamoyl-CoA quinate/shikimate hydroxycinnamoyltransferase (HCT) | HCT1, HCT6
coumarate 3-hydroxylase (C3H) | C3H3
ferulate 5-hydroxylase (F5H) | F5H1, F5H2
caffeate O-methyltransferase (COMT) | COMT1, COMT2
caffeoyl CoA O-methyltransferase (CCoAOMT) | CCoAOMT1, CCoAOMT2
cinnamoyl-CoA reductase (CCR) | CCR
cinnamyl alcohol dehydrogenase (CAD) | CAD
laccase (LAC) | LAC1a, LAC2, LAC90a
alpha-tubulin (TUA) | TUA1, TUA5
beta-tubulin (TUB) | TUB15, TUB9, TUB16
cellulose synthase (CesA) | CesA1A, CesA2A, CesA1B, CesA2B, CesA3A
sucrose synthase (SUSY) | SUSY1
cellulase (KOR) | KOR1
glycine decarboxylase complex, H subunit (gdcH) | gdcH1
glycine decarboxylase complex, T subunit (gdcT) | gdcT2
S-adenosylmethionine synthetase (SAMS) | SAMS1
serine hydroxymethyltransferase (SHMT) | SHMT1, SHMT3, SHMT6
Genomic sequences of ~179,000 bp covering the entire protein-coding regions, including introns, and 1,000 bp upstream and 300 bp downstream, were retrieved from JGI
Primer Design and Sequencing
Agencourt Biosciences
Primer Design:
mRNA sequences were used to direct custom software to use 1000 bp upstream along with intronic sequence from the poplar genome
517 primers were designed across 40 genes
203 non-overlapping primers were finally selected based on: quality score, position, homopolymer regions (bioinformatic validation)
Goal: Fully re-sequence 40 candidate genes to facilitate SNP discovery
~ between 3 and 12 amplicons/gene
~ total of 202 amplicons from Agencourt
~ forward and reverse sequencing
Candidate Genes Re-Sequenced from a Panel of 15 Unrelated Poplar Clones
DNA landmarks responsible for extraction
Alignment and SNP Calling Pipeline
Challenges in High-Throughput SNP Identification
• Alignment
Critical in the automation of base calls
– Commonly used Phrap (from PhredPhrap) is an assembler and is NOT ideal for alignments
– Many commonly used aligners work best with protein sequences or with a reference sequence
– Preservation of quality scores for input into SNP identification programs
– Speed for high-throughput programs
• Automated SNP Calls
- Reference Sequence Required
- Traditional approaches without reference sequence include “eSNPs” (human, maize, and pine)
  - Very little redundancy outside of abundant genes
  - Overall high number of false positives (single pass reads)
- Not specific to frequencies observed in different organisms
- High number of false positives in currently accepted methods
  - Polybayes & PolyPhred
Identification of SNPs in the 40 Candidate Genes
Automated Alignment and SNP Identification Pipeline
Re-sequencing data from Agencourt:
Initial Processing → Base Calling → Sequence Alignment → SNP Identification → Machine Learning → Data Storage & Release
Base Calling and Sequence Alignment
Modified PhredPhrap: Allows for trimming of bases from the start and end of a sequence based on trace quality
Ace2FASTA: Converts native PhredPhrap output (ace file) into an unaligned FASTA file
ProbconsRNA: Optimal DNA sequence alignment program
AlignedContig2ReadFASTA: Provides a single multi-FASTA file with all reads aligned to the contig from PhredPhrap AND the contig's alignment to the other contigs from ProbconsRNA
FASTA2Ace: Converts the resulting FASTA file back into an ace file for SNP identification
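The quality-based trimming attributed to the modified PhredPhrap can be sketched roughly as follows. This is an illustration only, not the pipeline's code: the function name `trim_by_quality` and the cutoff of 20 are assumptions that do not come from the slides.

```python
def trim_by_quality(bases, quals, min_q=20):
    """Trim low-quality bases from both ends of a read.

    bases: string of base calls.
    quals: list of Phred-style quality scores, one per base.
    min_q: hypothetical cutoff (not a value taken from the slides).
    """
    if len(bases) != len(quals):
        raise ValueError("bases and quals must have the same length")
    start, end = 0, len(bases)
    # Advance past low-quality bases at the start of the read.
    while start < end and quals[start] < min_q:
        start += 1
    # Step back over low-quality bases at the end of the read.
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return bases[start:end], quals[start:end]
```

Only the ends are trimmed; a low-quality base inside the read is kept so that downstream steps can still use its quality score.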
Alignment and SNP Identification
SNP Identification Overview
• Examine features to improve the accuracy of SNP location prediction
• Utilize machine learning to apply the features
• Refine the accuracy of the learning algorithm through adjustments to feature representation
• Utilize the classifier against the large re-sequenced set to improve accuracy of SNP calls originating from Polybayes and Polyphred
Alignment and SNP Identification
Existing SNP Identification Software
• Polyphred:
• http://droog.mbt.washington.edu/PolyPhred.html
– PolyPhred identifies potential SNPs using the base calls and peak information provided by Phred and the sequence alignments provided by Phrap
– SNP score based on base quality and sequence depth
• Polybayes:
• http://genome.wustl.edu/tools/software/polybayes.cgi
– Fully probabilistic SNP detection algorithm that identifies SNPs based on discrepancies at a given location of a multiple alignment.
– SNP score is based on a Bayesian statistical formulation and can take in prior frequency information
Alignment and SNP Identification
Feature Selection
Description | Representation
Sequence Depth | Continuous
Variation Type | Categorical
Polybayes Score | Continuous
Polyphred Score | Continuous
Freq of major/minor alleles | Continuous
Max quality of major/minor alleles | Continuous
Local average quality | Continuous
Overall average quality | Continuous
Alignment Quality | Continuous
Alignment and SNP Identification
Feature Representation
• Sequence depth
– Count of the number of sequences in the alignment at the position of variation.
– All sequences in the alignment may not overlap at the position of variation; the number is different from the total number of sequences in the alignment
• Variation type
– Variation type can be SNP or INDEL.
• PolyBayes score
– PolyBayes program assigns a Bayesian posterior probability value for each called SNP using the frequency priors given for observing a variation at that position.
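The sequence depth and variation type features can be illustrated with a small sketch. The alignment encoding is an assumption made for the example: reads are equal-length strings, `.` marks columns a read does not cover, and `-` marks a gap inside a read. The function name `column_features` is hypothetical.

```python
def column_features(aligned_reads, column, uncovered=".", gap="-"):
    """Return (depth, variation_type) for one alignment column.

    The '.' and '-' conventions are assumptions for this sketch,
    not the pipeline's actual file format.
    """
    # Only reads that overlap the column count toward depth.
    observed = [read[column] for read in aligned_reads if read[column] != uncovered]
    depth = len(observed)
    alleles = set(observed)
    if len(alleles) < 2:
        variation = None      # no variation at this column
    elif gap in alleles:
        variation = "INDEL"   # at least one read has a gap here
    else:
        variation = "SNP"     # two or more different bases
    return depth, variation


reads = ["ACGT..", "ACAT-A", "..GTCA", "ACGTCA"]
print(column_features(reads, 2))
print(column_features(reads, 4))
```

Depth is counted per column, so it can be smaller than the number of reads in the alignment, as the slide notes.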
Alignment and SNP Identification
Feature Representation
• Polyphred score
– Polyphred assigns a score calculated primarily from sequence depth and quality score.
• Base frequencies
– The number of occurrences of different bases at the position of variation is important in determining a polymorphic position.
– Frequencies of the most common base (major allele) and the second most common base (minor allele) are represented as ratios to sequence depth.
• Relative distance
– Sequence quality at the ends of the alignment tends to be poor due to inherent limitations of current sequencing technology.
– SNP position was represented as a ratio based on its distance in the consensus sequence from the closest end (the relative distance)
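A rough sketch of the base frequency and relative distance features follows. The slides do not say what the distance is divided by; dividing by the consensus length is an assumption here, as are the function names and the 0-based position.

```python
from collections import Counter


def allele_frequencies(observed_bases):
    """Return (major, minor) allele frequencies as ratios to sequence depth."""
    depth = len(observed_bases)
    if depth == 0:
        return 0.0, 0.0
    top_two = Counter(observed_bases).most_common(2)
    major = top_two[0][1] / depth
    minor = top_two[1][1] / depth if len(top_two) > 1 else 0.0
    return major, minor


def relative_distance(position, consensus_length):
    """Distance from the closest end, scaled by consensus length.

    position is 0-based; the denominator is an assumption of this sketch.
    """
    nearest_end = min(position, consensus_length - 1 - position)
    return nearest_end / consensus_length


print(allele_frequencies(list("GGGA")))
print(relative_distance(90, 100))
```

Under this scaling, values near 0 flag positions close to either end of the consensus, where read quality tends to be poor.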
Alignment and SNP Identification
Feature Representation
• Sequence quality
– A variation may be observed because of a poor-quality base.
– Based on the base frequencies calculated:
– maximum qualities of the major and minor alleles
– average qualities of major and minor alleles
• Alignment quality
– Misalignment of bases caused by sequence alignment programs sometimes results in an erroneous SNP call.
– In the neighborhood of the SNP (+/- 10 bases), all mismatches with the consensus sequence are given a penalty, and the penalty is larger if the mismatches are contiguous
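One way the alignment quality penalty could be computed is sketched below. The +/- 10 base window comes from the slide; the weights (`base`, `run_bonus`), the function name, and the choice to skip the SNP column itself are all assumptions, since the slides do not give the actual scoring.

```python
def alignment_penalty(read, consensus, snp_pos, window=10, base=1.0, run_bonus=1.0):
    """Penalty for mismatches between one aligned read and the consensus
    in the neighborhood of a SNP.

    Each mismatch costs `base`; a mismatch directly following another
    mismatch costs an extra `run_bonus`. Both weights are hypothetical.
    """
    lo = max(0, snp_pos - window)
    hi = min(len(consensus), snp_pos + window + 1)
    penalty = 0.0
    prev_mismatch = False
    for i in range(lo, hi):
        if i == snp_pos:
            # Assumption: the SNP column is not penalized, and a run of
            # mismatches does not continue across it.
            prev_mismatch = False
            continue
        mismatch = read[i] != consensus[i]
        if mismatch:
            penalty += base + (run_bonus if prev_mismatch else 0.0)
        prev_mismatch = mismatch
    return penalty


print(alignment_penalty("AAAACAAGGAAA", "AAAAAAAAAAAA", snp_pos=4))
```

Two adjacent mismatches score higher than two isolated ones, which matches the idea that contiguous mismatches point to a misalignment rather than independent sequencing errors.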
Alignment and SNP Identification
SNP Identification Datasets
• Training set for loblolly pine was composed of a total of 300 validated sequences.
– Divided to represent the relative percentages of sequence source
– Testing set is composed of 120 validated sequence sets
• Training set for poplar was composed of 42 validated sequences selected at random
– Testing set is composed of a total of 30 validated sequence sets.
• Validation = manually observed FP, FN, TP, and TN SNP calls through observation of tracefiles in Consed.
Alignment and SNP Identification
Classification
GOAL: Prediction
Learn a function or set of functions that assign a record to one of several predefined classes.
Decision tree: C4.5 is an open-source C program; J48 is its implementation in WEKA
– At each point in the construction of the decision tree, C4.5 selects the feature to test based on maximum information gain (normalized as the gain ratio in C4.5).
– The goal is to generate a minimum size tree that correctly classifies all the SNP calls in the training set.
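The split criterion can be illustrated with a short information gain calculation for a continuous feature. This is a sketch of the plain information gain idea only, not C4.5 or J48 themselves; the feature values, labels, and threshold are invented for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Gain from splitting a continuous feature at `threshold` (<= vs >)."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # the split separates nothing
    n = len(labels)
    remainder = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - remainder


# Hypothetical sequence depths and manually validated calls.
depths = [3, 5, 12, 15]
calls = ["FP", "FP", "TP", "TP"]
print(information_gain(depths, calls, threshold=8))
print(information_gain(depths, calls, threshold=4))
```

A tree builder would evaluate candidate thresholds for every feature and test the one with the highest score at each node.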
Alignment and SNP Identification
Evaluation Criteria
• Accuracy = (TP + TN)/total
• Sensitivity = TP/(TP + FN)
• Specificity = TN/(FP + TN)

Evaluation (%) | J48 | Polyphred | Polybayes
Accuracy | 93.6 | 76.25 | 78.02
Sensitivity | 88.21 | 83.22 | 86.54
Specificity | 98.73 | N/A | N/A

Evaluation (%) | J48 | Polyphred | Polybayes
Accuracy | 94.6 | 79.35 | 80.24
Sensitivity | 90.54 | 85.01 | 88.14
Specificity | 97.23 | N/A | N/A
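The three evaluation criteria translate directly into code. The counts in the usage line are invented for illustration and are not the study's confusion matrix.

```python
def evaluate(tp, tn, fp, fn):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
    }


# Hypothetical counts, for illustration only.
print(evaluate(tp=45, tn=40, fp=5, fn=10))
```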
PineSAP
• PineSAP alignment improves on:
– Inaccuracies introduced by using Phrap to align sequences
– Time which would be required by using an aligner such as ProbconsRNA or ClustalW on its own
– PineSAP has a 98% success rate when used to align loblolly resequencing data.
• PineSAP identified a successful list of features to enhance polymorphism predictions
• PineSAP obtained an overall prediction accuracy of 93% in SNP identification
• PineSAP provided a full alignment and polymorphism detection system that can be adapted to specific genomes
Alignment and SNP Identification
SNPs Identified
• Total of 202 amplicons
• Number of SNPs identified: 1486
– Meet a minimum confidence score from the PineSAP pipeline
• Average number of SNPs/amplicon ~ 7
• Amplicon length ~ 600 - 700 bp
• Remaining SNPs generated from 232 additional genes.
– Utilized an eSNP method with publicly available EST data and reference genome from JGI.
– Identified a total of 1,232 potential SNPs
Alignment and SNP Identification
SNP Formatting
Polyphred style output is transformed into Illumina style input
- adding IUPAC codes for SNPs in flanking sequence
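A minimal sketch of this formatting step is shown below. It assumes the bracketed `[A/G]` notation commonly used in assay design files for the target SNP and uses the standard two-base IUPAC ambiguity codes for other SNPs in the flanking sequence. The function name and input layout are hypothetical and do not reflect the pipeline's real file formats.

```python
# Standard IUPAC ambiguity codes for two-base polymorphisms.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def illumina_style(sequence, target_pos, target_alleles, flanking_snps):
    """Mark the target SNP as [X/Y] and recode flanking SNPs with IUPAC codes.

    flanking_snps maps a 0-based position to its pair of alleles.
    """
    out = []
    for i, base in enumerate(sequence):
        if i == target_pos:
            out.append("[{}/{}]".format(*target_alleles))
        elif i in flanking_snps:
            out.append(IUPAC[frozenset(flanking_snps[i])])
        else:
            out.append(base)
    return "".join(out)


print(illumina_style("ACGTACGTAC", 4, ("A", "G"), {7: ("C", "T")}))
```

Recoding flanking SNPs makes nearby polymorphisms visible to the assay design step instead of hiding them behind a single consensus base.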
SNP Genotyping
Illumina GoldenGate Assay
Alignment and SNP Identification
Illumina Design
Alignment and SNP Identification
SNP Selection
• All SNP and amplicon information is databased.
• SQL queries can be used to select specific SNPs
– Pair-wise comparisons of all SNPs
– Scores were assigned to each pair of SNPs in each amplicon, accounting for distance between the SNPs, Illumina score for both SNPs, and frequency of minor allele
• We can also use SQL queries to select SNPs and minimize additional SNPs in flanking sequence
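A pair-wise SQL selection of this kind might look like the sketch below, which uses an in-memory SQLite database. The table layout, column names, and especially the scoring formula are assumptions made for illustration; the slides name the inputs to the score (distance, Illumina scores, minor allele frequency) but not how they were combined.

```python
import sqlite3


def score_pairs(rows):
    """Score every pair of SNPs within each amplicon.

    rows: (snp_id, amplicon, position, illumina_score, maf) tuples.
    The schema and the additive score are hypothetical.
    """
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE snp (snp_id TEXT, amplicon TEXT, position INTEGER, "
        "illumina_score REAL, maf REAL)"
    )
    con.executemany("INSERT INTO snp VALUES (?, ?, ?, ?, ?)", rows)
    query = """
        SELECT a.amplicon, a.snp_id, b.snp_id,
               (b.position - a.position) / 1000.0
               + a.illumina_score + b.illumina_score
               + a.maf + b.maf AS pair_score
        FROM snp AS a
        JOIN snp AS b
          ON a.amplicon = b.amplicon AND a.position < b.position
        ORDER BY a.amplicon, pair_score DESC
    """
    result = con.execute(query).fetchall()
    con.close()
    return result


# Invented example data.
example = [
    ("s1", "amp1", 50, 0.9, 0.30),
    ("s2", "amp1", 400, 0.8, 0.25),
    ("s3", "amp1", 120, 0.5, 0.05),
]
for row in score_pairs(example):
    print(row)
```

The self-join with `a.position < b.position` lists each pair once, so the top-ranked row per amplicon is the preferred pair under whatever scoring is chosen.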
Pyrolysis Molecular Beam Mass Spectrometry Analysis
[Figure labels: mass spectrometer; transfer line and interface; autosampler; 48 sample tray]
Cell wall chemistry: lignin, hemicellulose, cellulose
Acknowledgements
Chung-Jui Tsai, Mike Davis, David Neale, Jill Wegrzyn, Jennifer Lee, Andrew Eckert, John Liechty
Funding:
Brian Stanton, Rich Shuren