Post on 15-Jan-2016
Genetics and Molecular Biology Tutorial II -- Computational Perspective
The goal is to introduce selected topics to readers with minimal background in genetics or biology, while including examples that hold the interest of readers with extensive backgrounds in those fields.
2
Outline
Gene structure
– genomic structure vs mRNA structure
– coding and noncoding exons
– introns
– primary transcript processing
  aside -- nonsense mediated mRNA degradation
– alternative splicing and differential polyadenylation
– evolutionary conservation of coding and noncoding sequences
3
Outline…
Genomic structure
– repetitive sequences
  LINEs and SINEs
– example -- Y chromosome palindromes
– C value paradox
– genomes of model organisms
  example
  – yeast genome and gene-chip
  – single/double knockouts
– cross-species sequence similarities for putative function identification
  example -- “chaperonin”
4
Fundamental Genetics and Probability Concepts
meiosis and sampling
patterns of inheritance
monogenic and complex inheritance
– phenocopy
– reduced penetrance
DNA variation
– polymorphisms, SNPs, and mutations
positional cloning
5
Gene Structure
6
Transcript Processing
DNA -> pre-mRNA -> mRNA -> protein
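The flow on this slide can be sketched as a toy pipeline. This is only an illustration: the gene sequence, the exon coordinates, and the partial codon table below are made up for the example, and 5' capping and polyadenylation are not modeled.

```python
# Toy walk-through of DNA -> pre-mRNA -> mRNA -> protein.
# GENE, EXONS and the partial CODON_TABLE are hypothetical illustration data.

CODON_TABLE = {
    "AUG": "M", "GCU": "A", "UUU": "F", "GGC": "G",
    "UAA": "*", "UAG": "*", "UGA": "*",
}

def transcribe(coding_strand: str) -> str:
    """The pre-mRNA carries the coding-strand sequence with U in place of T."""
    return coding_strand.upper().replace("T", "U")

def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Join exon intervals (0-based, end-exclusive); introns are dropped."""
    return "".join(pre_mrna[start:end] for start, end in exons)

def translate(mrna: str) -> str:
    """Translate from the first AUG until the first stop codon."""
    start = mrna.find("AUG")
    if start < 0:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "X")  # X = codon not in toy table
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

# exon 1 (6 nt) + intron (12 nt, GT...AG) + exon 2 (9 nt)
GENE = "ATGGCT" + "GTAAGTTTTCAG" + "TTTGGCTAA"
EXONS = [(0, 6), (18, 27)]

pre_mrna = transcribe(GENE)
mrna = splice(pre_mrna, EXONS)
protein = translate(mrna)
```

With these made-up inputs the spliced mRNA is AUGGCUUUUGGCUAA, which reads as four codons followed by a stop.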
7
Nonsense mediated mRNA degradation
– unknown mechanism
– more rapidly degrades mRNA containing premature stop (nonsense) codons
– Lykke-Andersen, “mRNA quality control: Marking the message for life or death.” Current Biology, 11, 2001.
8
Nonsense Mediated mRNA Degradation
9
Genome Structure -- repeat classes
Class (blocks)               Size of Repeat   Chr Locations
Megasatellite (100s of kb)   several kb       various locations
  RS447                      4.7 kb           ~50-70 copies on 4, several on 8
  untitled                   2.5 kb           ~400 copies on 4 and 19
  untitled                   3.0 kb           ~50 copies on X
Satellite (100 kb to Mbs)    5-171 bp         centromeric
  alphoid                    171 bp           centromeric hetero, all chrs
  Sau3A family               68 bp            centromeric hetero, 1 9 13 14 15 21 22 6
  satellite 1 (AT rich)      25-48 bp         centromeric hetero, most chrs
  satellites 2 and 3         5 bp             most chrs
Minisatellite (0.1-20 kb)    6-64 bp          at or close to telomeres
  telomeric family           6 bp             all telomeres
  hypervariable family       9-64 bp          all chrs, often near telomeres
Microsatellite (<150 bp)     1-4 bp           dispersed through all chromosomes
10
C-Value Paradox
Hartl, “Molecular melodies in high and low C,” Nat. Rev. Genetics, Nov 2000
refers to the massive, counterintuitive and seemingly arbitrary differences in genome size observed in eukaryotic organisms
– Drosophila melanogaster 180 Mb
– Podisma pedestris 18,000 Mb
– difference is difficult to explain in view of apparently similar levels of evolutionary, developmental, and behavioral complexity
11
Alternative Splicing
Every conceivable pattern of alternative splicing is found in nature. Exons can have multiple 5’ or 3’ splice sites that are used alternatively (a, b). A single cassette exon can reside between 2 constitutive exons such that the alternative exon is either included or skipped (c). Multiple cassette exons can reside between 2 constitutive exons such that the splicing machinery must choose between them (d). Finally, introns can be retained in the mRNA and become translated.
Graveley, “Alternative splicing: increasing diversity in the proteomic world.” Trends in Genetics, Feb., 2001.
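To see how quickly cassette exons multiply the number of possible transcripts, here is a minimal sketch that enumerates every include/skip combination. It assumes each cassette exon is chosen independently, which real splicing regulation need not allow; the exon names are hypothetical.

```python
from itertools import product

def cassette_isoforms(first_exon, cassettes, last_exon):
    """List every include/skip combination of independent cassette exons
    sitting between two constitutive exons."""
    isoforms = []
    for choices in product([False, True], repeat=len(cassettes)):
        middle = [exon for exon, keep in zip(cassettes, choices) if keep]
        isoforms.append([first_exon, *middle, last_exon])
    return isoforms

# Hypothetical gene: constitutive exons E1 and E5, three cassette exons.
isoforms = cassette_isoforms("E1", ["c1", "c2", "c3"], "E5")
for isoform in isoforms:
    print("-".join(isoform))
```

Under the independence assumption, n cassette exons give 2^n candidate isoforms, so three cassettes already give eight.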
12
Classic View of Gene No Longer Valid -- Strachan pg 185
Mechanism                        Frequency/Examples
multigenic transcription units   rare. 18S, 28S, and 5.8S rRNA; mitochondria
alternative promoters            common. dystrophin gene (8)
alternative splicing             very frequent. slo gene (8 cassettes), >500 mRNAs
alternative polyadenylation      common. calcitonin gene (2)
RNA editing                      extremely rare. apolipoprotein B gene (tissue specific editing -- codon changed)
post-translational cleavage      rare. may generate functionally related polypeptides -- hormones, insulin
13
Alternative Splicing Example -- Graveley 2001
14
Alternative PolyAdenylation
common in human RNA (Edwards-Gilbert 1997)
in many genes, 2 or more poly-A signals in 3’ UTR
– alternative transcripts can show tissue specificity
alternative poly-A signals may be brought into play following alternative splicing
15
Edwards-Gilbert. Nucleic Acids Res, 13, 1997
16
Evolution of the mitochondrial genome and origin of eukaryotic cells
17
Evolutionary Conservation of Coding and Noncoding Sequences
Sequencing of H. sapiens and model organisms is the basis for comparative genomics
Generally, functional solutions (encoded as genes) are conserved across organisms, which allows us to compare gene sequences and infer function
protein functional/structural regions == “domains”
Intergenic regions are generally not conserved (always exceptions)
18
Example - MKKS (UniGene Clusters)
human  rat    87.4 %
human  mouse  84.9 %
human  cow    87.1 %
mouse  rat    97.8 %
rat    cow    91.0 %
mouse  cow    85.1 %
frog   rat    62.5 %
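Pairwise percentages like those listed for MKKS can be produced in several ways; the slide does not say which method was used. As one hedged illustration, the sketch below computes percent identity over the ungapped columns of an already-aligned pair. The two short sequences are made up.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of identical positions between two aligned sequences.
    Columns where either sequence has a gap ('-') are skipped."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared

# Hypothetical aligned fragments with two mismatches in ten columns.
identity = percent_identity("ATGGCTTTTG", "ATGGCATTCG")
```

Other conventions (counting gaps as mismatches, or dividing by the shorter sequence length) give different numbers, so the denominator should always be stated alongside the percentage.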
19
Example - MKKS
20
21
Computational Approach to Using Conserved Regions
Problem -- want to screen genes for mutations
Conventional approach -- screen all exons of a single gene
Alternative -- identify domains within multiple genes, and screen domains first, to optimize screening time and resources
22
Cross-Species Similarities
yeast
– gene chip for hybridization/expression
– complete genome (first eukaryote)
– single knockouts and double knockouts
23
Fundamental Genetics
meiosis
– humans (H. sapiens, Hs) are diploid
– meiosis produces haploid gametes
– mechanism for transmission of genetic material to offspring
– recombination by cross-over (Holliday structure) or by independent segregation of homologous pairs
24
Fundamental Genetics (Background for Linkage Analysis)
Rule of Segregation
– offspring receive ONE allele (genetic material) from the pair of alleles possessed by EACH parent
Rule of Independent Assortment
– alleles of one gene can segregate independently of alleles of other genes
– (Linkage Analysis relies on the violation of Independent Assortment Rule)
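Both rules can be checked by exact enumeration rather than simulation. The sketch below builds a one-gene Punnett square; genotypes are written as two-letter strings such as "Aa", which is a convention chosen for this example.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def cross(parent1: str, parent2: str) -> dict[str, Fraction]:
    """Offspring genotype probabilities for one gene.
    Each parent passes on ONE of its two alleles with equal probability."""
    counts = Counter(
        "".join(sorted(allele1 + allele2))  # "aA" and "Aa" are the same genotype
        for allele1, allele2 in product(parent1, parent2)
    )
    total = sum(counts.values())
    return {genotype: Fraction(n, total) for genotype, n in counts.items()}

# Rule of segregation: Aa x Aa gives the 1:2:1 ratio.
offspring = cross("Aa", "Aa")

# Rule of independent assortment: for unlinked genes, probabilities multiply.
p_double_het = cross("Aa", "Aa")["Aa"] * cross("Bb", "Bb")["Bb"]
```

For linked genes the last multiplication no longer holds, and the size of the deviation is what linkage analysis measures.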
25
Genetic Marker … Prelude to Linkage Analysis (LA)
– A genetic marker allows for the observation of the genetic state at a particular genomic location (locus). A genotype is the measured state of a genetic marker. It may never be feasible to sequence cases directly.
– An “informative” marker is often “heterozygous,” or “polymorphic,” and enables the observation of the inheritance of genetic material.
26
Monogenic and Polygenic Diseases
– monogenic (Mendelian) -- one gene
  “simple” (dominant and recessive) Mendelian inheritance
  direct correspondence between one gene mutation and one disorder
  majority of disease genes found are monogenic
– polygenic -- (complex) multiple genes
  heterogeneity and epistasis
  combinatorics
  no longer have direct correspondence between one gene and disorder
  majority of disorders are probably polygenic
– complexity of organisms and observed pathways
27
...Monogenic and Polygenic Diseases
phenocopy
reduced penetrance
– Example -- sickle cell anemia
  “classic” recessive disorder
  defect in red blood cells (hemoglobin)
  but… infant (fetal) hemoglobin gene can “leak”
  wide range of phenotypes
28
Examples
29
Examples
30
Example
31
BBS4 Pedigree
32
Hardy-Weinberg Equilibrium
Rule that relates allelic and genotypic frequencies in a population of diploid, sexually reproducing individuals if that population has random mating, large size, no mutation or migration, and no selection
Properties (when these assumptions hold)
– allelic frequencies will not change in a population from one generation to the next
– genotypic frequencies are determined in a predictable way by allelic frequencies
– the equilibrium is neutral -- if perturbed, it will reestablish within one generation of random mating at the new allelic frequency
33
34
H-W
$f(AA) = p^2$
$f(Aa) = 2pq$
$f(aa) = q^2$
$(p+q)^2 = p^2 + 2pq + q^2$
$(p+q+r)^2 = p^2 + q^2 + r^2 + 2pq + 2pr + 2qr$
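The two- and three-allele expansions follow one pattern: homozygotes get the squared allele frequency, heterozygotes get twice the product. A minimal sketch, with allele names and frequencies chosen only for illustration:

```python
def hardy_weinberg(allele_freqs: dict[str, float]) -> dict[str, float]:
    """Expected genotype frequencies under Hardy-Weinberg equilibrium
    for any number of alleles at one locus."""
    if abs(sum(allele_freqs.values()) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    alleles = sorted(allele_freqs)
    genotypes = {}
    for i, a in enumerate(alleles):
        for b in alleles[i:]:
            product = allele_freqs[a] * allele_freqs[b]
            # homozygote: p^2; heterozygote: 2pq
            genotypes[a + b] = product if a == b else 2 * product
    return genotypes

two_alleles = hardy_weinberg({"A": 0.7, "a": 0.3})
three_alleles = hardy_weinberg({"A": 0.5, "B": 0.3, "C": 0.2})
```

In both cases the genotype frequencies sum to 1, which is the statement that $(p+q)^2 = 1$ or $(p+q+r)^2 = 1$.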
35
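Dominant and Recessive Penetrance Modeled
penetrance = P(pt | gt), the probability of the phenotype given the genotype
dominant, full penetrance
DD    Dd    dd
1     1     0
dominant, reduced penetrance
DD    Dd    dd
0.9   0.9   0.0
recessive, full penetrance
DD    Dd    dd
0     0     1
recessive, reduced penetrance
DD    Dd    dd
0     0     0.8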
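A penetrance vector becomes useful once it is combined with genotype frequencies. The sketch below computes the expected fraction of affected individuals for each of the four models, assuming Hardy-Weinberg genotype frequencies; the allele frequencies passed in are arbitrary illustration values.

```python
# Penetrance vectors P(affected | genotype), taken from the four tables.
PENETRANCE = {
    "dominant":          {"DD": 1.0, "Dd": 1.0, "dd": 0.0},
    "dominant_reduced":  {"DD": 0.9, "Dd": 0.9, "dd": 0.0},
    "recessive":         {"DD": 0.0, "Dd": 0.0, "dd": 1.0},
    "recessive_reduced": {"DD": 0.0, "Dd": 0.0, "dd": 0.8},
}

def prevalence(freq_D: float, penetrance: dict[str, float]) -> float:
    """P(affected) = sum over genotypes of P(genotype) * P(affected | genotype).
    Genotype frequencies are assumed to follow Hardy-Weinberg."""
    p, q = freq_D, 1.0 - freq_D
    genotype_freq = {"DD": p * p, "Dd": 2 * p * q, "dd": q * q}
    return sum(genotype_freq[g] * penetrance[g] for g in genotype_freq)

# Hypothetical frequency of allele D: 1% (so the d allele is at 99%).
dominant_full = prevalence(0.01, PENETRANCE["dominant"])
dominant_reduced = prevalence(0.01, PENETRANCE["dominant_reduced"])
# In the recessive tables the d allele causes disease; take d at 1%.
recessive_full = prevalence(0.99, PENETRANCE["recessive"])
```

The same genotype-weighted sum is the building block a likelihood-based pedigree analysis evaluates for every individual.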
36
D-R Heterogeneous, DD Epistatic
dominant-recessive heterogeneity
      AA   Aa   aa
BB    1    1    0
Bb    1    1    0
bb    1    1    1
reduced penetrance 3, 9, 27, 81, 243 … 3^n
dominant-dominant epistasis
      AA   Aa   aa
BB    1    1    0
Bb    1    1    0
bb    0    0    0
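The two tables can be generated from simple logical rules, which makes the difference between heterogeneity (either locus suffices) and epistasis (both loci are required) explicit. This is a sketch under the reading that locus A acts dominantly in both tables, while locus B acts recessively in the first and dominantly in the second.

```python
GENOTYPES_A = ["AA", "Aa", "aa"]
GENOTYPES_B = ["BB", "Bb", "bb"]

def acts_dominant(genotype: str) -> bool:
    """At least one upper-case allele is present."""
    return any(allele.isupper() for allele in genotype)

def acts_recessive(genotype: str) -> bool:
    """Both alleles are lower-case."""
    return all(allele.islower() for allele in genotype)

def two_locus_table(rule):
    """Rows are locus B genotypes, columns are locus A genotypes."""
    return [[int(rule(a, b)) for a in GENOTYPES_A] for b in GENOTYPES_B]

# Heterogeneity: dominant at A OR recessive at B is enough to be affected.
heterogeneity = two_locus_table(lambda a, b: acts_dominant(a) or acts_recessive(b))
# Epistasis: dominant at A AND dominant at B are both required.
epistasis = two_locus_table(lambda a, b: acts_dominant(a) and acts_dominant(b))

# Each biallelic locus has 3 genotypes, so n loci give 3**n table cells.
cells = [3 ** n for n in range(1, 6)]
```

With reduced penetrance every cell becomes a free parameter between 0 and 1, which is why the 3^n growth makes multi-locus models hard to fit.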
37
Dom-Rec Heterozygous
Screen genes A, B?, b
38
Uninformative Marker
39
Informative Marker
40
Given the following observations: family structure, affection status, genotypes, and disease allele frequencies, and assuming a model for the disease, can we calculate the probability that these observations “fit” the assumed model?
41
Linkage
42
Linkage Analysis
Goal: find a marker “linked” to a disease gene.
LOD score = log of likelihood ratio
LR[θ; data] == k P[data; θ]
θ (theta) = estimate of genetic distance (recombination fraction) between marker and disease
  = proportion of recombinant gametes/total gametes
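For the simplest case the LOD score has a closed form. The sketch below assumes phase-known, fully informative meioses in which recombinants can be counted directly, and compares the likelihood at a given θ against free recombination (θ = 0.5) on a log10 scale. Real pedigree programs handle unknown phase and missing data, which this does not.

```python
import math

def lod_score(recombinants: int, meioses: int, theta: float) -> float:
    """log10 of L(theta) / L(0.5) for directly counted recombinants."""
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie in [0, 0.5]")
    non_recombinants = meioses - recombinants
    likelihood = theta ** recombinants * (1.0 - theta) ** non_recombinants
    if likelihood == 0.0:
        return float("-inf")  # theta = 0 is impossible once a recombinant is seen
    return math.log10(likelihood / 0.5 ** meioses)

def max_lod(recombinants: int, meioses: int, steps: int = 50):
    """Grid search over theta in [0, 0.5]; returns (best theta, best LOD)."""
    thetas = [0.5 * i / steps for i in range(steps + 1)]
    best = max(thetas, key=lambda t: lod_score(recombinants, meioses, t))
    return best, lod_score(recombinants, meioses, best)

# Hypothetical data: 1 recombinant among 10 informative meioses.
best_theta, best_lod = max_lod(1, 10)
```

The maximizing θ equals the observed recombinant proportion (here 0.1), matching the definition of θ as recombinant gametes over total gametes.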
43
…Linkage Analysis
Linkage analysis calculates the likelihood that the inheritance pattern of the phenotype (disease) is supported by the observed inheritance patterns (genotypes) in a pedigree.
– few monogenic models, easy to test
– more difficult to find models explaining inheritance in polygenic models
– parameter maximization
44
Linkage Analysis Programs
FASTLINK - 2 point
– O(n^2), where n = number of markers
GeneHunter - multipoint, 2 point
– O(n^2), where n = number of people
45
Allele Sharing
tries to show that affected family members inherit the same chromosomal regions more often than expected by chance
46
Allele Sharing Example
Needs at least sibs.
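One common way to make "more often than expected by chance" concrete is with affected sib pairs: by chance, sibs share 0, 1, or 2 alleles identical by descent (IBD) with probabilities 1/4, 1/2, 1/4. The sketch below compares observed counts with that expectation. It is an illustration with made-up counts, not necessarily the method shown on the slide.

```python
def sib_pair_sharing(n_ibd0: int, n_ibd1: int, n_ibd2: int):
    """Return (mean proportion of alleles shared IBD, chi-square statistic)
    for affected sib pairs, against the 1/4 : 1/2 : 1/4 null expectation."""
    observed = (n_ibd0, n_ibd1, n_ibd2)
    n = sum(observed)
    if n == 0:
        raise ValueError("need at least one sib pair")
    expected = (n / 4, n / 2, n / 4)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    mean_sharing = (0.5 * n_ibd1 + 1.0 * n_ibd2) / n  # null value is 0.5
    return mean_sharing, chi2

# Hypothetical counts for 100 affected sib pairs at one marker.
mean_sharing, chi2 = sib_pair_sharing(10, 50, 40)
```

The chi-square statistic here has 2 degrees of freedom; a mean sharing well above 0.5 points to a region inherited together with the disease.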
47
Association Studies
“Allelic association studies provide the most powerful method for locating genes of small effect contributing to complex diseases and traits.” Daniels, Am J Hum Genet 62:1189-1197, 1998.
Linkage analysis
– genome wide screen, 400 markers ~ 10 cM (10 Mb); association needs 4000+ polymorphic markers
– generally need nuclear family or larger
Association finds “linkage disequilibrium”
48
Association Studies
“Association is simply a statistical statement about the co-occurrence of alleles or phenotypes. Allele A is associated with disease D if people who have D also have A more (or maybe less) often than would be predicted from the individual frequencies of D and A in the population.” Pg. 286 Human Molecular Genetics 2, Tom Strachan
49
Examples
HLA-DR4 (antigen marker)
– 36% in UK
– 78% with rheumatoid arthritis
CF (RFLP markers XV2.c (X1, X2), KM19 (K1, K2))
– Marker Alleles   CF (case)   Normal (control)
– X1, K1           3           49
– X1, K2           147         19
– X2, K1           8           70
– X2, K2           8           25
– CF associated with X1, K2 in ‘89 (Strachan)
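The CF counts can be collapsed into a 2x2 table (X1,K2 versus all other marker combinations, cases versus controls) to quantify the association. The sketch below uses a plain odds ratio and a Pearson chi-square without continuity correction; these are standard choices made for illustration, not necessarily the analysis behind the slide.

```python
def two_by_two(case_with, case_without, control_with, control_without):
    """Odds ratio and Pearson chi-square (1 df, no continuity correction).
    Assumes no cell that appears in a denominator is zero."""
    a, b, c, d = case_with, case_without, control_with, control_without
    n = a + b + c + d
    odds_ratio = (a * d) / (b * c)
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return odds_ratio, chi2

# Counts from the CF example.
CASES = {"X1K1": 3, "X1K2": 147, "X2K1": 8, "X2K2": 8}
CONTROLS = {"X1K1": 49, "X1K2": 19, "X2K1": 70, "X2K2": 25}

marker = "X1K2"
odds_ratio, chi2 = two_by_two(
    CASES[marker], sum(CASES.values()) - CASES[marker],
    CONTROLS[marker], sum(CONTROLS.values()) - CONTROLS[marker],
)
case_freq = CASES[marker] / sum(CASES.values())           # 147 of 166
control_freq = CONTROLS[marker] / sum(CONTROLS.values())  # 19 of 163
```

X1,K2 appears on roughly 89% of case entries but only about 12% of control entries, which is the kind of gap the association approach looks for.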
50
Linkage Disequilibrium
linkage equilibrium (the two-locus counterpart of Hardy-Weinberg) is true if
– P(gt1,gt1’;gt2,gt2’) = P(gt1,gt1’)*P(gt2,gt2’)
where [P(haplotype)] case vs controls
TDT (heterozygous marker transmitted), HRR (untransmitted alleles as control)
allelic associations (outbred populations) maintained at only <= 1 cM
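The product rule above suggests a direct measure of departure from equilibrium: D = P(AB) - P(A)P(B), often reported together with r^2. The sketch below computes both from four haplotype counts. D and r^2 are standard measures but are not named on the slide, and treating the CF control counts as two-locus haplotype counts is an assumption made for this example.

```python
def linkage_disequilibrium(n_AB: int, n_Ab: int, n_aB: int, n_ab: int):
    """Return (D, r^2) for two biallelic loci from haplotype counts."""
    n = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / n
    p_A = (n_AB + n_Ab) / n
    p_B = (n_AB + n_aB) / n
    d = p_AB - p_A * p_B  # zero under linkage equilibrium
    denominator = p_A * (1 - p_A) * p_B * (1 - p_B)
    r2 = d * d / denominator if denominator > 0 else 0.0
    return d, r2

# CF example controls, read as X1K1, X1K2, X2K1, X2K2.
d_controls, r2_controls = linkage_disequilibrium(49, 19, 70, 25)

# Hypothetical extremes for comparison.
d_none, r2_none = linkage_disequilibrium(25, 25, 25, 25)  # equilibrium
d_full, r2_full = linkage_disequilibrium(50, 0, 0, 50)    # complete LD
```

For the control counts D is close to zero, so the two markers look nearly independent among controls.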
51
Equilibrium
52
“SNPs”
Single-Nucleotide Polymorphisms
1 every 1000 bp (estimated)
2,972,052 SNPs submitted to dbSNP
– dbSNP summary link
– 50% of all SNPs are in question
– 10% of UTRs have SNPs
100,000 - 500,000 SNPs needed
Why don’t we do this?
– $$$
53
Homozygosity Mapping
54
Positional Cloning
55
Disease Gene Identification
SSCP -- single strand conformational polymorphism
PCR -- polymerase chain reaction– primers amplify template sequence
direct sequencing
BBS2 (Bardet-Biedl Syndrome)
56
BBS2 genetic mapping
[Figure: lanes C16, 1-12]
57
BBS2 genetic mapping
[Figure: lanes C16, 1-12, labeled unaffected / affected]
58
BBS4 Gene (Direct Sequencing)(Hs.26471)
59
BBS4 Deletion (by PCR)
exons 3 4
60
BBS4 Mutations (direct sequencing)
(R295P)
61
Summary
Disease Gene Identification
– challenges
– interval localization
  genotyping and genetic markers, linkage analysis, allele sharing, association studies (“SNiPs”), homozygosity mapping
– disease gene identification techniques
Take home
– A complex disorder (with interacting genes) has yet to be characterized
62
Demo -- installing a database
A database organizes data
Most common
– relational database (Oracle, Sybase)
– perceived as a collection of tables,
– where table is an unordered collection of rows
– each row has a fixed number of fields, and each field can store a predefined type of data value (date, integer, string, etc.)
simplest
– flat file
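The relational idea can be tried without installing a server. The sketch below uses SQLite through Python's standard sqlite3 module as a stand-in for the Oracle or Sybase systems named above. The table layout, individual IDs, and marker names are hypothetical.

```python
import sqlite3

# In-memory database: one table, fixed typed fields per row.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE genotype (
        individual TEXT    NOT NULL,
        marker     TEXT    NOT NULL,
        allele1    INTEGER NOT NULL,
        allele2    INTEGER NOT NULL
    )
    """
)

# Hypothetical genotype rows (allele sizes in bp).
rows = [
    ("ind1", "M1", 120, 124),
    ("ind2", "M1", 120, 120),
    ("ind3", "M1", 118, 124),
    ("ind1", "M2", 201, 201),
]
conn.executemany("INSERT INTO genotype VALUES (?, ?, ?, ?)", rows)

# Rows are unordered, so any required order is requested explicitly.
heterozygotes = conn.execute(
    "SELECT individual FROM genotype "
    "WHERE marker = ? AND allele1 <> allele2 ORDER BY individual",
    ("M1",),
).fetchall()
conn.close()
```

The flat-file equivalent would be one line per genotype in a text file; it is simpler to create, but every query then means rereading and filtering the whole file by hand.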
63
Databases
NCBI
BLAST
Amazon
Yahoo
Several of our own
– genotypes
– rat ESTs
– eye clones from differential display
– micro-array data
64
This space intentionally left blank