COMP 691R
Bioinformatics Algorithms
Lecture 1a
Course Outline
Classes
Fridays 17:45 to 20:15 in H-605
Instructor
Professor Greg Butler
Room: ER-603-53
Tel: 848 3031
http://www.cs.concordia.ca/~faculty/gregb/
Office Hours and Discussion
In class (preferable).
By appointment (email me with suggested time)
Lab
H-968
Bioinformatics Algorithms
— to cover the major algorithms used in bioinformatics
— emphasize algorithmic principles
Important
— algorithms use available databanks!
— practical performance of algorithms
— use on “farms” or “clusters”
Information Sources
See web site: http://www.cs.concordia.ca/~faculty/gregb/home/c691R.html
Books in Webster Library Reserve.
Selected journal and conference articles. Web sites.
Evaluation
four programming assignments (40%)
two data analysis assignments (20%)
three-hour final examination (40%)
Bioinformatics Algorithms — Real Objectives
Cover the necessary background on bioinformatics so that you can understand the following three papers:
L. Duret and S. Abdeddaim, Multiple alignments for structural, functional, or phylogenetic analyses of homologous sequences. In Bioinformatics: Sequence, Structure and Databanks, edited by D. Higgins and W. Taylor, Oxford University Press, 2000.
C. Notredame, Recent progresses in multiple sequence alignment: a survey, Pharmacogenomics 3(1) (2002) 131–144.
N. Kaminski and N. Friedman, Practical approaches to analyzing results of microarray experiments, American Journal of Respiratory Cell and Molecular Biology 27 (2002) 125–132.
Lecture Schedule — Very Tentative
Week 1 (16 May): Course Outline. Biology and Genomics Introduction.
Weeks 2–5 (23 May – 13 June): Basic Algorithms on Sequences.
• Sequence properties, scanning.
• Alignment, multiple alignment, phylogeny.
Weeks 6–9 (20 June – 11 July): Sequence Analysis.
• Basics of sequencing projects.
• Sequence base calling, trimming, assembly.
• Sequence analysis: Blast, InterProScan, PSORT II.
• Secondary structure prediction.
Weeks 10–13 (18 July – 15 August): Microarray Data Analysis.
• Basics of microarray experiments.
• Normalization, differential expression.
• Clustering.
• Bayesian approaches.
Assignments
All course work is individual work.
Four programming assignments:
— must use C++
— basic algorithms from scratch using STL libraries
— advanced work may use existing libraries: EMBOSS, R, Bioconductor libraries
— submit written report for each
  — describe algorithm, design, testing, results
  — source code listing as appendix to report
Two data analysis assignments
— use available tools and libraries
— given datasets
— submit 5–10 page written report for each
Assignments due about every 2 weeks.
— precise schedule to come
Final Examination
Focus is on material in the 3 papers and the required background.
Must know basic concepts of genomics.
Must know major algorithms
— purpose, how it works, complexity, limitations
— data structures, heuristics
Should be able to compare major algorithms
Should be able to propose a new algorithm for a data analysis problem using the major existing algorithms and information resources
COMP 691R
Bioinformatics Algorithms
Lecture 1b
Background on Biology and Genomics
Life
Taxonomic Classification of Man
Superkingdom: Eukaryota
Kingdom: Metazoa
Phylum: Chordata
Class: Mammalia
Order: Primates
Family: Hominidae
Genus: Homo
Species: sapiens
Prokaryotes: unicellular organisms whose cells lack membrane-bound nuclei
— includes bacteria and archaea
Eukaryotes: organisms with membrane-bound nuclei in their cells
— includes plants, animals, fungi
Archaea: are prokaryotes
— have some genes resembling those of eukaryotes
— often live in extreme environments
The Structure of Organisms
Organs
Tissues
Cells
Cellular Compartments
The Gene Ontology has a category for Cellular Component which includes subcellular structures, locations, and macromolecular complexes:
GO:0005575 : cellular_component
  GO:0005623 : cell
    GO:0005627 : ascus
    GO:0030424 : axon
    GO:0005933 : bud
    GO:0000267 : cell fraction
    GO:0009986 : cell surface
    GO:0030425 : dendrite
    GO:0030312 : external encapsulating structure
    GO:0019861 : flagellum
    GO:0042601 : forespore
    GO:0005622 : intracellular
    GO:0016020 : membrane
    GO:0030496 : midbody
    GO:0042597 : periplasmic space
    GO:0030428 : septum
    GO:0005936 : shmoo
    GO:0030427 : site of polarized growth
The Cell
Each bacterium is enclosed by a rigid cell wall composed of a protein-sugar molecule (peptidoglycan).
The wall gives the cell its shape and surrounds the cytoplasmic membrane, protecting it from the environment.
The cytoplasm, or protoplasm, of bacterial cells is where the functions for cell growth, metabolism, and replication are carried out.
The nucleoid is a region of cytoplasm where the chromosomal DNA is located.
It is not a membrane-bound nucleus, but simply an area of the cytoplasm where the strands of DNA are found.
Most bacteria have a single, circular chromosome.
Ribosomes are microscopic “factories” found in all cells.
They translate the genetic code from the molecular language of nucleic acid to that of amino acids, the building blocks of proteins.
Plasmids are small, extrachromosomal genetic structures.
Like the chromosome, plasmids are made of a circular piece of DNA.
Unlike the chromosome, they are not involved in reproduction.
Pili (singular, pilus) are small hairlike projections.
They assist the bacteria in attaching to other cells and surfaces, such as teeth, intestines, and rocks.
Flagella (singular, flagellum) are hairlike structures that provide a means of locomotion.
The Cell
The nucleus is a highly specialized organelle that serves as the information and administrative center of the cell.
Mitochondria are oblong-shaped organelles that are found in the cytoplasm of every eukaryotic cell.
In the animal cell, they are the main power generators, converting oxygen and nutrients into energy.
The endoplasmic reticulum is a network of sacs that manufactures, processes, and transports chemical compounds for use inside and outside of the cell.
It is connected to the double-layered nuclear envelope, providing a connection between the nucleus and the cytoplasm.
The Golgi apparatus is the distribution and shipping department for the cell’s chemical products.
It modifies proteins and fats built in the endoplasmic reticulum and prepares them for export to the outside of the cell.
The Cell
The most important characteristic of plants is their ability to photosynthesize, i.e. make their own food by converting light energy into chemical energy.
This process is carried out in specialized organelles called chloroplasts.
The Cell — Viruses
DNA — Deoxyribonucleic acid
The Genes and Proteins
Genes are segments of DNA encoding information that ultimately directs the production of RNA molecules that serve a variety of functions, including:
— dictating the synthesis of proteins that perform a wide variety of functions in the body
— regulating (turning on or turning off) the expression of other genes
— forming structures in the cell (ribosomes) that are critical for the manufacture of proteins
— transporting amino acids to the ribosomes for the creation of proteins
DNA forms a double helix, with the backbone of each strand of the helix consisting of a repeating ...sugar-phosphate-sugar-phosphate... polymer.
– the sugar is deoxyribose
Attached to the sugar ring is one of four nitrogen-containing bases: adenine (A), guanine (G), cytosine (C), thymine (T).
The Genes and Proteins
The combination of one of these nitrogenous bases, a sugar molecule, and a phosphate molecule is called a nucleotide, the basic building block of the DNA molecule.
The double helix is held together by weak hydrogen bonds between each thymine and adenine base, as well as between each guanine and cytosine base.
Each of these pairs is called a base pair, or “bp” for short.
The two strands of DNA are complementary.
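Because A pairs with T and G pairs with C, one strand determines the other. The following is a minimal sketch of that idea, not part of the original slides. The function names are illustrative, and reversing the complement reflects the usual convention that the two strands run antiparallel and are both written 5' to 3', which the slides do not state.

```python
# Base pairing from the slide above: A pairs with T, G pairs with C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return strand.upper().translate(COMPLEMENT)


def reverse_complement(strand: str) -> str:
    """Return the opposite strand read in its own 5'-to-3' direction.

    Assumption: strands are antiparallel, so the complement is reversed.
    """
    return complement(strand)[::-1]
```

For example, `reverse_complement("ATGC")` gives `"GCAT"`, and applying the function twice returns the original strand.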
The Genes and Proteins
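Transcription is the synthesis of a messenger RNA (mRNA) from a DNA template; the mRNA is a complementary copy of the transcribed strand, with uracil (U) in place of thymine (T).
Translation is the synthesis of proteins by the ribosomes, which interpret the genetic code on the mRNA.
Genetic information is encoded in sequences of three nucleotides termed codons.
Each codon represents one of 20 amino acids or a STOP signal; since there are 64 possible codons, most amino acids are encoded by more than one codon.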
A protein is a polypeptide consisting of amino acids.
IUPAC Standard Codes
IUPAC nucleotide codes
Nucleotide code    Base
---------------    ----
A                  Adenine
C                  Cytosine
G                  Guanine
T (or U)           Thymine (or Uracil)
R                  A or G
Y                  C or T
S                  G or C
W                  A or T
K                  G or T
M                  A or C
B                  C or G or T
D                  A or G or T
H                  A or C or T
V                  A or C or G
N                  any base
. or -             gap
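The ambiguity codes in the table are useful for describing sequence motifs. The sketch below, which is an illustration rather than anything from the course, stores the table as a dictionary and scans a DNA sequence for positions where an IUPAC pattern matches. The names `IUPAC`, `matches`, and `find_sites` are hypothetical, the gap symbols are left out, and the sequence is assumed to contain only A, C, G, T.

```python
# Each IUPAC code maps to the set of concrete bases it stands for
# (taken from the table above; gap symbols are deliberately omitted).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "GC", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def matches(pattern: str, sequence: str) -> bool:
    """True if every base of sequence is allowed by the code at that position."""
    if len(pattern) != len(sequence):
        return False
    return all(
        base in IUPAC[code]
        for code, base in zip(pattern.upper(), sequence.upper())
    )


def find_sites(pattern: str, sequence: str) -> list[int]:
    """Return the 0-based start positions where pattern matches sequence."""
    k = len(pattern)
    return [
        i for i in range(len(sequence) - k + 1)
        if matches(pattern, sequence[i:i + k])
    ]
```

This naive scan takes time proportional to pattern length times sequence length, which is adequate for short motifs.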
IUPAC Standard Codes
IUPAC amino acid codes
IUPAC amino    Three-letter
acid code      code            Amino acid
-----------    ------------    ----------
A              Ala             Alanine
C              Cys             Cysteine
D              Asp             Aspartic Acid
E              Glu             Glutamic Acid
F              Phe             Phenylalanine
G              Gly             Glycine
H              His             Histidine
I              Ile             Isoleucine
K              Lys             Lysine
L              Leu             Leucine
M              Met             Methionine
N              Asn             Asparagine
P              Pro             Proline
Q              Gln             Glutamine
R              Arg             Arginine
S              Ser             Serine
T              Thr             Threonine
V              Val             Valine
W              Trp             Tryptophan
Y              Tyr             Tyrosine
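Sequence databanks usually store proteins with the one-letter codes, while figures and papers often use the three-letter codes. As a small illustration, the table above can be held in a dictionary and used to convert between the two. The names `THREE_LETTER` and `to_three_letter` are made up for this sketch.

```python
# One-letter to three-letter amino acid codes, copied from the table above.
THREE_LETTER = {
    "A": "Ala", "C": "Cys", "D": "Asp", "E": "Glu", "F": "Phe",
    "G": "Gly", "H": "His", "I": "Ile", "K": "Lys", "L": "Leu",
    "M": "Met", "N": "Asn", "P": "Pro", "Q": "Gln", "R": "Arg",
    "S": "Ser", "T": "Thr", "V": "Val", "W": "Trp", "Y": "Tyr",
}


def to_three_letter(protein: str, sep: str = "-") -> str:
    """Rewrite a one-letter protein sequence using three-letter codes."""
    return sep.join(THREE_LETTER[aa] for aa in protein.upper())
```

For example, `to_three_letter("MKV")` gives `"Met-Lys-Val"`. A letter outside the 20 standard codes raises `KeyError`, which is one simple way to catch malformed input.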
Genetic Code
First                Second Position               Third
Position      U       C       A       G            Position
-----------------------------------------------------------
              Phe     Ser     Tyr     Cys          U
U             Phe     Ser     Tyr     Cys          C
              Leu     Ser     STOP    STOP         A
              Leu     Ser     STOP    Trp          G
-----------------------------------------------------------
              Leu     Pro     His     Arg          U
C             Leu     Pro     His     Arg          C
              Leu     Pro     Gln     Arg          A
              Leu     Pro     Gln     Arg          G
-----------------------------------------------------------
              Ile     Thr     Asn     Ser          U
A             Ile     Thr     Asn     Ser          C
              Ile     Thr     Lys     Arg          A
              MET     Thr     Lys     Arg          G
-----------------------------------------------------------
              Val     Ala     Asp     Gly          U
G             Val     Ala     Asp     Gly          C
              Val     Ala     Glu     Gly          A
              Val     Ala     Glu     Gly          G
-----------------------------------------------------------
MET (Methionine) is encoded by AUG, which is also the START codon.
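The genetic code table can be turned directly into a lookup structure. The sketch below builds the 64-entry table in the same order as the slide (first, second, third position, each running over U, C, A, G) and uses it to translate an mRNA from the first AUG to the first STOP codon. It is a simplified illustration under stated assumptions: `transcribe` takes the coding (non-template) strand and only swaps T for U, a single reading frame is used, and the function names are not from the course.

```python
from itertools import product

BASES = "UCAG"
# One-letter amino acids in table order; "*" marks a STOP codon.
AMINO = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def transcribe(coding_strand: str) -> str:
    """Return the mRNA for a DNA coding strand (assumption: T becomes U)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate from the first AUG up to, not including, the first STOP."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon found
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)
```

For example, `translate(transcribe("ATGGCCTAA"))` reads the codons AUG, GCC, UAA and gives `"MA"`.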
Transcription and Translation
Cell Processes
The activity of organisms is determined by the mechanisms within cells, between cells, and with the environment.
These include
— cell life cycle of reproduction
— repair and maintenance of cell structures
— development of the organism
— transcription, translation, and its regulation
— biosynthesis of compounds for the cell
— metabolism of matter from the environment
— transport of matter
— recovery of energy and matter (catabolism)
The Gene Ontology category biological_process has the following terms:
GO:0008150 : biological_process
  GO:0007610 : behavior
  GO:0000004 : biological_process unknown
  GO:0009987 : cellular process
  GO:0007275 : development
  GO:0008371 : obsolete
  GO:0007582 : physiological processes
  GO:0016032 : viral life cycle
Cell Processes
GO:0009987 : cellular process
  GO:0007154 : cell communication
  GO:0008219 : cell death
  GO:0030154 : cell differentiation
  GO:0008151 : cell growth and/or maintenance
  GO:0006928 : cell motility
  GO:0006944 : membrane fusion
Cell Processes
GO:0007275 : development
  GO:0009838 : abscission
  GO:0007568 : aging
  GO:0030154 : cell differentiation
  GO:0007349 : cellularization
  GO:0009790 : embryonic development
  GO:0009908 : flowering
  GO:0009292 : genetic transfer
  GO:0040007 : growth
  GO:0007320 : insemination
  GO:0002164 : larval development
  GO:0009933 : meristem organization
  GO:0009653 : morphogenesis
  GO:0007389 : pattern specification
  GO:0009791 : post-embryonic development
  GO:0040029 : regulation of gene expression, epigenetic
  GO:0000003 : reproduction
  GO:0009835 : ripening
  GO:0007530 : sex determination
  GO:0007548 : sex differentiation
  GO:0019827 : stem cell maintenance
Cell Processes — physiological processes
GO:0046717 : acid secretion
GO:0008218 : bioluminescence
GO:0046849 : bone remodeling
GO:0009758 : carbohydrate utilization
GO:0008151 : cell growth and/or maintenance
GO:0008015 : circulation
GO:0010031 : circumnutation
GO:0000746 : conjugation
GO:0016265 : death
GO:0009900 : dehiscence
GO:0007586 : digestion
GO:0030146 : diuresis
GO:0007588 : excretion
GO:0030198 : extracellular matrix organization and biogenesis
GO:0009844 : germination
GO:0007599 : hemostasis
GO:0042592 : homeostasis
GO:0020021 : host cell immortalization
GO:0020012 : immune evasion
GO:0007595 : lactation
GO:0008152 : metabolism
GO:0042303 : molting cycle
GO:0009877 : nodulation
GO:0009935 : nutrient uptake
GO:0007584 : nutritional response pathway
GO:0018987 : osmoregulation
GO:0007567 : parturition
GO:0009405 : pathogenesis
GO:0030432 : peristalsis
GO:0015979 : photosynthesis
GO:0009856 : post-pollination
GO:0007565 : pregnancy
GO:0007585 : respiratory gaseous exchange
GO:0009719 : response to endogenous stimulus
GO:0009605 : response to external stimulus
GO:0006950 : response to stress
GO:0030431 : sleep
GO:0001659 : thermoregulation
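In the listings above, some terms appear under more than one parent: GO:0030154 (cell differentiation) is listed under both cellular process and development, and GO:0008151 (cell growth and/or maintenance) under both cellular process and physiological processes. The ontology is therefore a directed acyclic graph rather than a tree. The sketch below stores a few of these terms as a child-to-parents map and collects all ancestors of a term. The map is a small hand-picked subset of the identifiers in these slides, which may differ from the current Gene Ontology release, and the names `PARENTS` and `ancestors` are illustrative.

```python
# Child -> parents, for a small subset of the terms listed in the slides.
PARENTS = {
    "GO:0009987": ["GO:0008150"],  # cellular process
    "GO:0007275": ["GO:0008150"],  # development
    "GO:0007582": ["GO:0008150"],  # physiological processes
    "GO:0030154": ["GO:0009987", "GO:0007275"],  # cell differentiation
    "GO:0008151": ["GO:0009987", "GO:0007582"],  # cell growth and/or maintenance
    "GO:0008152": ["GO:0007582"],  # metabolism
}


def ancestors(term: str) -> set[str]:
    """Return every term reachable from term by following parent links."""
    seen: set[str] = set()
    stack = [term]
    while stack:
        for parent in PARENTS.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen
```

Tracking visited terms in `seen` matters in a graph with shared ancestors: the root GO:0008150 is reached along two paths from GO:0030154 but is processed only once.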
Metabolism — Food to Waste in Animals
Metabolism — Glycolysis
Catabolism of Proteins and Amino Acids
Regulation
Gene Regulatory Networks
Cell Differentiation into Tissues
Protein Structure
Central Dogma
Amino acid sequence
determines protein structure
determines enzyme function
An active site is where the interaction takes place.
Protein Structure
Proteins are polymers of amino acids.
Amino acids are primary amines that contain an alpha carbon that is connected to an amino (NH2) group, a carboxyl group (COOH), and a variable side group (R).
Polymers of amino acids are created by linking the amino group of one amino acid to the carboxyl group of another amino acid.
This is termed a peptide bond.
The 20 Amino Acids
Post-genomics Studies