BIOINFORMATICS I EXERCISES

HUBERT HACKL, icbi.at/bioinf

Organization

• 3 exercises

• Short introduction, followed by lab work

• Lab report (in pairs of 2 students, electronic: doc, pdf ..)

• Submission of the exercises by 22 May 2014 at the latest

Dates

Exercise goals

• Getting to know biological databases (NCBI, …)

• Working with protein and DNA/RNA sequences

• Sequence alignment (BLAST)

• Working with genome browsers (UCSC, Ensembl)

• Solving practical examples with online analysis tools (not a programming exercise)

Biological information flow

Chromosome, chromatin, DNA

Symbol  Meaning            Description
R       A or G             puRine
Y       C or T             pYrimidine
W       A or T             Weak hydrogen bonds
S       G or C             Strong hydrogen bonds
M       A or C             aMino groups
K       G or T             Keto groups
H       A, C, or T (U)     not G (H follows G)
B       G, C, or T (U)     not A (B follows A)
V       G, A, or C         not T (U) (V follows U)
D       G, A, or T (U)     not C (D follows C)
N       G, A, C or T (U)   aNy nucleotide
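The ambiguity codes in the table can also be expanded programmatically. The following is a minimal Python sketch, assuming the DNA alphabet (T rather than U); the names `IUPAC`, `iupac_to_regex` and `matches` are illustrative and not part of the course material.

```python
import re

# IUPAC nucleotide ambiguity codes as listed in the table (DNA alphabet, T instead of U)
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG",
    "M": "AC", "K": "GT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT",
    "N": "ACGT",
}


def iupac_to_regex(pattern: str) -> str:
    """Translate an IUPAC pattern into a regular expression string."""
    return "".join(
        IUPAC[s] if len(IUPAC[s]) == 1 else "[" + IUPAC[s] + "]"
        for s in pattern.upper()
    )


def matches(pattern: str, sequence: str) -> bool:
    """Return True if the concrete sequence is covered by the IUPAC pattern."""
    return re.fullmatch(iupac_to_regex(pattern), sequence.upper()) is not None


print(iupac_to_regex("GAYRN"))    # GA[CT][AG][ACGT]
print(matches("GAYRN", "GATAC"))  # True
```

Symbols that are not in the table raise a `KeyError`, which is a deliberate choice for a sketch: an unknown symbol usually indicates a typo in the pattern.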

Nucleic acid nomenclature

Base      Symbol  Occurrence
Adenine   A       DNA, RNA
Guanine   G       DNA, RNA
Cytosine  C       DNA, RNA
Thymine   T       DNA
Uracil    U       RNA

+ strand 5´-ACGGTCGCTGTCGGTAGC-3´
- strand 3´-TGCCAGCGACAGCCATCG-5´

e.g. in FASTA format:

>gene sequence|gi12345|chr17|-
GCTACCGACAGCGACCGT
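A FASTA record consists of a header line starting with `>` followed by one or more sequence lines. As a rough illustration, the record from the slide can be read with a few lines of Python. The function name `parse_fasta` is hypothetical, and the `|`-separated header layout is taken from the slide example only; real database headers vary.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Parse FASTA text into a mapping from header (without '>') to sequence."""
    records: dict[str, str] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip empty lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line  # sequences may span several lines
    return records


example = """>gene sequence|gi12345|chr17|-
GCTACCGACAGCGACCGT
"""

records = parse_fasta(example)
for header, seq in records.items():
    # Header fields as used in the slide example: name, id, chromosome, strand
    print(header.split("|"), len(seq))
```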

DNA sequences are always written from 5´ to 3´
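Because sequences are written 5´ to 3´, the sequence of the - strand is the reverse complement of the + strand. A small Python sketch (the function name `reverse_complement` is illustrative) reproduces the FASTA sequence of the - strand from the + strand shown on the slide:

```python
# Translation table for complementary bases (upper and lower case)
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse so the result reads 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


plus_strand = "ACGGTCGCTGTCGGTAGC"
print(reverse_complement(plus_strand))  # GCTACCGACAGCGACCGT
```

This sketch only handles the four unambiguous bases; ambiguity codes would pass through uncomplemented.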

Positions in the genome (genome assembly) are given per chromosome

e.g. human GRCh37/hg19

chr11:1-100 chr11:49,686,777-49,689,777

Positions on the chromosome are counted from position 1 for both (!) strands

+ strand 5´-ACGGTCGCTG…………TCGGTAGC-3´
- strand 3´-TGCCAGCGAC…………AGCCATCG-5´

Position labels (identical for both strands): chr11:1 2523 2529
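Region strings such as `chr11:49,686,777-49,689,777` can be split into chromosome, start and end. The sketch below assumes 1-based, inclusive coordinates as displayed in genome browsers, so the length is `end - start + 1`; the names `parse_region` and `region_length` are illustrative.

```python
import re

# chromosome name, then start and end, optionally with thousands separators
REGION = re.compile(r"^(chr\w+):([\d,]+)-([\d,]+)$")


def parse_region(region: str) -> tuple[str, int, int]:
    """Split 'chr11:1-100' into ('chr11', 1, 100)."""
    m = REGION.match(region.strip())
    if m is None:
        raise ValueError(f"not a region string: {region!r}")
    chrom = m.group(1)
    start = int(m.group(2).replace(",", ""))
    end = int(m.group(3).replace(",", ""))
    if start < 1 or end < start:
        raise ValueError(f"invalid coordinates in {region!r}")
    return chrom, start, end


def region_length(region: str) -> int:
    """Length in bp, assuming 1-based inclusive coordinates."""
    _, start, end = parse_region(region)
    return end - start + 1


print(parse_region("chr11:49,686,777-49,689,777"))   # ('chr11', 49686777, 49689777)
print(region_length("chr11:1-100"))                  # 100
print(region_length("chr11:49,686,777-49,689,777"))  # 3001
```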

Nomenclature

Regulation of transcription

mRNA processing

Translation, genetic code and reading frames
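To make the idea of reading frames concrete, the following Python sketch translates a short DNA sequence in the three forward frames using the standard genetic code. The sequence `ATGGCCTAA` is a made-up toy example, and the names `CODON_TABLE` and `translate` are illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated in the order T, C, A, G at each position
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq: str, frame: int = 0) -> str:
    """Translate a DNA/RNA sequence starting at offset `frame` (0, 1 or 2).

    Stop codons are shown as '*', codons with unknown symbols as 'X'.
    Incomplete codons at the end are ignored.
    """
    seq = seq.upper().replace("U", "T")
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


toy = "ATGGCCTAA"  # toy sequence, not from a database
for frame in range(3):
    print(frame, translate(toy, frame))
# 0 MA*
# 1 WP
# 2 GL
```

The three reverse frames can be obtained by translating the reverse complement of the sequence in the same way.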

Peptide chain, amino acid sequence, proteins

Protein sequences are always written from the N-terminal end to the C-terminal end

backbone

sidechains

E.g. SCD sequence in FASTA format

1990  Start Human Genome Project (HGP)
      - complete human genome (HG)
      - 3,000,000,000 bp
      - 20 institutes
      - 2.5000 scientists

2001  First draft version of the HG published (Lander et al., Venter et al.)

2003  Final version of the HG published, end of the HGP

2003  Start ENCODE Project
      - Encyclopedia of DNA Elements
      - functional elements of the DNA

2008  Start 1000 Genomes Project
      - detailed catalog of genetic variation
      - 1000 anonymous donors

2010  Status ENCODE Project
      - final phase
      - data available through UCSC

2010  Status 1000 Genomes Project
      - 4 "highly covered" individuals
      - 1000 Genomes Browser

Projects

National Library of Medicine (NLM)
National Center for Biotechnology Information (NCBI)

• NIH (National Institutes of Health) campus in Bethesda, Maryland, USA (founded 1836, budget > 30 billion $)

• www.pubmed.gov

• Database developed to provide access to citations and abstracts of the biomedical literature

• 2012: 21 million entries from more than 5000 journals

• > 700 million online searches per year

PubMed

• Database for managing sequence data

• Freely accessible

• Daily data exchange with EBI and DDBJ

• New release every two months

• 2012: > 149 million sequences (137 billion bp)

• > 205,000 species

• > 1150 complete genomes

GenBank

• Text-based query system for > 30 databases
  – PubMed
  – OMIM
  – Nucleotide
  – Protein
  – Gene
  – dbSNP
  – GEO
  – ...

• Results are precomputed and linked

• More than 5,000,000 searches per day

• Batch mode available

• LinkOut service to external databases

Entrez
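Entrez can also be queried programmatically. The slides do not name an interface, so the following is only a sketch under the assumption that NCBI's E-utilities `esearch` endpoint is used; it builds the query URL without sending a request. The function name `esearch_url` and the default `retmax=20` are illustrative choices.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities (assumption: not mentioned in the slides)
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an esearch URL for an Entrez database such as 'pubmed' or 'gene'."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


print(esearch_url("pubmed", "sickle cell anemia"))
```

Opening the printed URL in a browser returns the matching record identifiers; for the exercises themselves the web interface is sufficient.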


RefSeq

• Best, comprehensive, non-redundant set of sequences

• For genomic DNA (NG_), transcript mRNA (NM_), other RNA (NR_) and protein (NP_)

• For major research organisms (2645 organisms)

• Based on GenBank derived sequences

• Ongoing curation by NCBI staff and collaborators, with review status indicated on each record (computationally derived records: XM_, XP_)
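The accession prefixes listed above can be used to recognize the record type at a glance. A minimal Python sketch, covering only the prefixes named on this slide; the accession strings in the example are hypothetical placeholders, not real records.

```python
# RefSeq accession prefixes as listed on the slide
REFSEQ_PREFIXES = {
    "NG_": "genomic DNA",
    "NM_": "transcript mRNA",
    "NR_": "other RNA",
    "NP_": "protein",
    "XM_": "transcript mRNA (computational)",
    "XP_": "protein (computational)",
}


def refseq_type(accession: str) -> str:
    """Classify an accession by its three-character prefix."""
    return REFSEQ_PREFIXES.get(accession[:3].upper(), "unknown prefix")


# Hypothetical accessions for illustration only
for acc in ["NM_123456.1", "XP_654321.2", "AB123456"]:
    print(acc, "->", refseq_type(acc))
```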

Gene

• One record represents one single gene from an organism

• Gene-specific information such as map, sequence, expression, structure, function, homology, publications, links

• Can have one or more RefSeq transcripts assigned (NM_)

• Official gene symbol and name, GeneID, aliases and other designations

• Online Mendelian Inheritance in Man

• Bibliographic, disease-centered compendium

• Originally in book form (MIM, Johns Hopkins University)

• Daily updates

• For physicians, scientists, students and educators

• Links to many databases (literature, sequences...)

OMIM

Insulin

• Polypeptide hormone

• Produced in the beta cells of the islets of Langerhans in the pancreas

• 51 amino acids (2 chains)

• A chain with 21 aa

• B chain with 30 aa

• Porcine insulin (1 aa different)

• Bovine insulin (3 aa different)

• Glucose transport into the cell and blood glucose regulation

• Inhibits lipolysis and promotes lipogenesis in fat cells

• Promotes glycogen synthesis in liver and muscle cells

Proinsulin

From preproinsulin to insulin

• Use of porcine and bovine insulin

• Antibody formation and allergic reactions possible

• Supply for one diabetic: 50 pancreata/year

• Genetically engineered production with recombinant DNA technology

• Different durations of action (e.g. dissociation of insulin hexamers) and insulin analogs

Insulin as a drug

Exercise 1-1: Find the differences between the insulin sequences of pig and human

1.2 Show that the C-peptide sequence is less conserved than the A-chain and B-chain
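The exercise is meant to be solved with online tools such as BLAST. Purely as an illustration of what "difference" and "conservation" mean, the following Python sketch compares two already aligned sequences of equal length position by position. The sequences are toy placeholders, not insulin data, and the function names are hypothetical.

```python
def differences(a: str, b: str) -> list[tuple[int, str, str]]:
    """List (position, residue in a, residue in b) for all mismatches (1-based)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b), start=1) if x != y]


def percent_identity(a: str, b: str) -> float:
    """Percentage of identical positions between two aligned sequences."""
    return 100.0 * (len(a) - len(differences(a, b))) / len(a)


# Toy sequences for illustration only
seq_1 = "ACDEFGHIKL"
seq_2 = "ACDEFGHIKV"
print(differences(seq_1, seq_2))       # [(10, 'L', 'V')]
print(percent_identity(seq_1, seq_2))  # 90.0
```

Applied separately to the A-chain, B-chain and C-peptide regions of an alignment, the percent identity gives a simple measure of how conserved each region is.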

Exercise 1-2: Find information on SICKLE CELL ANEMIA and KABUKI SYNDROME

2.1 Which genes/proteins are involved?
2.2 On which chromosome (arm, cytogenetic band) are the genes located?
2.3 What is the position and strand on the human reference genome assembly?
2.4 Can these genes also be found in the mouse (location)?
2.5 Are there common mutations, i.e. non-synonymous SNPs, known?
2.6 What is the function of the encoded proteins?
2.7 Find recent publications
