Page 1: Applied Bioinformatics - Vanderbilt Universitybioinfo.vanderbilt.edu/zhanglab/lectures/AB2011Lecture01.pdfWhat is bioinformatics Data driven science: the creation and advancement of

Applied Bioinformatics

Bing Zhang Department of Biomedical Informatics

Vanderbilt University

[email protected]

Course overview

  What is bioinformatics
    Data-driven science: the creation and advancement of databases, algorithms, and computational and statistical methods to solve theoretical and practical problems arising from the management and analysis of biological data.
    Major research areas: sequence alignment, gene finding, genome assembly, protein structure prediction, gene expression and regulation, protein interaction, drug design, genome-wide association studies, computational evolutionary biology, etc.

  Applied bioinformatics module
    Not a comprehensive guide to all facets of bioinformatics
    To equip you with the computational understanding and expertise needed to solve bioinformatics problems that you will likely encounter in your research.

Applied Bioinformatics, Spring 2011 2

http://www.ncbi.nlm.nih.gov/genbank/genbankstats.html

http://www.bioinformatics.ca/links_directory/

Course content and grades

Date   Subject                                                   Instructor  Homework (HW)
2/14   Finding information about genes                           Zhang
2/16   Navigating sequenced genomes                              Zhang
2/18   Pairwise sequence alignment and database search           Zhao
2/21   Multiple sequence alignment                               Zhao
2/23   Inferring phylogenetic relationships from sequence data   Zhao        HW I distribution (20 pts Zhao + 10 pts Zhang)
2/25   Protein sequence annotation                               Tabb
2/28   Protein structure prediction and visualization            Tabb        HW I due
3/2    Protein identification by mass spectrometry               Tabb        HW II distribution (20 pts Tabb)
3/4    Gene prediction and annotation                            Bush
3/7    Finding regulatory and conserved elements in DNA sequence Bush        HW II due
3/9    Assessing the impact of genetic variation                 Bush        HW III distribution (20 pts Bush)
3/11   Supervised analysis of gene expression data               Zhang
3/14   Unsupervised analysis of gene expression data             Zhang       HW III due
3/16   Functional interpretation of gene lists                   Zhang
3/18   Biological pathways                                       Zhang
3/21   Biological networks                                       Zhang       HW IV distribution (30 pts Zhang)
3/25                                                                         HW IV due by 5pm

HW assignments will be graded by each instructor for their respective sections.
Final grade = sum of the HW scores (100 pts in total). A: 85-100; B: 70-84; C: 55-69; D: 40-54; F: 0-39

Course materials and assignments

  Lecture slides available at https://medschool.mc.vanderbilt.edu/iidea/admin/admin_login.php before each lecture

  Homework assignments available at https://medschool.mc.vanderbilt.edu/iidea/admin/admin_login.php on the distribution date (2/23, 3/2, 3/9, 3/21)

  Homework assignments are due at 5pm on the due date (2/28, 3/7, 3/14, 3/25). There will be a 10% per day deduction for late reports.

  Email your reports in pdf, doc, or docx format to the corresponding instructor(s)
    HW I: [email protected]; [email protected]
    HW II: [email protected]
    HW III: [email protected]
    HW IV: [email protected]

  Textbook (optional): Dear, Paul H. (2007) Methods Express: Bioinformatics. Scion, ISBN 978-1904842163.

Finding information about genes

Bing Zhang Department of Biomedical Informatics

Vanderbilt University

[email protected]

When do we need gene information?

  Case 1
    From Prof. Randy Blakely (Pharmacology): “We have hit an uncharacterized gene in our hunt for SERT interacting proteins=****** that appears to be highly depleted when extracts are made from SERT KO mice. Can you help us come up with some ideas as to what this gene might be.”

  Case 2
    From Prof. Kevin Schey (Biochemistry): “I’ve attached a spreadsheet of our proteomics results comparing 5 Vehicle and 5 Aldosterone treated patients. We’ve included only those proteins whose summed spectral counts are >30 in one treatment group. Would it be possible to get the GO annotations for these? The Uniprot name is listed in column A and the gene name is listed in column R. If this is a time consuming task (and I imagine that it is), can you tell me how to do it?”

Resources

  Entrez Gene
    http://www.ncbi.nlm.nih.gov/gene
    NCBI/NIH
    All completely sequenced genomes
    One gene per page

  Ensembl BioMart
    http://www.ensembl.org/biomart/martview
    EMBL-EBI and Sanger Institute
    Vertebrates and other selected eukaryotic species
    Batch information retrieval

  GeneCards
    http://www.genecards.org
    Weizmann Institute of Science, Israel
    Comprehensive information on human genes

  WikiGenes
    http://www.wikigenes.org
    MIT
    Collaborative annotation in a wiki system

  GLAD4U
    http://bioinfo.vanderbilt.edu/glad4u
    Vanderbilt
    Genes related to a specific topic

Learning objectives

  To gain a basic understanding of the Entrez Gene system

  To be able to retrieve information for individual genes using Entrez Gene

  To gain a basic understanding of the Ensembl BioMart system

  To be able to retrieve information for a list of genes using Ensembl BioMart

Entrez Gene: overview

  Data source
    Automated analyses and curation by NCBI staff
    Data stored in flat files
    Updated continuously

  Unique gene identifier
    Entrez Gene uses unique integers (GeneID) as stable identifiers for genes, e.g. the GeneID for human tumor protein p53 (TP53) is 7157
    The GeneID assigned to each record is species specific, e.g. the GeneID for the mouse ortholog of TP53 (Trp53) is 22059

  Statistics as of February 2011
    7.2 million records distributed among 7039 taxa
    45,227 records for human

  Query system
    Entrez

Entrez Gene: Entrez

  An integrated search and retrieval system that provides access to many discrete databases at the NCBI website.

  All databases indexed by Entrez, including Entrez Gene, can be searched via a single query string

  Supports Boolean operators
    AND, OR, NOT

  Supports search term tags to limit the search to particular fields
    Title, organism, etc.

  Sample query
    transporter[title] AND ("Homo sapiens"[organism] OR "Mus musculus"[organism])
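To make the query syntax concrete, here is a minimal Python sketch that assembles the sample query from its parts. The helper names (`tagged`, `any_of`, `all_of`) are hypothetical and exist only for this illustration; they are not part of any NCBI tool. The sketch only builds the text you would paste into the Entrez search box.

```python
def tagged(term, field):
    """Return term[field]; multi-word terms are wrapped in double quotes."""
    if " " in term:
        term = f'"{term}"'
    return f"{term}[{field}]"


def any_of(*clauses):
    """Join clauses with OR and parenthesize them so they group as one unit."""
    return "(" + " OR ".join(clauses) + ")"


def all_of(*clauses):
    """Join clauses with AND."""
    return " AND ".join(clauses)


# Rebuild the sample query from the slide.
query = all_of(
    tagged("transporter", "title"),
    any_of(
        tagged("Homo sapiens", "organism"),
        tagged("Mus musculus", "organism"),
    ),
)
print(query)
```

The parentheses matter: without them the AND and OR clauses could be grouped differently from what was intended, so the organism alternatives are always grouped explicitly.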

Entrez Gene: search result

[Figure: Entrez Gene search result page, with callouts for Display Setting, Summary record, Advanced search, Filtering, Related data, and Help]

Entrez Gene: Gene record (I)

  Each Gene record integrates multiple types of information
    Gene type: tRNA, rRNA, snRNA, scRNA, snoRNA, miscRNA, protein-coding, pseudo, other, and unknown
    Nomenclature, summary descriptions, accessions of gene-specific and gene product-specific sequences, chromosomal location, reports of pathways and protein interactions, associated markers and phenotypes
    Links to other databases at NCBI including literature citations, sequences, variations, and homologs
    Links to databases outside of NCBI

Entrez Gene: Gene record (II)

[Figure: Gene record page at http://www.ncbi.nlm.nih.gov/gene/7157, with callouts for Help, Expand, Export, and New search]

Entrez Gene: advanced ways of accessing

  FTP download
    ftp://ftp.ncbi.nlm.nih.gov/gene/README

  E-Utilities (Entrez Programming Utilities)
    Server-side programs that provide a stable interface into the Entrez query and database system
    Use a fixed URL syntax that translates a standard set of input parameters into the values necessary for various NCBI software components to search for and retrieve the requested data, including nucleotide and protein sequences, gene records, three-dimensional molecular structures, and the biomedical literature.
    Work with any computer language that can send a URL to the E-utilities server and interpret the XML response, e.g. Perl, Python, Java, and C++.
    Combining E-utilities components to form customized data pipelines within these applications is a powerful approach to data manipulation.
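The "fixed URL syntax" and "XML response" described above can be sketched in a few lines of Python. This is a minimal illustration, not a complete client: it assumes the ESearch endpoint at `eutils.ncbi.nlm.nih.gov` with the `db`, `term`, and `retmax` parameters, and it parses a hand-written stand-in response (`SAMPLE_RESPONSE`, not real server output) so that it runs without network access. The function names are hypothetical.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Assumed base address of the E-utilities server.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL; urlencode escapes brackets and spaces in the query."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def parse_esearch(xml_text):
    """Return (total hit count, list of record IDs) from an ESearch XML reply."""
    root = ET.fromstring(xml_text)
    count = int(root.findtext("Count"))
    ids = [el.text for el in root.findall("./IdList/Id")]
    return count, ids


# Hand-written stand-in for a server reply, using the two GeneIDs from the slides.
SAMPLE_RESPONSE = """<eSearchResult>
  <Count>2</Count><RetMax>2</RetMax><RetStart>0</RetStart>
  <IdList><Id>7157</Id><Id>22059</Id></IdList>
</eSearchResult>"""

url = esearch_url("gene", "kinase[title] AND mouse[Organism] AND 1[Chromosome]")
print(url)
print(parse_esearch(SAMPLE_RESPONSE))
```

To run a live search, the URL could be fetched with `urllib.request.urlopen(url)` and the response body passed to `parse_esearch`; NCBI's usage guidelines on request rates apply.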

Entrez Gene: documentation and publications

http://www.ncbi.nlm.nih.gov/books/NBK3841/

Maglott et al. NAR, 39:D52-D57, 2011

Entrez Gene: exercise

  Questions
    1. How many records can we get for a simple search of “kinase” in Entrez Gene?
    2. Use Boolean operators and search term tags to search for mouse genes located on chromosome 1 and with kinase in the title. With the default display setting, what is the first hit?
    3. Click on the first hit and identify how many publications in PubMed are associated with this gene.
    4. Identify which proteins interact with the protein product of this gene.

  Answers
    1. 244,301 records
    2. Query term: kinase[title] AND mouse[Organism] AND 1[Chromosome]; the first hit is Epha4
    3. Bibliography section: 220 citations in PubMed
    4. Interactions section: 3 proteins, Epha4, Ngef, and Vav2

Ensembl

  Genome databases for vertebrates and other selected eukaryotic species
    Automated annotation system at EBI
    Data stored in a relational database
    Updated periodically with versions

  Unique gene identifier
    Ensembl uses unique strings (Ensembl gene ID) as stable identifiers for genes, e.g. the Ensembl gene stable ID for human tumor protein p53 (TP53) is ENSG00000141510
    The Ensembl gene ID assigned to each record is species specific, e.g. the Ensembl gene stable ID for the mouse ortholog of TP53 (Trp53) is ENSMUSG00000059552
    Clear gene, transcript, and protein relationship, e.g. ENSG00000141510 => 17 transcripts (e.g. ENST00000445888) => 13 proteins (e.g. ENSP00000391478)

  Statistics as of February 2011 (version 61)
    55 species
    53,630 genes for human

  Other species available in the recently expanded system EnsemblGenomes
    http://www.ensemblgenomes.org
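The structure of the stable IDs listed above (an `ENS` prefix, an optional species code, a feature letter, and an 11-digit number) can be read off programmatically. The following Python sketch is an illustration only: the lookup tables cover just the two species and the feature types that appear in this lecture plus exons, and the function name `describe` is hypothetical.

```python
import re

# ENS + optional species code + feature letter + 11 digits.
STABLE_ID = re.compile(r"^ENS(?P<species>[A-Z]*?)(?P<feature>[GTPE])(?P<number>\d{11})$")

# Deliberately small tables: only what this lecture mentions (plus exons).
FEATURES = {"G": "gene", "T": "transcript", "P": "protein", "E": "exon"}
SPECIES = {"": "Homo sapiens", "MUS": "Mus musculus"}


def describe(stable_id):
    """Return (species, feature type) for an Ensembl stable ID."""
    match = STABLE_ID.match(stable_id)
    if match is None:
        raise ValueError(f"not a recognized Ensembl stable ID: {stable_id!r}")
    code = match["species"]
    species = SPECIES.get(code, f"unlisted species code {code!r}")
    return species, FEATURES[match["feature"]]


for example in ("ENSG00000141510", "ENSMUSG00000059552",
                "ENST00000445888", "ENSP00000391478"):
    print(example, describe(example))
```

Human IDs carry no species code, which is why `ENSG...` maps to the empty key in `SPECIES`, while the mouse ortholog carries `MUS`.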

BioMart: a batch information retrieval system

  BioMart is a query-oriented data management system.

  Batch information retrieval for complex queries

  Particularly suited for providing 'data mining' like searches of complex descriptive data such as those related to genes and proteins

  Open source and can be customized

  Originally developed for the Ensembl genome databases

  Adopted by many other projects including UniProt, InterPro, Reactome, Pancreatic Expression Database, and many others (see a complete list and get access to the tools from http://www.biomart.org/ )

BioMart: basic concepts

  Dataset

  Filter

  Attribute

  From Prof. Kevin Schey (Biochemistry): “I’ve attached a spreadsheet of our proteomics results comparing 5 Vehicle and 5 Aldosterone treated patients. We’ve included only those proteins whose summed spectral counts are >30 in one treatment group. Would it be possible to get the GO annotations for these? The Uniprot name is listed in column A and the gene name is listed in column R. If this is a time consuming task (and I imagine that it is), can you tell me how to do it?”

  From all human genes (dataset), select those with the listed UniProt IDs (filter), and retrieve their GO annotations (attributes).
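The three concepts map directly onto the XML query format that BioMart servers accept. The Python sketch below builds such a query for the proteomics request; it is an illustration under stated assumptions, not the method used in the lecture. The internal names `hsapiens_gene_ensembl`, `uniprotswissprot`, `go_id`, and `name_1006` are assumptions that may differ between Ensembl releases, the function name `biomart_query` is hypothetical, and P04637 (the UniProt accession of human p53) merely stands in for the spreadsheet's ID column.

```python
import xml.etree.ElementTree as ET


def biomart_query(dataset, filters, attributes):
    """Build a BioMart XML query string.

    dataset    -- internal dataset name (which genes to start from)
    filters    -- dict of filter name -> list of accepted values (which to keep)
    attributes -- list of attribute names (which columns to return)
    """
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "1",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    for name, values in filters.items():
        # Multiple values for one filter are passed as a comma-separated list.
        ET.SubElement(ds, "Filter", {"name": name, "value": ",".join(values)})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


# Assumed internal names; check them against the server's configuration.
xml_query = biomart_query(
    dataset="hsapiens_gene_ensembl",
    filters={"uniprotswissprot": ["P04637"]},
    attributes=["uniprotswissprot", "go_id", "name_1006"],
)
print(xml_query)
```

The resulting string is what the martview web interface assembles behind the scenes when you pick a dataset, set filters, and select attributes by hand.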

Ensembl BioMart analysis

  Choose dataset
    Choose database: Ensembl Genes 61
    Choose dataset: Homo sapiens genes (GRCh37)

  Set filters
    Gene: a list of genes/proteins identified by various database IDs (e.g. IPI IDs)
    Gene Ontology: filter for proteins with specific GO terms (e.g. cell cycle)
    Protein domains: filter for proteins with specific protein domains (e.g. SH2 domain)
    Region: filter for genes in a specific chromosome region (e.g. chr1 1:1000000 or 11q13)
    Others

  Select output attributes
    Gene annotation information in the Ensembl database, e.g. gene description, chromosome name, gene start, gene end, strand, band, gene name, etc.
    External data: Gene Ontology, IDs in other databases
    Expression: anatomical system, development stage, cell type, pathology
    Protein domains: SMART, Pfam, InterPro, etc.

Ensembl BioMart: query interface

[Figure: Ensembl BioMart query interface, with callouts for Choose dataset, Set filters, Select output attributes, Help, Results, Count, and Perl API]

Ensembl Biomart: sample output

[Figure: Ensembl BioMart sample output, with a callout for "Export all results to a file"]

Ensembl Biomart: documentation and publications

http://www.ensembl.org/info/website/tutorials/index.html

Smedley et al. BMC Genomics, 10:22, 2009

Ensembl Biomart analysis: exercise 1

  Question
    I have two Ensembl gene IDs, ENSG00000162367 and ENSG00000187048. How do I get their gene names from HGNC, IDs from EntrezGene, and any probes that contain these gene sequences from the Affymetrix microarray platform HC G110?

  Choose dataset
    Database: Ensembl Genes 61
    Dataset: Homo sapiens genes (GRCh37.p2)

  Set filters
    Under GENE: check the ID list limit box
    Select Header: Ensembl Gene IDs, then enter the gene IDs into the box.

  Select output attributes
    Select Features (default)
    Under EXTERNAL: External References, select 'HGNC Symbol' and 'EntrezGene ID'
    Under EXTERNAL: Microarray, select 'Affy HC G110'

  Click on Count and then Results

  Export all results to File, TSV
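An exported TSV file typically contains one row per gene and probe combination, so a gene with several probes spans several rows. The Python sketch below shows one way to collapse such a file into a probe list per gene. The column headers are assumptions about the export, the function name `probes_by_gene` is hypothetical, and the sample rows use made-up placeholder symbols, IDs, and probe names rather than real BioMart results; only the two Ensembl gene IDs come from the exercise.

```python
import csv
import io
from collections import defaultdict


def probes_by_gene(tsv_text, gene_col="Ensembl Gene ID", probe_col="Affy HC G110"):
    """Group probe IDs by gene; genes without probes map to an empty list."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    grouped = defaultdict(list)
    for row in reader:
        probes = grouped[row[gene_col]]  # creates the entry on first sight
        probe = row[probe_col]
        if probe and probe not in probes:
            probes.append(probe)
    return dict(grouped)


# Placeholder rows: symbols, Entrez IDs, and probe names are invented for the demo.
SAMPLE_TSV = (
    "Ensembl Gene ID\tHGNC symbol\tEntrezGene ID\tAffy HC G110\n"
    "ENSG00000162367\tSYMBOL_A\t0\tprobe_1_at\n"
    "ENSG00000162367\tSYMBOL_A\t0\tprobe_2_at\n"
    "ENSG00000187048\tSYMBOL_B\t0\t\n"
)

print(probes_by_gene(SAMPLE_TSV))
```

For a real export, the file contents can be read with `open(path, encoding="utf-8").read()` and passed in, after adjusting `gene_col` and `probe_col` to the actual header names.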

Ensembl Biomart analysis: exercise 2

  Question
    How can I get the 2 kb upstream sequences for all mouse genes on chromosome 1?

  Choose dataset
    Database: Ensembl Genes 61
    Dataset: Mus musculus genes (NCBIM37)

  Set filters
    Under REGION: check Chromosome, select 1

  Select output attributes
    Select Sequences
    Under SEQUENCES: select Flank (Gene)
    Under Upstream flank: check and enter 2000 into the box
    Under Header Information, Gene Information, check Description

  Click on Count (1916/36817) and then Results

  Export all results to File, FASTA format
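What "upstream" means depends on the strand of the gene: for a forward-strand gene the flank lies before the gene start, while for a reverse-strand gene it lies after the gene end in chromosome coordinates. The short Python sketch below works this out for 1-based inclusive coordinates. The function name and the example coordinates are invented for illustration, and clipping at the far end of the chromosome is left out because it would need the chromosome length.

```python
def upstream_flank(start, end, strand, length=2000):
    """Return (flank_start, flank_end) of the upstream region.

    Coordinates are 1-based and inclusive, with start <= end regardless of
    strand; strand is 1 for forward and -1 for reverse.
    """
    if strand == 1:
        # Upstream of a forward-strand gene: the bases just before its start.
        return max(1, start - length), start - 1
    if strand == -1:
        # Upstream of a reverse-strand gene: the bases just after its end.
        return end + 1, end + length
    raise ValueError("strand must be 1 or -1")


# Invented coordinates, for illustration only.
print(upstream_flank(10_000, 12_000, 1))   # forward-strand gene
print(upstream_flank(10_000, 12_000, -1))  # reverse-strand gene
print(upstream_flank(500, 900, 1))         # flank clipped at the chromosome start
```

For a reverse-strand gene, the sequence in that interval also has to be reverse-complemented to read 5' to 3' relative to the gene.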

Summary

  Entrez Gene
    http://www.ncbi.nlm.nih.gov/gene
    NCBI/NIH
    All completely sequenced genomes
    Data stored in flat files
    Updated continuously
    Unique gene identifier: Entrez Gene ID
    Query system: Entrez
    Output: one gene at a time

  Ensembl BioMart
    http://www.ensembl.org/biomart/martview
    EMBL-EBI and Sanger Institute
    Mainly vertebrates
    Data stored in a relational database
    Updated periodically with versions
    Unique gene identifier: Ensembl Gene ID
    Query system: BioMart
    Output: multiple genes at the same time
