Upload: ross-foster

Post on 22-Dec-2015


Application of Bioinformatics in Genetics Research

Instructors:

Dr. Henry Baker

Dr. Luciano Brocchieri

Drs. Michele Tennant & Rolando Milian

Dr. Lei Zhou

Course web page: Sakai/UFL for lecture notes and homework; http://159.178.28.30/GMS6014/home.htm for classroom practice.

Application of Bioinformatics in Genetic Research

Time and location:

M. W. F. : 1:00-2:00 in CGRC.

Except: 1/14 & 1/16 in HSCL C2-3.

Evaluation

• 50% classroom participation

• 50% homework

History of bioinformatics – sequence analysis

• Sequence comparison

• Similarity search

• Phylogenetic analysis

• Structure prediction

• Gene prediction

Bioinformatics in the post genome era

• Information Representation

- Many new types of data, such as function, location, interaction, regulatory pathway, and expression profile, need to be recorded.

• Data Management

- Infrastructure for inputting, managing, accessing, and retrieving relevant information in a “sea of databases”. Cloud computing.

• Systematics

The opportunity provided by genome sequence and genomic / proteomic technology is matched by the challenge to bioinformatics / computational biology.

Bioinformatics in the post genome era

• Whole genome sequencing - SNPs and genome-wide association studies.

• Genomic/proteomic expression profiling (RNA and protein levels).

• Epigenomics, Comparative genomics, …

• Regulatory pathway simulation – systems biology.

$1,000 genome and … $500,000 analysis?

Objectives of GMS6014

• Basic skills for retrieving and storing data, using web-based applications.

• Ability to install and run standalone local applications.

• Understanding the basis of bioinformatics applications using sequence similarity search as the example.

• A brief survey of available bioinformatics tools for HTS analysis and introduction to functional genomics and systems biology.

Sequence Representation - nucleotide

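N G R C W T G Y C Y

(IUPAC ambiguity codes: N = A, C, G, or T; R = A or G; W = A or T; Y = C or T. One sequence this pattern can represent: A G A C A T G C C C.)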

For the complete list, see Table 2.1, Mount, 2nd Ed.

Or http://www.ncbi.nlm.nih.gov/blast/fasta.shtml
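Sketch (Python, not part of the original slides): one way to work with ambiguity codes is to translate a pattern into a regular expression. The table below uses the standard IUPAC nucleotide codes; the function names `to_regex` and `matches` are illustrative choices only.

```python
import re

# Standard IUPAC nucleotide ambiguity codes and the bases each may stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def to_regex(pattern: str) -> str:
    """Translate an IUPAC pattern into an equivalent regular expression."""
    parts = []
    for code in pattern.upper():
        bases = IUPAC[code]
        # Unambiguous codes stay literal; ambiguous ones become a character class.
        parts.append(bases if len(bases) == 1 else f"[{bases}]")
    return "".join(parts)


def matches(pattern: str, sequence: str) -> bool:
    """Return True if the whole sequence is one of the sequences the pattern allows."""
    return re.fullmatch(to_regex(pattern), sequence.upper()) is not None


print(to_regex("NGRCWTGYCY"))
print(matches("NGRCWTGYCY", "AGACATGCCC"))
```

The same regular expression can be passed to `re.finditer` to scan a longer sequence for every site that fits the pattern.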

Sequence Representation - amino acids

Q: What is the common property of these amino acids?

1. D, E

2. I, L, V, M, F

3. A, S, P

Sequence Representation - amino acids

Example:

Coloring based on aa property.

W D L L A Q I L C Y A L R I Y

W R F L A T V V L E T L R Q Y

W K F L A I T M C K V L K Q F

R C L L C N K L Y Y L L R K V

L N R L L A E L Y E V L C H I

L R L L Q Q Q Q M V L Q R Q Y
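Sketch (Python, not part of the original slides): coloring by property amounts to mapping each residue to a class and then looking down each alignment column. The grouping below is one simplified scheme chosen for illustration; other groupings are in common use, and the names `PROPERTY_GROUPS`, `classify`, and `conserved_columns` are hypothetical.

```python
# One simplified grouping of amino acids by property (an assumption for
# illustration; textbooks and alignment viewers differ in the details).
PROPERTY_GROUPS = {
    "acidic": "DE",
    "basic": "KRH",
    "hydrophobic": "ILVMF",
    "aromatic": "WY",
    "small": "AGSPT",
    "amide": "NQ",
    "cysteine": "C",
}

# Reverse lookup: residue letter -> property class.
RESIDUE_CLASS = {
    residue: group
    for group, residues in PROPERTY_GROUPS.items()
    for residue in residues
}

# The six aligned sequences from the slide, with spaces removed.
ALIGNMENT = [
    "WDLLAQILCYALRIY",
    "WRFLATVVLETLRQY",
    "WKFLAITMCKVLKQF",
    "RCLLCNKLYYLLRKV",
    "LNRLLAELYEVLCHI",
    "LRLLQQQQMVLQRQY",
]


def classify(residue: str) -> str:
    """Return the property class of a one-letter amino acid code."""
    return RESIDUE_CLASS[residue.upper()]


def conserved_columns(rows):
    """Return 0-based indices of columns whose residues all share one class."""
    conserved = []
    for index, column in enumerate(zip(*rows)):
        if len({classify(residue) for residue in column}) == 1:
            conserved.append(index)
    return conserved


print(conserved_columns(ALIGNMENT))
```

A viewer would assign one color per class; here the class names stand in for the colors.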


Representation of sequence – sequence file format

1.) FASTA – simple and clean

> gene_name, (other info)

MASASASKJHKLJLKJLDSDFSF

SSDSASFSFD…

Practice / DIY: retrieve a sequence in FASTA format and save the file on the local computer.
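Sketch (Python, not part of the original slides): the FASTA layout above is simple enough to read with a few lines of code. This minimal reader assumes a header line starting with “>” followed by one or more sequence lines; `parse_fasta` is an illustrative name, and the example sequence is the slide's placeholder with its trailing “…” left out.

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Read FASTA-formatted lines into a {header: sequence} dictionary."""
    chunks: Dict[str, List[str]] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()  # drop the ">" marker
            chunks[header] = []
        elif header is not None:
            chunks[header].append(line)  # sequence may span several lines
    return {name: "".join(parts) for name, parts in chunks.items()}


example = """>gene_name, (other info)
MASASASKJHKLJLKJLDSDFSF
SSDSASFSFD
"""

records = parse_fasta(example.splitlines())
for name, sequence in records.items():
    print(name, len(sequence))
```

To read a saved file instead, pass an open file handle: `parse_fasta(open("myfile.txt"))`, where the file name is whatever was used when saving.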

How to store sequence files

• .txt format is clean and allows downstream sequence analysis.

• .doc or .rtf allows formatting during annotation; however, extra information is inserted, so these formats are NOT suitable for computational analysis.

Practice – file types

• Open Windows Explorer (on your own computer), or IE with “C:\” in the address bar.

• Change “Tools → Folder Options” so that the file extensions (.xxx) are revealed.

• Edit the downloaded sequence file in MS Word, highlight a section of the sequence with Bold font or color and save as .doc

• Open the .doc file in NotePad – observe the inserted characters.

Practice – file types (Cont.)

• Load the “Mysequence.doc” file to Webcutter using “Choose file” and then “Upload sequence file”.

- Notice that the “sequence” in the sequence box consists of nonsense characters.

• Clear input; browse and then load the .txt file. Run an analysis.

Always keep your sequences in .txt files for downstream analysis.
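Sketch (Python, not part of the original slides): the same check can be done programmatically before handing a file to an analysis tool. This minimal test assumes a FASTA file containing only ASCII letters plus “*” and “-” in its sequence lines; `is_plain_fasta` is a hypothetical helper name.

```python
import string

# Characters accepted in a sequence line: letters plus "*" (stop) and "-" (gap).
ALLOWED = set(string.ascii_letters + "*-")


def is_plain_fasta(data: bytes) -> bool:
    """Return True if the bytes look like a plain-text FASTA file."""
    try:
        text = data.decode("ascii")
    except UnicodeDecodeError:
        return False  # binary content, e.g. a Word .doc file
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        return False  # FASTA must begin with a ">" header line
    # Every remaining line is either another header or pure sequence characters.
    return all(line.startswith(">") or set(line) <= ALLOWED for line in lines)


print(is_plain_fasta(b">gene_name\nMASASASK\n"))
print(is_plain_fasta(b"{\\rtf1\\ansi >gene_name\\par MASASASK}"))
```

To check a saved file, read it in binary mode and pass the bytes: `is_plain_fasta(open(path, "rb").read())`.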

Representation of sequence

The need to include annotations and functional information with each sequence.

• Structured data entry

• GenBank

• EMBL / SwissProt

Observe: The differences in data structure among SwissProt, NCBI Protein, and NCBI Gene.

Representation of sequence

The need to represent associated info with sequence

• Structured data entry

• Specialized databases

- 3-D structure

- Mutation / diseases

- Protein family / protein domain

- Interaction

- Pathway …

Representation of sequence

The need to represent associated info with sequence

• Structured data entry

• Specialized databases

• Complex / customized data structure

- Object-oriented data representation (Mount, p44-45)
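Sketch (Python, not part of the original slides or of Mount's text): in an object-oriented representation, a sequence and its annotations travel together as one object. The class and field names below (`SequenceRecord`, `Feature`, `accession`, and so on) and the example values are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Feature:
    """An annotated region of a sequence (coordinates are 1-based, inclusive)."""
    kind: str
    start: int
    end: int


@dataclass
class SequenceRecord:
    """A sequence bundled with its annotations."""
    accession: str
    name: str
    organism: str
    sequence: str
    features: List[Feature] = field(default_factory=list)

    def feature_sequence(self, feature: Feature) -> str:
        """Return the residues covered by a feature."""
        return self.sequence[feature.start - 1:feature.end]

    def to_fasta(self) -> str:
        """Export only the pure sequence, for downstream analysis."""
        return f">{self.accession} {self.name} [{self.organism}]\n{self.sequence}\n"


# Placeholder values, not a real database entry.
record = SequenceRecord(
    accession="ACC0001",
    name="example_gene",
    organism="Homo sapiens",
    sequence="MASASASKDSDFSF",
    features=[Feature(kind="signal peptide", start=1, end=5)],
)

print(record.feature_sequence(record.features[0]))
print(record.to_fasta())
```

The `to_fasta` method mirrors the split between annotated records and pure sequence files: the object keeps everything, and the export drops the annotations.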

Public Resources for Bioinformatics

•Databases

•Analysis Tools

Observe: The lists of databases and services at NCBI, EBI, KEGG, and Ensembl.

What can we know about this gene? Search the “curated” databases. To prepare for future analysis, save annotated sequence files as genename.html (in a target folder).

For downstream sequence analysis, save the pure sequence as a FASTA-format file.

Pet Project: TNF, or your favorite gene

Where is information available for my gene, and how much?

Observe: The information content and presentation format for the same gene in SwissProt, NCBI Protein, NCBI Gene, etc.

Public Resources (I) – Databases and data sources

• Over 1,000 in the sea of databases.

• Content-specific, such as DNA, Protein, Structure, etc.

• Species-specific, such as FlyBase, WormBase, OMIM, etc.

• System-specific, such as MetaCyc, AFCS, etc.

Database concept:

Database - efficiently store, update, and retrieve information (data).

Database management systems – Access, Sybase, MySQL, Oracle, etc.

Types of Databases – Relational DB, Object DB, native XML DB.

Database concept – tables in relational databases

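Protein table

Accession | Organ. | Ref. | Name | Key words | Features | …
…. | ….. | medline1 | TNF | ….. | ……. | …..
…. | …. | medline2 | P53 | …. | …….. | ……

Search: “TNF” = TNF[All Fields] (matches any column) vs. TNF[Name] (matches the Name column only)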
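Sketch (Python, not part of the original slides): the difference between an all-fields search and a field-restricted search can be shown on a toy protein table. The rows, accessions, and key words below are placeholders, and `search` is a hypothetical helper; real databases index fields in more elaborate ways.

```python
import re
from typing import List, Optional

# Toy protein table: each dictionary is one row, each key is one column.
PROTEIN_TABLE = [
    {"accession": "acc1", "name": "TNF", "key_words": "cytokine", "ref": "medline1"},
    {"accession": "acc2", "name": "P53", "key_words": "tumor suppressor", "ref": "medline2"},
    {"accession": "acc3", "name": "TNFRSF1A", "key_words": "TNF receptor", "ref": "medline1"},
]


def search(term: str, field: Optional[str] = None) -> List[str]:
    """Return accessions of rows containing the term as a whole word.

    With field=None every column is searched (like TNF[All Fields]);
    otherwise only the named column is searched (like TNF[Name]).
    """
    wanted = term.lower()
    hits = []
    for row in PROTEIN_TABLE:
        values = row.values() if field is None else [row[field]]
        words = set()
        for value in values:
            words.update(re.findall(r"[a-z0-9]+", value.lower()))
        if wanted in words:
            hits.append(row["accession"])
    return hits


print(search("TNF"))          # all columns
print(search("TNF", "name"))  # Name column only
```

The all-fields search also returns the receptor entry, because “TNF” appears in its key words; restricting the search to the Name column narrows the result to the gene itself.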

Database concept – relationship between tables

Protein table

Accession | Organ. | Ref. | Name | Key words | Features | …
…. | ….. | medline1 | P27 | ….. | ……. | …..
…. | …. | medline2 | P53 | …. | …….. | ……

Reference table

ID | title | year | author | abstract | …
medline1 | ….. | 1970 | …. | ….. | …..
medline2 | …. | 1980 | …. | …. | …
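Sketch (Python, not part of the original slides): the link between the two tables is the shared reference ID, which a relational database follows with a join. This example uses Python's built-in `sqlite3` module with an in-memory database; the table names, column names, accessions, and titles are placeholders, while the IDs, protein names, and years follow the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

conn.executescript("""
CREATE TABLE reference_table (
    id    TEXT PRIMARY KEY,
    title TEXT,
    year  INTEGER
);
CREATE TABLE protein_table (
    accession TEXT PRIMARY KEY,
    name      TEXT,
    ref       TEXT REFERENCES reference_table(id)
);
INSERT INTO reference_table VALUES ('medline1', 'placeholder title 1', 1970);
INSERT INTO reference_table VALUES ('medline2', 'placeholder title 2', 1980);
INSERT INTO protein_table VALUES ('acc1', 'P27', 'medline1');
INSERT INTO protein_table VALUES ('acc2', 'P53', 'medline2');
""")

# Follow the Ref. column of the protein table into the reference table.
rows = conn.execute(
    """
    SELECT p.name, r.id, r.year
    FROM protein_table AS p
    JOIN reference_table AS r ON p.ref = r.id
    WHERE p.name = ?
    """,
    ("P53",),
).fetchall()

print(rows)
```

Storing the reference once and pointing to it by ID means several proteins can share one citation without duplicating its title, authors, or abstract.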


Observe/Practice

• Search for TNF in the Gene database and the Nucleotide and Proteins databases.

• Search for TNF in “All Text” vs. “gene name” in the Gene database.

• Compare results.

• Download the human TNF nucleotide sequence.

• Download three protein sequences in FASTA format from the RefSeq search result and save as 3TNF.txt.
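Sketch (Python, not part of the original slides): the same searches can be expressed as query URLs for NCBI's E-utilities `esearch` service, which takes a database name (`db`) and a search term (`term`). The code only builds the URLs and does not contact NCBI. The field tags “All Fields” and “Gene Name” are assumptions about how the web form's “All Text” and “gene name” options map to query syntax, and `esearch_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db: str, term: str, field_tag: str = "") -> str:
    """Build an esearch URL, optionally restricting the term to one field."""
    query = f"{term}[{field_tag}]" if field_tag else term
    return f"{ESEARCH}?{urlencode({'db': db, 'term': query})}"


# Broad search vs. field-restricted search in the Gene database
# (field tag names are assumptions; check the database's own field list).
print(esearch_url("gene", "TNF", "All Fields"))
print(esearch_url("gene", "TNF", "Gene Name"))
```

Opening either URL in a browser returns the matching record IDs as XML, which makes it easy to compare how many hits each form of the query produces.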

Public Resources (II) – Analysis tools

Web-based analysis tools – easy to use, but often with fewer customization options.

Stand-alone analysis tools – require installation and configuration, but provide more customization options.

Commercial analysis tools

Scripting for bioinformatics projects

Practice: navigating the related resources through links

• Using the “PubMed” link, search annotated references on TNF.

• Using the “GEO Profiles” link, search gene expression information on TNF.

• Using the “Map Viewer” link, observe the chromosome location and gene structure of the TNF locus; change the “Map Viewer” options to include prediction of CpG islands.

Bioinformatics / Computational biology

• Bioinformatics - Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.

• Computational Biology - The development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.

(Working Definition of Bioinformatics and Computational Biology - July 17, 2000). NIH / BISTI