Interdisciplinary Introductory Course in Bioinformatics

Yana Kortsarts (Computer Science Department), Robert Morris (Biology Department), Janine Utell (English Department), Widener University, Chester, PA


Post on 25-Feb-2016



Page 1: Interdisciplinary Introductory Course in Bioinformatics

Interdisciplinary Introductory Course in Bioinformatics

Yana Kortsarts, Computer Science Department

Robert Morris, Biology Department

Janine Utell, English Department

Widener University, Chester, PA

Page 2: Interdisciplinary Introductory Course in Bioinformatics

What is Bioinformatics?

Bioinformatics is a relatively new interdisciplinary field that integrates computer science, mathematics, biology, and information technology to manage, analyze, and understand biological, biochemical, and biophysical information.

Bioinformatics is a computational science and a subset of the larger field of computational biology.

Page 3: Interdisciplinary Introductory Course in Bioinformatics

Motivation

IS professionals must have strong analytical and critical thinking skills (IS 2002 Model Curriculum and Guidelines for Undergraduate Degree Programs in IS).

Introducing bioinformatics to CIS students will strengthen these required skills.

The course equips students with some of the following capabilities suggested in the IS 2002 guidelines:
  Creativity
  Application of both traditional and new concepts and skills
  Application development
  Problem-solving abilities
  Ability to communicate effectively (oral, written, and listening)

Page 4: Interdisciplinary Introductory Course in Bioinformatics

Motivation

Provides opportunities for students to become familiar with Python, one of the most widely used scripting languages.

Explores various data structures and algorithmic techniques traditionally not covered in other courses.

Helps students make connections between theoretical topics learned in core CS and CIS courses, such as Data Structures and Algorithms, and apply their knowledge to real-world biology problems.

Helps diversify the department's course offerings and provides interdisciplinary opportunities for CS and CIS students.

A bioinformatics background can enhance the employment qualifications of CIS and CS students in a competitive job market.

Page 5: Interdisciplinary Introductory Course in Bioinformatics

Challenges

Students have different backgrounds

Choosing a programming language

Defining course prerequisites

Defining course content:
  Programming
  Algorithms
  End-user bioinformatics tools

Balancing the course:
  Content
  Hands-on work versus lecture

Interdisciplinary nature

Page 6: Interdisciplinary Introductory Course in Bioinformatics

Course Development

First iteration: Spring 05; second iteration: Spring 08

Cross-listed upper-level technical elective

Prerequisites:
  Biology/Chemistry/Biochemistry majors: Introduction to Molecular Biology
  CS/CIS/MATH majors: Introduction to Computer Science I
  Chemical Engineering majors: Computer Programming and Engineering Problem Solving

Team teaching: Biology and Computer Science faculty

4 credits, 6 hours: 4 CS, 2 Biology

Spring 05 enrollment: 6 Biology students, 6 CS/CIS students

Spring 08 enrollment: 6 CS/CIS students

Page 7: Interdisciplinary Introductory Course in Bioinformatics

Course Objectives and Goals

To integrate bioinformatics algorithms into the course and to teach the foundations of the algorithms and important results in bioinformatics.

To introduce students to the Python programming language. The Biopython Project is an international association of developers of freely available Python tools for computational molecular biology.

Page 8: Interdisciplinary Introductory Course in Bioinformatics

Course Objectives and Goals

To introduce students to the principles that drive an algorithm's design and to the intellectual content of bioinformatics.

To provide an opportunity for interdisciplinary collaboration in the in-class assignments and the course project.

Page 9: Interdisciplinary Introductory Course in Bioinformatics

Course Curriculum: Ethics, Computing and Genomics

Project-oriented component, new for Spring 08.

Worth 20% of the final grade; students had three weeks to work on the project.

Goal: to develop oral and written communication skills and to engage students in the knowledge exchange process.

Students learned about the ethics, computing, and genomics topic independently and presented the results of their self-learning.

Students were assigned one or more scholarly articles from the collection Ethics, Computing, and Genomics, edited by Herman Tavani.

Students were required to:
  read the assigned essays;
  prepare a 25-minute PowerPoint presentation with a summary of the paper and answers to the questions posed in the introductory part of the corresponding section;
  prepare a mini-quiz to assess their peers' understanding of the presented material.

Page 10: Interdisciplinary Introductory Course in Bioinformatics

Ethics, Computing and Genomics: Collaborative Work with an English Faculty Member

The English faculty member gave a short presentation before students started to work on this assignment. It covered:
  how to read critically and what questions to ask while reading the text;
  how to summarize the paper using the structure of the essay as a guide, elucidating key points and key moments of evidence while making connections to the rest of the class material;
  tips on writing the summary in three steps: prewriting, drafting, revising;
  how to design an effective presentation of information.

The English faculty member was present at all oral presentations and provided detailed notes for each student, pointing out the positive and negative aspects of the presentation and explaining ways it could have been stronger.

This successful and enjoyable experience showed the value of working with colleagues across disciplines to further student learning.

Page 11: Interdisciplinary Introductory Course in Bioinformatics

Course Curriculum: Introduction to Python

Students were introduced to Python quickly during the first few weeks of the course.

Students worked on different algorithmic problem-solving techniques.

Introductory topics: arithmetic, decision and loop structures, functions, and simple manipulations of strings, lists, tuples, and dictionaries.

Advanced Python topics were taught later throughout the course, building students' knowledge and their ability to tackle real-world biology problems.

The programming examples were all biology-oriented and motivated students to learn in order to solve practical problems.

Page 12: Interdisciplinary Introductory Course in Bioinformatics

Introduction to Python

Spring 05: 6 biology students with no programming experience and 6 CS and CIS students with no experience in Python. Students formed 6 interdisciplinary teams, and all concepts were practiced within the team.

Spring 08: all students were CS and CIS majors with prior programming experience in C and Java, and some had introductory knowledge of Python.

Special handouts were prepared to walk students through the introductory topics toward advanced Python concepts.

Each topic was supported by a list of examples in increasing order of complexity.

Students were required to run the proposed programs in order to gain understanding of basic Python structures.

To assess the understanding of each concept, students were required to write short programs solving biology-oriented problems.

Page 13: Interdisciplinary Introductory Course in Bioinformatics

Introduction to Python

Examples of the problems, given here in increasing order of complexity:
  computing the alignment score between two DNA sequences using different score matrices;
  finding the maximal alignment score if no internal gaps are allowed, using different score matrices;
  finding all occurrences of one sequence in another sequence;
  writing a program that reads a DNA sequence, transcribes the DNA into RNA and prints the resulting RNA sequence, then translates the RNA into a protein sequence: first the program divides the RNA into codons and prints the list of codons, then the codons are translated into the protein using the genetic code table;
  finding the maximal alignment score if internal gaps are allowed, using different score matrices.
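Two of the exercises listed above (finding all occurrences of one sequence in another, and transcribing and translating a DNA sequence) can be sketched in a few lines of Python. This is only an illustration, not the course's handout code: the function names are hypothetical, transcription is simplified to replacing T with U on the coding strand, and translation is assumed to stop at the first stop codon.

```python
# Standard genetic code; codons are ordered by base U, C, A, G at each position.
BASES = "UCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def find_occurrences(pattern, text):
    """Return every 0-based start index where pattern occurs in text (overlaps included)."""
    return [i for i in range(len(text) - len(pattern) + 1)
            if text.startswith(pattern, i)]


def transcribe(dna):
    """Simplified transcription (assumption): replace T with U on the coding strand."""
    return dna.upper().replace("T", "U")


def split_codons(rna):
    """Divide the RNA into consecutive codons, dropping an incomplete trailing codon."""
    return [rna[i:i + 3] for i in range(0, len(rna) - 2, 3)]


def translate(codons):
    """Translate codons into one-letter amino acids, stopping at the first stop codon."""
    protein = []
    for codon in codons:
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    rna = transcribe("ATGGCCTAA")
    codons = split_codons(rna)
    print(rna, codons, translate(codons))
```

Building the codon table from two strings keeps the sketch short; a handout for beginners might instead spell out the dictionary entry by entry.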

Page 14: Interdisciplinary Introductory Course in Bioinformatics

Introduction to Python

Spring 08: students had different levels of programming and computational experience, so the best way to cover this topic was through independent learning.

Students used the handouts and the Python and Biopython tutorials (www.python.org, www.biopython.org), each working at their own pace.

Grading rubrics were provided for each programming concept, with the minimal requirements to pass the specific concept and a list of more advanced examples for students with prior Python experience.

Students with previous Python knowledge further advanced their experience, and students new to Python learned the new programming language independently using structured guidance.

Python allows some problems to be solved very concisely, and students enjoyed trying to find the shortest solution to the proposed problems using Python functions and libraries.

Page 15: Interdisciplinary Introductory Course in Bioinformatics

Introduction to Bioinformatics Algorithms

Sequence alignments, scoring, gaps

Algorithm design techniques: exhaustive search, dynamic programming

The Needleman and Wunsch algorithm

The Smith-Waterman algorithm

Introduction to BLAST

Introduction to multiple sequence alignments

Visualization of algorithms: ALGGEN – EMBER Web resources

Page 16: Interdisciplinary Introductory Course in Bioinformatics

Introduction to Bioinformatics Algorithms

The dynamic programming technique is usually not covered in a core algorithms course.

Provided an opportunity to expand the theoretical background and to make connections between theory and practice.

Helped to maintain an appropriate level of theoretical content required for upper-level elective courses in our department.

This topic was very well blended with biology topics and students had an opportunity to learn the concept of sequence alignments from biology and computer science points of view.

The EMBER website provides a suite of multimedia bioinformatics educational tools, which made it possible to create a set of hands-on activities that helped students understand the dynamic programming technique in general and specific algorithms in particular.

Page 17: Interdisciplinary Introductory Course in Bioinformatics

Course Curriculum – Biology Topics

Biological research on the Web:
  Public biological databases and data formats
  NCBI (National Center for Biotechnology Information)
  Searching biological databases

Review of molecular biology and biochemistry concepts:
  DNA and protein structure
  Gene expression (transcription and translation)
  The central dogma of molecular biology

Sequence alignments

Page 18: Interdisciplinary Introductory Course in Bioinformatics

Hands-On Activities

Microbes Count! (BioQUEST Curriculum Consortium)

Exploring HIV Evolution: An Opportunity for Research

The HIV genome is very small and relatively simple. It is made up of nine genes and about 9,500 nucleotides.

In this lab students worked with HIV sequence data collected from 15 individuals from an intravenous-drug-using population in Baltimore.

The goal of the study was to determine if the HIV isolated from particular subgroups of subjects derives from a common source.

Tools: the CLUSTALW multiple sequence alignment tool; Biology Workbench: http://workbench.sdsc.edu/

Page 19: Interdisciplinary Introductory Course in Bioinformatics

Biology Topics – Hands-On Activities

Microarray lab "DNA Chips: Genes to Disease", developed by Campbell and Heyer and sold by Carolina Biologicals.

Understanding how microarrays are used to identify gene changes in disease and the role of gene expression in cancer.

Students compared the relative expression levels of six different genes in healthy lung cells and lung cancer cells.

After completing the lab, students had an opportunity to discuss the significance of the relative expression levels with respect to the genes' roles in causing cancer.

Page 20: Interdisciplinary Introductory Course in Bioinformatics

Biology Topics – Hands-On Activities

Epidemiology: the study of the distribution of diseases in populations.

Students explored factors that influence disease spread throughout populations with the software Epidemiology. Ebola was used as a model organism, and epidemiology was presented from both a microbiological and a social perspective.

Exploration of the structure and function of insulin: students generated a phylogenetic tree demonstrating the evolution of insulin among the vertebrates (animals with an internal skeleton made of bone).

Page 21: Interdisciplinary Introductory Course in Bioinformatics

Project: DNA Sequence Annotation

Real data: Bacillus anthracis str. Ames project at the J. Craig Venter Institute.

Input DNA: about 50,000 nucleotides long; students worked on different sequences from the same organism.

Project steps:

1. Find all potential genes and pseudo-genes in the input DNA sequence using start and end codons, and arrange the sequences found in two separate lists, each in order of increasing length: potential genes (length larger than 300) and pseudo-genes (length less than 300).

2. Locate the potential promoter in the given DNA sequence for each potential gene found in the first step, and calculate the strength of the promoter. A promoter is a region of DNA near the beginning of a gene that controls if and when the gene is actually expressed. Output: the list of potential genes in order of decreasing promoter strength.

3. BLAST all potential genes and pseudo-genes that were found, and analyze the results.
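The first project step can be sketched as follows. The slides do not say which codons or strands the project used, so this sketch rests on labeled assumptions: ATG as the start codon, TAA/TAG/TGA as end codons, the given strand only, length measured in nucleotides including the end codon, and a sequence of exactly 300 nucleotides placed with the pseudo-genes. The function names are hypothetical.

```python
START_CODON = "ATG"                  # assumption: the usual start codon
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: the standard end codons


def find_candidates(dna):
    """Return (start_position, length) for every start-to-end codon stretch,
    scanning the three reading frames of the given strand only."""
    dna = dna.upper()
    candidates = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == START_CODON:
                start = pos
            elif start is not None and codon in STOP_CODONS:
                candidates.append((start, pos + 3 - start))
                start = None
    return candidates


def split_by_length(candidates, threshold=300):
    """Split candidates into potential genes (longer than threshold) and
    pseudo-genes (the rest), each sorted by increasing length."""
    by_length = sorted(candidates, key=lambda item: item[1])
    genes = [c for c in by_length if c[1] > threshold]
    pseudo_genes = [c for c in by_length if c[1] <= threshold]
    return genes, pseudo_genes


if __name__ == "__main__":
    sequence = "CCATGAAATAGGG" + "ATG" + "AAA" * 100 + "TAA"
    genes, pseudo_genes = split_by_length(find_candidates(sequence))
    print("potential genes:", genes)
    print("pseudo-genes:", pseudo_genes)
```

Covering the complementary strand would mean running the same scan on the reverse complement of the input, which is left out here to keep the sketch short.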

Page 22: Interdisciplinary Introductory Course in Bioinformatics

Project

Summary: for each potential gene and pseudo-gene: start position, length, promoter score, BLAST results, summary, and conclusion. For each sequence, we asked students to determine whether a potential gene could be a real gene based on the strength of the promoter and the BLAST results.

15-minute in-class presentations covered the Python program, a description of all Python functions used and the purpose of each function, all algorithms and programming techniques used, and the presentation and explanation of the summary results, including information about the specific organism whose DNA was used as the input.

Spring 05: team project, with teams of CS/CIS and biology students. The programming part of the project was mostly done by the computer science students, and the biology students were required to understand and explain the programming techniques and algorithms that were used. The project made truly interdisciplinary collaboration between computer science and biology students possible.

Spring 08: individual project.

Page 23: Interdisciplinary Introductory Course in Bioinformatics

Course Results

Spring 05: no formal assessment survey was conducted. An informal discussion about the course was held at the end of the semester, and we asked students to provide their feedback. Students also completed teaching evaluations and provided their comments there.

All students expressed satisfaction with the course, and almost all of them asked that the programming component be extended. Biology students showed interest in programming and asked for an environment in which they could participate more fully in all stages of the course project.

Spring 08: a short post-course survey was designed to assess the students' experience. It listed the topics covered in the course, and we asked students to rate their level of learning for each topic on a scale of 1 (not well) to 5 (very well). Six students were enrolled in Spring 08.

Page 24: Interdisciplinary Introductory Course in Bioinformatics

Topic | Average

1. Introductory Python and ability to design simple Python programs | 3.7

2. Advanced Python topics: functions, loops, if-else statements, string manipulations, lists, and list manipulations | 3.5

3. Designing complex Python programs using advanced Python features | 3.3

4. Understanding the concept of sequence alignment: global, local, semi-global, multiple sequence alignment | 3.2

5. Understanding the dynamic programming algorithmic technique | 3.7

6. Understanding the exhaustive search (brute force) algorithmic technique | 4

7. Understanding the Needleman-Wunsch algorithm and being able to trace the algorithm to produce the final result | 3.8

8. Understanding the Smith-Waterman algorithm and being able to trace the algorithm to produce the final result | 3.8

9. The ability to work independently on the research-based project, applying computer science and biology knowledge to solve problems | 4.3

10. Understanding how to use the BLAST tool and to read the results of BLAST | 4.2

11. Using sequence alignments to understand relatedness among species | 3.8

Page 25: Interdisciplinary Introductory Course in Bioinformatics

12. Using sequence alignments forensically (HIV experiment) | 4.2

13. Understanding how microarrays are used to identify gene changes in disease | 3.3

14. Understanding the flow of information from DNA to protein | 3.3

15. Using computer simulations to test hypotheses about disease spread | 3.8

16. The ability to read a research paper in the Ethics, Computing and Genomics | 3.8

17. The ability to communicate effectively through participation in the Ethics, Computing and Genomics project | 4

18. The ability to create an informative PowerPoint presentation to present the results of the Ethics, Computing and Genomics project | 4.3

19. The ability to learn the topic by yourself and the ability to present the results of learning in a clear way | 3.8

Page 26: Interdisciplinary Introductory Course in Bioinformatics

Course Results

All topics received an average rating above the midpoint of the 1 to 5 scale (averages ranged from 3.2 to 4.3).

Some of the topics require special attention and should be revised for future iterations.

The Ethics, Computing and Genomics component received positive feedback from most of the students.

Comments regarding the course: most of the students mentioned that they loved the course and would recommend it to their peers; they expressed satisfaction with the level of the course, the amount of material covered, and the depth of the coverage. They also mentioned that the final project was very interesting, but asked for a more careful project description with clear rules for finding genes on the main and complementary strands, to avoid confusion.

Page 27: Interdisciplinary Introductory Course in Bioinformatics

Future Plans

Blend computer science and biology topics more effectively

Invite a guest speaker from the field

Foster student learning through a research-based process

Further expand the programming and algorithms component

Return to team work in the project in order to enhance the collaborative component of the course

Provide a more careful project description

Bioinformatics across the computer science curriculum:
  Introduction to computer science
  Design and analysis of algorithms
  Programming for non-majors

Page 28: Interdisciplinary Introductory Course in Bioinformatics

References

1. N. C. Jones and P. A. Pevzner, An Introduction to Bioinformatics Algorithms, The MIT Press, 2004.

2. D. E. Krane and M. L. Raymer, Fundamental Concepts of Bioinformatics, Benjamin Cummings, 2002.

3. C. Gibas and P. Jambeck, Developing Bioinformatics Computer Skills, O'Reilly, 2001.

4. Python/Biopython websites: http://python.org, http://biopython.org

5. ALGGEN – EMBER Web Resources: http://alggen.lsi.upc.es/docencia/ember/frame-ember.html

6. John R. Jungck, Ethel D. Stanley, and Marion Field Fass, Microbes Count!, BioQUEST Curriculum Consortium. http://bioquest.org/microbescount/modules_by_tools.pdf

7. Laurie J. Heyer and A. Malcolm Campbell, Microarray Lab: DNA Chips: Genes to Disease, Carolina Biologicals.

8. Neil Campbell and Jane Reece, Biology, 7th edition, Benjamin Cummings, 2004.

Page 29: Interdisciplinary Introductory Course in Bioinformatics

BLAST

The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences.

The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches.

BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families.

Introduced by S. Altschul, W. Gish, W. Miller, E. Myers, and D. Lipman in 1990.

The original BLAST algorithm searches a sequence database for maximal un-gapped local alignments.

Page 30: Interdisciplinary Introductory Course in Bioinformatics

Sequence Alignment

Global alignment: compares two sequences in their entirety; the gap penalty is assessed regardless of whether gaps are located internally within a sequence or at the end of one or both sequences. Algorithm: Needleman and Wunsch.

Local alignment: finds the best-matching subsequences within the two sequences. Algorithm: Smith-Waterman.

Page 31: Interdisciplinary Introductory Course in Bioinformatics

Sequence Alignment

Semi-global alignment: terminal (end) gaps are treated differently from internal gaps (typically they are not penalized). Terminal gaps are usually the result of incomplete data and do not have biological significance. Example: searching for the best alignment between a short sequence and an entire genome.

Algorithm: a modification of the Needleman and Wunsch algorithm.
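One common way to modify the Needleman and Wunsch algorithm so that terminal gaps cost nothing is to initialize the first row and column to zero and to take the best score found in the last row or last column. The slides do not say which modification was taught, so the sketch below is an assumption; the function name and the default scores (match = 1, mismatch = 0, gap = -1) are illustrative.

```python
def semi_global_score(x, y, match=1, mismatch=0, gap=-1):
    """Best alignment score of x and y when leading and trailing gaps are free."""
    m, n = len(x), len(y)
    # Row 0 and column 0 stay at 0, so gaps before either sequence cost nothing.
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # x_i aligned to y_j
                          F[i][j - 1] + gap,     # y_j aligned to a gap
                          F[i - 1][j] + gap)     # x_i aligned to a gap
    # Gaps after either sequence are free too: the alignment may end anywhere
    # in the last row or the last column.
    return max(max(F[m]), max(row[n] for row in F))


if __name__ == "__main__":
    print(semi_global_score("AACT", "ACG"))
```

Internal gaps are still charged the full gap penalty; only the treatment of the ends differs from the global case.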

Page 32: Interdisciplinary Introductory Course in Bioinformatics

Algorithm Design Techniques

Exhaustive Search (brute force) algorithm examines every possible alternative to find one particular solution

Dynamic Programming Algorithm breaks the problem into smaller sub-problems and uses the solutions of the sub-problems to construct the solution of the larger problem.
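As a small illustration of exhaustive search, the sketch below finds the best alignment of two sequences without internal gaps by trying every possible shift of one sequence against the other and scoring each. The function names and the default scores (match = 1, mismatch = 0) are assumptions, and overhanging ends are simply left unscored.

```python
def ungapped_score(x, y, match=1, mismatch=0):
    """Score two sequences position by position over their overlapping prefix."""
    return sum(match if a == b else mismatch for a, b in zip(x, y))


def best_ungapped_alignment(x, y, match=1, mismatch=0):
    """Exhaustive search: examine every shift of y relative to x.

    A non-negative offset means y starts at position offset of x; a negative
    offset means x starts at position -offset of y.
    Returns (best_score, best_offset), keeping the first best offset found.
    """
    best_score, best_offset = None, None
    for offset in range(-(len(y) - 1), len(x)):
        if offset >= 0:
            score = ungapped_score(x[offset:], y, match, mismatch)
        else:
            score = ungapped_score(x, y[-offset:], match, mismatch)
        if best_score is None or score > best_score:
            best_score, best_offset = score, offset
    return best_score, best_offset


if __name__ == "__main__":
    print(best_ungapped_alignment("AACT", "ACG"))
```

The number of shifts grows only linearly with the sequence lengths, so brute force is practical here; once internal gaps are allowed, the number of possible alignments grows far too quickly, which is where dynamic programming comes in.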

Page 33: Interdisciplinary Introductory Course in Bioinformatics

Needleman and Wunsch Algorithm

Input: two strings X = x1…xM and Y = y1…yN and scoring rules: scoring matrix s and gap penalty GP

Output: An alignment of X and Y whose score as defined by scoring rules is maximal among all possible alignments of X and Y

Page 34: Interdisciplinary Introductory Course in Bioinformatics

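Let F(i, j) = optimal score of aligning x1…xi and y1…yj

Initialization: F(0, 0) = 0, F(i, 0) = i·GP for i = 1…M, F(0, j) = j·GP for j = 1…N (that is, -i and -j when GP = -1)

Main Iteration: For each i = 1…M and j = 1…N

$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j) & \text{(case 1)} \\ F(i,j-1) + GP & \text{(case 2)} \\ F(i-1,j) + GP & \text{(case 3)} \end{cases}$$

$$TraceBackP(i,j) = \begin{cases} \text{diagonal} & \text{if case 1} \\ \text{left} & \text{if case 2} \\ \text{up} & \text{if case 3} \end{cases}$$

Termination: F(M, N) is the optimal score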

Page 35: Interdisciplinary Introductory Course in Bioinformatics

Finding the optimal alignment

Every non-decreasing path from (0, 0) to (M, N) corresponds to a global alignment of the two sequences.

Use TraceBackP starting at (M, N) to trace back an optimal alignment:

case 1 (TraceBackP = diagonal): xi aligns to yj

case 2 (TraceBackP = left): yj aligns to a gap

case 3 (TraceBackP = up): xi aligns to a gap

Page 36: Interdisciplinary Introductory Course in Bioinformatics

Global Alignment Example

Find the optimal global alignment of AACT and ACG.

Scoring rules: match = 1, mismatch = 0, gap penalty GP = -1

Matrix F (rows: AACT, columns: ACG):

  |    | A  | C  | G
  | 0  | -1 | -2 | -3
A | -1 | 1  | 0  | -1
A | -2 | 0  | 1  | 0
C | -3 | -1 | 1  | 1
T | -4 | -2 | 0  | 1

Optimal alignments:

Alignment 1, score = 1
A A C T
- A C G

Alignment 2, score = 1
A A C T
A - C G
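The Needleman and Wunsch fill and traceback can be sketched in Python as below. This is a minimal illustration rather than the course's code: it uses a simple match/mismatch score in place of a general scoring matrix s, initializes the borders with multiples of the gap penalty, and breaks ties by preferring the diagonal move, then left, then up, so it reports only one of the optimal alignments. The function name is hypothetical.

```python
def needleman_wunsch(x, y, match=1, mismatch=0, gap=-1):
    """Global alignment; returns (F, aligned_x, aligned_y)."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # x prefix aligned only to gaps
        F[i][0] = i * gap
    for j in range(1, n + 1):          # y prefix aligned only to gaps
        F[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # case 1: diagonal
                          F[i][j - 1] + gap,     # case 2: left
                          F[i - 1][j] + gap)     # case 3: up

    # Traceback from (m, n) to (0, 0), recomputing which case produced each cell.
    aligned_x, aligned_y = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if x[i - 1] == y[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            aligned_x.append(x[i - 1]); aligned_y.append(y[j - 1])
            i, j = i - 1, j - 1
        elif j > 0 and F[i][j] == F[i][j - 1] + gap:
            aligned_x.append("-"); aligned_y.append(y[j - 1])
            j -= 1
        else:
            aligned_x.append(x[i - 1]); aligned_y.append("-")
            i -= 1
    return F, "".join(reversed(aligned_x)), "".join(reversed(aligned_y))


if __name__ == "__main__":
    F, ax, ay = needleman_wunsch("AACT", "ACG")
    for row in F:
        print(row)
    print(ax)
    print(ay)
```

Run on AACT and ACG with the scoring rules of the example, the sketch fills the same matrix as the slide and traces back Alignment 1; a different tie-breaking order would yield Alignment 2.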

Page 37: Interdisciplinary Introductory Course in Bioinformatics

Smith-Waterman Algorithm

Input: Strings X and Y and scoring rules: scoring matrix s and gap penalty GP.

Output: Substrings of X and Y whose global alignment score, as defined by the scoring rules, is maximal among all global alignments of all substrings of X and Y.

Page 38: Interdisciplinary Introductory Course in Bioinformatics

Initialization: F(0, 0) = 0, F(i, 0) = 0 for i = 1…M, F(0, j) = 0 for j = 1…N

Main Iteration: For each i = 1…M and j = 1…N

$$F(i,j) = \max \begin{cases} 0 \\ F(i-1,j-1) + s(x_i, y_j) \\ F(i,j-1) + GP \\ F(i-1,j) + GP \end{cases}$$

The largest value of F(i, j) represents the score of the best local alignment of X and Y.

Traceback begins at the highest score in the matrix and continues until a cell with score 0 is reached.

Page 39: Interdisciplinary Introductory Course in Bioinformatics

Local Alignment Example

Find the optimal local alignment of AACT and ACG.

Scoring rules: match = 1, mismatch = 0, gap penalty GP = -1

Matrix F (rows: AACT, columns: ACG):

  |   | A | C | G
  | 0 | 0 | 0 | 0
A | 0 | 1 | 0 | 0
A | 0 | 1 | 1 | 0
C | 0 | 0 | 2 | 1
T | 0 | 0 | 1 | 2

Solution: local alignment score = 2
A C
A C
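The Smith-Waterman fill and traceback can be sketched in Python in the same style. Again this is an illustration under stated assumptions: a match/mismatch score stands in for the scoring matrix s, the traceback starts from the first cell (in row order) holding the highest score, and ties between moves are broken diagonal first, then left, then up. The function name is hypothetical.

```python
def smith_waterman(x, y, match=1, mismatch=0, gap=-1):
    """Local alignment; returns (best_score, aligned_x, aligned_y, F)."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]   # borders are all 0
    best, best_i, best_j = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(0,
                          F[i - 1][j - 1] + s,
                          F[i][j - 1] + gap,
                          F[i - 1][j] + gap)
            if F[i][j] > best:                  # keep the first highest-scoring cell
                best, best_i, best_j = F[i][j], i, j

    # Traceback from the highest score until a cell with score 0 is reached.
    aligned_x, aligned_y = [], []
    i, j = best_i, best_j
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            aligned_x.append(x[i - 1]); aligned_y.append(y[j - 1])
            i, j = i - 1, j - 1
        elif F[i][j] == F[i][j - 1] + gap:
            aligned_x.append("-"); aligned_y.append(y[j - 1])
            j -= 1
        else:
            aligned_x.append(x[i - 1]); aligned_y.append("-")
            i -= 1
    return best, "".join(reversed(aligned_x)), "".join(reversed(aligned_y)), F


if __name__ == "__main__":
    score, ax, ay, F = smith_waterman("AACT", "ACG")
    for row in F:
        print(row)
    print(score, ax, ay)
```

Run on AACT and ACG with the scoring rules of the example, the sketch fills the same matrix and recovers the alignment of AC with AC. Because the mismatch score is 0, the cell in the last row and column also holds the score 2; starting the traceback there would give the equally scored alignment of ACT with ACG.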