Page 1: Computer Programming for Biologists

Computer Programming for Biologists

Class 5

Nov 20th, 2014

Karsten Hokamp

http://bioinf.gen.tcd.ie/GE3M25/programming

Page 2: Computer Programming for Biologists

Overview

Project

Program Exit

Random numbers

Regular Expressions

Page 3: Computer Programming for Biologists

Project

Task 1: Report length of a sequence in Fasta format

Understand the problem, consider input/output:

>Tmsb10
ATGGCAGACAAGCCGGACATGGGGGAAATCGCCAGCTTCGATAAGGCCAAGCTGAAGAAAACCGAGACGCAGGAGAAGAACACCCTGCCGACCAAAGAGACCATTGAACAGGAAAAGAGGAGTGAAATCTCCTAA

Sequence length is 135 bp.

Page 4: Computer Programming for Biologists

Project

Problems:

1. File contains header line

2. Sequence contains line-breaks

>Tmsb10
ATGGCAGACAAGCCGGACATGGGGGAAATCGCCAGCTTCGATAAGGCCAAGCTGAAGAAAACCGAGACGCAGGAGAAGAACACCCTGCCGACCAAAGAGACCATTGAACAGGAAAAGAGGAGTGAAATCTCCTAA

Page 5: Computer Programming for Biologists

Project

Steps:

1. Read in file content (line-by-line)

2. Remove line-breaks

3. Skip header line

4. Concatenate sequence into one long string

5. Calculate and report length

Page 6: Computer Programming for Biologists

Project

Steps:

# 1. Read in file content (line-by-line)
while ($input = <>) {
}

Page 7: Computer Programming for Biologists

Project

Steps:

# 1. Read in file content (line-by-line)
while ($input = <>) {
    # 2. Remove line-breaks
    # 3. Skip header line
    # 4. Concatenate sequence into one long string
}

Page 8: Computer Programming for Biologists

Project

Steps:

# 1. Read in file content (line-by-line)
while ($input = <>) {
    # 2. Remove line-breaks
    chomp $input;
    # 3. Skip header line
    # 4. Concatenate sequence into one long string
}

Page 9: Computer Programming for Biologists

Project

Steps:

# 1. Read in file content (line-by-line)
while ($input = <>) {
    # 2. Remove line-breaks
    chomp $input;
    # 3. Skip header line
    # 4. Concatenate sequence into one long string
    $sequence .= $input;
}

Page 10: Computer Programming for Biologists

Project

# 1. Read in file content (line-by-line)
while ($input = <>) {
    # 2. Remove line-breaks
    chomp $input;
    # 3. Skip header line
    # 4. Concatenate sequence into one long string
    $sequence .= $input;
}
# 5. Calculate and report length
$length = length($sequence);
print "Sequence length: $length bp\n";

Page 11: Computer Programming for Biologists

Project

# 1. Read in file content (line-by-line)
while ($input = <>) {
    # 2. Remove line-breaks
    chomp $input;
    # 3. Skip header line (check for '>' in first position)
    # extract first character:
    $first = substr $input, 0, 1;
    # is it a '>'?
    if ($first eq '>') {
        # skip this line
        next;
    }
    # 4. Concatenate sequence into one long string
    $sequence .= $input;
}

Page 13: Computer Programming for Biologists

Project (alternative version)

# 1. Read in file content (line-by-line)
while ($input = <>) {
    # 2. Remove line-breaks
    chomp $input;
    # 3. Skip header line (check for '>' in first position)
    # extract first character:
    $first = substr $input, 0, 1;
    # is it a '>'?
    unless ($first eq '>') {
        # this must be part of the sequence
        $sequence .= $input;
    }
}

Page 14: Computer Programming for Biologists

Project

# 1. Read in file content (line-by-line)
while ($input = <>) {
    # 2. Remove line-breaks
    chomp $input;
    # 3. Skip header line (check for '>' in first position)
    # extract first character:
    $first = substr $input, 0, 1;
    # is it a '>'?
    if ($first eq '>') {
        # skip this line
        next;
    }
    # 4. Concatenate sequence into one long string
    $sequence .= $input;
}
# 5. Calculate and report length
$length = length($sequence);
print "Sequence length: $length bp\n";

Page 15: Computer Programming for Biologists

Project

# Suggestions for the start of the script:

# make sure a file has been provided
unless (@ARGV) {
    die "Please specify file name on command line!";
}

# initialise sequence variable
$sequence = '';

# 1. Read in file content (line-by-line)
while ($input = <>) {
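The fragments from the preceding slides can be combined into one script. The following is a minimal sketch, not the course's reference solution. It assumes `use strict` and `use warnings` (so every variable is declared with `my`), and it wraps the loop in a hypothetical helper, `sequence_length`, that takes the lines as a list so the logic can be tried without a file. The example lines are made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the slides): returns the length of the
# sequence contained in a list of Fasta lines.
sub sequence_length {
    my @lines = @_;
    my $sequence = '';                    # initialise sequence variable
    foreach my $input (@lines) {
        chomp $input;                     # 2. remove line-break
        my $first = substr $input, 0, 1;  # first character of the line
        next if $first eq '>';            # 3. skip header line
        $sequence .= $input;              # 4. concatenate
    }
    return length $sequence;              # 5. calculate length
}

# Made-up example lines held in memory instead of a file:
my @fasta = (">example\n", "ATGGCA\n", "GACAAG\n");
my $length = sequence_length(@fasta);
print "Sequence length: $length bp\n";
```

To read from a file named on the command line, as in the slides, the list could be replaced by `sequence_length(<>)`, because `<>` returns all lines at once in list context.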

Page 16: Computer Programming for Biologists

Exiting a program

1. automatic exit at end of script

2. explicit exit with value:

exit 0; # default

or

exit 1; # normally indicates an error

3. exit on failure:

die "error message";

(ending the message with "\n" suppresses the script name and line number that die otherwise appends)

Page 17: Computer Programming for Biologists

Exiting a program

Example:
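A minimal sketch of these exit paths, not the example originally shown on this slide. The helper `check_arguments` and the file name `sequence.fa` are hypothetical, and `eval` is used only so that the demonstration can continue after `die`.

```perl
use strict;
use warnings;

# Hypothetical helper: dies unless a file name was supplied.
sub check_arguments {
    my @args = @_;
    # Trailing "\n": the message is printed without " at SCRIPT line N."
    die "Please specify file name on command line!\n" unless @args;
    return $args[0];
}

# eval catches the die; the message is then available in $@
my $file = eval { check_arguments() };
print "with newline:    $@" if $@;

# Without a trailing newline, die appends the script name and line number
eval { die "Something went wrong" };
print "without newline: $@";

$file = check_arguments('sequence.fa');
print "Would read from $file\n";

# A real script could finish with an explicit exit value:
#   exit 0;   # success
#   exit 1;   # error
```

The exit value is what the shell sees after the script ends, so other programs can check whether the script succeeded.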

Page 18: Computer Programming for Biologists

Project

Practical:

http://bioinf.gen.tcd.ie/GE3M25/programming/class5

Page 19: Computer Programming for Biologists

Regular Expressions

• constructs that describe patterns

• powerful methods for text processing

• search for patterns in a string

• search and extract patterns

• search and replace patterns

• pattern at which to split a string

Page 20: Computer Programming for Biologists

Regular Expressions

Examples:

• Look for a motif in a DNA/protein sequence

• Find low complexity repeats and mask with x's

• Find start of sequence string in GenBank record

• Extract e-mail addresses from a web page

• Replace strings, e.g.: '@tcd.ie' with '@gmail.com'

Page 25: Computer Programming for Biologists

Regular Expressions

Find a pattern in a string (stored in a variable):

$sequence = 'ataggctagctaga';
if ( $sequence =~ /ctag/ ) {
    print 'Found!';
}

$sequence = string in which to search

=~ = binding operator

ctag = pattern

/ / = delimiters

Page 26: Computer Programming for Biologists

Regular Expressions

Find a pattern in a string (stored in $_):

$_ = 'ataggctagctaga';
if ( /ctag/ ) {
    print 'Found!';
}

ctag = pattern

/ / = delimiters

Without binding the pattern to a variable with =~, the regular expression works on $_
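The default variable is most useful inside loops, where each element is placed in `$_` automatically. A minimal sketch, assuming a hypothetical helper `with_motif` and made-up sequences:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the sequences that contain the motif 'ctag'.
sub with_motif {
    my @hits;
    foreach (@_) {                    # each element is placed in $_
        push @hits, $_ if /ctag/;     # no =~, so the match uses $_
    }
    return @hits;
}

my @found = with_motif('ataggctagctaga', 'aaaaaa', 'ttctagtt');
print "Found: @found\n";
```

With an explicit loop variable the same test would be written `$sequence =~ /ctag/`, which is often easier to read in longer scripts.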

Page 27: Computer Programming for Biologists

Regular Expressions

Search modifier:

i = make search case-insensitive

$sequence = 'ataggctagctaga';
if ( $sequence =~ /TAG/i ) {
    print 'Found!';
}

Page 28: Computer Programming for Biologists

Regular Expressions

Metacharacters:

^ = match at the beginning of the string (with the m modifier: of each line)

$ = match at the end of the string or before a final newline (with the m modifier: of each line)

. = match any character (except newline)

\ = escape the next metacharacter

$sequence = ">sequence1\natgacctggaataggat";
if ( $sequence =~ /^>/ ) { # string starts with '>'
    print 'Found Fasta header!';
}

/\.$/ matches dot at end of line
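A minimal sketch of how the anchors and the escape character behave, assuming two hypothetical helpers and a made-up two-record string. It uses the `m` modifier, which lets `^` and `$` match at every line of a multi-line string, and the `g` modifier, which in list context returns all matches.

```perl
use strict;
use warnings;

# Hypothetical helper: returns all header lines found in a multi-line string.
sub fasta_headers {
    my ($record) = @_;
    # m: ^ and $ match at every line; g in list context returns all matches
    my @headers = $record =~ /^>.*$/mg;
    return @headers;
}

# Hypothetical helper: true if the name ends in a literal ".txt".
sub has_txt_extension {
    my ($name) = @_;
    return $name =~ /\.txt$/ ? 1 : 0;    # \. matches only a real dot
}

my $record = ">sequence1\natgacctggaataggat\n>sequence2\nttgaca\n";
my @headers = fasta_headers($record);
print "Headers: @headers\n";

print "file.txt: ", has_txt_extension('file.txt'), "\n";
print "file_txt: ", has_txt_extension('file_txt'), "\n";
```

Without the backslash, `/.txt$/` would also accept 'file_txt', because an unescaped dot matches any character.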

Page 29: Computer Programming for Biologists

Project

Exercise:

Modify your course project (sequanto.pl) to use a regular expression for detection of a header line instead of 'substr' and 'eq' to check first character.

Page 30: Computer Programming for Biologists

Regular Expressions

Matching repetition:

a? = match 'a' 0 or 1 times

a* = match 'a' 0 or more times, i.e., any number of times

a+ = match 'a' 1 or more times, i.e., at least once

a{n,m} = match 'a' at least n times, but not more than m times

a{n,} = match 'a' at least n times

a{n} = match 'a' exactly n times

$sequence =~ /a{5,}/; # finds repeats of 5 or more 'a's
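A minimal sketch of the `{n,}` quantifier in use, assuming a hypothetical helper `poly_a_runs` and a made-up sequence that is assembled with the `x` repetition operator so the run lengths are easy to see.

```perl
use strict;
use warnings;

# Hypothetical helper: returns every run of at least $min 'a's in a sequence.
sub poly_a_runs {
    my ($sequence, $min) = @_;
    # $min is interpolated into the quantifier, e.g. a{5,}
    my @runs = $sequence =~ /a{$min,}/g;
    return @runs;
}

# Made-up sequence with runs of 6, 3 and 8 'a's
my $sequence = 'tt' . ('a' x 6) . 'gc' . ('a' x 3) . 'gc' . ('a' x 8) . 't';

foreach my $run (poly_a_runs($sequence, 5)) {
    print "Found run of ", length($run), " a's\n";
}
```

The run of three is skipped because it is shorter than the minimum of five.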

Page 31: Computer Programming for Biologists

Regular Expressions

Search for classes of characters

\d = match a digit character

\w = match a word character (alphanumeric and '_')

\D = match a non-digit character

\W = match a non-word character

\s = match a whitespace character

\S = match a non-whitespace character

$date = '30 Jan 2009';
if ( $date =~ /\d{1,2} \w+ \d{2,4}/ ) {
    print 'Correct date format!';
}

also matches '1 February 09'
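Because the pattern on this slide is not anchored, it also succeeds when the date is only part of a longer string. A minimal sketch, assuming a hypothetical helper `looks_like_date` that adds `^` and `$` so the whole string has to fit:

```perl
use strict;
use warnings;

# Hypothetical helper: true if the whole string looks like "DD Mon YYYY".
sub looks_like_date {
    my ($text) = @_;
    return $text =~ /^\d{1,2} \w+ \d{2,4}$/ ? 1 : 0;
}

foreach my $candidate ('30 Jan 2009', '1 February 09', 'due on 30 Jan 20091') {
    my $verdict = looks_like_date($candidate) ? 'accepted' : 'rejected';
    print "'$candidate': $verdict\n";
}
```

This is still only a format check: `\w+` also accepts digits, so '30 12 2009' would pass, and no calendar validation takes place.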

Page 32: Computer Programming for Biologists

Regular Expressions

Match special characters

\t = matches a tabulator (tab)

\b = matches a word boundary

\r = matches return

\n = matches UNIX newline

\cM = matches Control-M (carriage return, part of the line-ending in Windows)

while (my $line = <>) {
    if ($line =~ /\cM/) {
        warn "Windows line-ending detected!";
    }
}
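On a UNIX system `chomp` removes only the newline, so a carriage return from a Windows file would stay at the end of the line and be counted as part of the sequence. A minimal sketch of one way to handle both cases, assuming a hypothetical helper `strip_line_ending`:

```perl
use strict;
use warnings;

# Hypothetical helper: removes a UNIX (\n) or Windows (\r\n) line ending.
sub strip_line_ending {
    my ($line) = @_;
    $line =~ s/\r?\n$//;    # optional carriage return, then newline, at the end
    return $line;
}

my @lines = ("atgc\n", "ttga\r\n");
foreach my $line (@lines) {
    my $clean = strip_line_ending($line);
    print length($clean), " characters\n";
}
```

Both lines report four characters once the line ending has been removed.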

Page 33: Computer Programming for Biologists

Regular Expressions

Search for range of characters

[ ] = match exactly one character out of those listed within the brackets

- = specifies a range, e.g. [a-z], or [0-9]

^ = as first character inside the brackets: match any character not in the list, e.g. [^A-Z]

$sequence = 'ataggctapgctaga';
if ( $sequence =~ /[^acgt]/ ) {
    print "Sequence contains non-DNA character: $&";
}

$& is a special variable containing the last pattern match

$` and $' contain strings before and after match
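The test on this slide reports only the first offending character. A minimal sketch that lists all of them with their positions, assuming a hypothetical helper `non_dna_characters` and a made-up sequence. It relies on the `g` modifier in a `while` loop and on `pos()`, which returns the offset just after the last match.

```perl
use strict;
use warnings;

# Hypothetical helper: returns "character@position" for every non-DNA
# character in a sequence (positions are 1-based).
sub non_dna_characters {
    my ($sequence) = @_;
    my @found;
    while ($sequence =~ /[^acgt]/gi) {
        # $& holds the matched character; pos() is the offset after it,
        # which equals its 1-based position for a one-character match
        push @found, $& . '@' . pos($sequence);
    }
    return @found;
}

my @problems = non_dna_characters('ataggctapgctanga');
print "Non-DNA characters: @problems\n";
```

The `i` modifier makes the check accept upper-case bases as well.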

Page 34: Computer Programming for Biologists

Regular Expressions

Search and replace (substitute):

s/pattern/replacement/

$sequence = 'ataggctagctaga';
$rna = $sequence;
$rna =~ s/t/u/;

-> 'auaggctagctaga'

Only the first match will be replaced!

Page 35: Computer Programming for Biologists

Regular Expressions

Modifiers for substitution:

i = case-insensitive

g = global

s = let . also match a newline

$sequence = 'ataggctagctaga';
$rna = $sequence;
$rna =~ s/t/u/g;

-> 'auaggcuagcuaga'

replaces all 't' in the string with 'u'
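A substitution also returns the number of replacements it made, which can serve as a quick check. A minimal sketch, assuming a hypothetical helper `transcribe`:

```perl
use strict;
use warnings;

# Hypothetical helper: transcribes DNA to RNA and reports how many
# bases were replaced.
sub transcribe {
    my ($dna) = @_;
    my $rna = $dna;
    # s///g returns the number of substitutions (an empty string if none);
    # "|| 0" turns the empty string into 0
    my $count = ($rna =~ s/t/u/g) || 0;
    return ($rna, $count);
}

my ($rna, $count) = transcribe('ataggctagctaga');
print "$rna ($count replacements)\n";
```

For the sequence from the slide this prints 'auaggcuagcuaga (3 replacements)'.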

Page 36: Computer Programming for Biologists

Regular Expressions

Example: Clean up a sequence string:

$sequence = "
 1 ataggctagctagat
16 ttagagctagta
";
$sequence =~ s/[^actg]//g;

-> 'ataggctagctagatttagagctagta'

Deletes everything that is not a, c, t, or g.
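The same substitution syntax can mask low complexity repeats with x's, one of the uses listed at the start of this section. The following is a minimal sketch, not a method taken from the course. It assumes a hypothetical helper `mask_repeats`, treats only runs of a single repeated base as low complexity, and uses two features not covered on these slides: a back-reference `\1` inside the pattern and the `e` modifier, which evaluates the replacement as Perl code.

```perl
use strict;
use warnings;

# Hypothetical helper: replaces every run of $min or more identical
# bases with the same number of 'x' characters.
sub mask_repeats {
    my ($sequence, $min) = @_;
    my $extra = $min - 1;
    # ([acgt]) captures one base, \1{$extra,} requires further copies of it;
    # e evaluates the replacement: 'x' repeated for the length of the match
    $sequence =~ s/([acgt])\1{$extra,}/'x' x length($&)/ge;
    return $sequence;
}

# Made-up sequence with a run of 6 'a's and a run of 7 't's
my $sequence = 'at' . ('a' x 6) . 'gc' . ('t' x 7) . 'ga';
print mask_repeats($sequence, 5), "\n";
```

Keeping the masked string the same length as the original means that positions in the sequence stay valid after masking.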

Page 37: Computer Programming for Biologists

Regular Expressions

Extract matched patterns:

- put patterns in parentheses

- \1, \2, \3, … refer back to ()'s within pattern match

- $1, $2, $3, … refer back to ()'s after pattern match

$sequence = ">test\natgtagagctagta";
if ($sequence =~ /^>(.*)/) {
    $id = $1;
}

or

$email =~ s/(.*)\@(.*)\.(.*)/$1 at $2 dot $3/;
print "Changed address to $1 at $2 dot $3\n";

changes '[email protected]' to 'kahokamp at tcd dot ie'
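Several capture groups can be read out in one step. A minimal sketch, assuming a hypothetical helper `parse_header` and a made-up header line in which a description follows the identifier:

```perl
use strict;
use warnings;

# Hypothetical helper: splits a Fasta header into identifier and description.
sub parse_header {
    my ($header) = @_;
    # (\S+) = identifier up to the first whitespace, (.*) = rest of the line
    if ($header =~ /^>(\S+)\s*(.*)/) {
        return ($1, $2);
    }
    return;    # empty list: not a header line
}

my ($id, $description) = parse_header('>Tmsb10 thymosin beta 10');
print "ID: $id\n";
print "Description: $description\n";
```

Checking the match inside `if` before using `$1` and `$2` matters, because after a failed match these variables still hold the values from the last successful one.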

Page 38: Computer Programming for Biologists

Project

Practical:

http://bioinf.gen.tcd.ie/GE3M25/programming/class5

Page 39: Computer Programming for Biologists

Regular Expressions in split

Change a string into an array of characters:

@array = split //, $string;

Split input line at tabs:

@columns = split /\t/, $input_line;

Default splits $_ on whitespace:

while (<>) {
    @columns = split;
}
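A minimal sketch that puts the first two forms of `split` to work, assuming a hypothetical helper `count_characters` and a made-up tab-separated line. It uses a hash to hold the counts.

```perl
use strict;
use warnings;

# Hypothetical helper: counts how often each character occurs in a string.
sub count_characters {
    my ($string) = @_;
    my %count;
    foreach my $char (split //, $string) {   # empty pattern: one element per character
        $count{$char}++;
    }
    return %count;
}

my %bases = count_characters('ataggctagctaga');
foreach my $base (sort keys %bases) {
    print "$base: $bases{$base}\n";
}

# Split a made-up tab-separated line into columns
my @columns = split /\t/, "Tmsb10\t135\tbp";
print "Second column: $columns[1]\n";
```

For the sequence used throughout this section the counts are a: 5, c: 2, g: 4 and t: 3.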