BINF634 FALL15 - LECTURE 4 1
Topics
Logical expressions
String functions: substr and index
Random numbers and mutation
Hashes
Transcription, translation, genetic code
Quiz 2
Logical Value of Expressions
Any expression in Perl can be interpreted as a logical value, true or false.
Scalar context: false if 0, "0", "", or undefined; otherwise true.
Arrays and lists: false if empty, (); otherwise true.
my $x;    # or: my $x = 0;
if ($x) { $x++ }
else { $x = 17 }
print "$x\n";
Output:
17

my $x = 2;
while ($x) {
    print $x--, "\n";
}
Output:
2
1

my @a = ("A", "T");
while (@a) {
    print shift @a, "\n";
}
Output:
A
T
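The truth rules above can be checked directly. The following is a minimal sketch; the helper name truth is made up for this illustration and is not part of the lecture code.

```perl
use strict;
use warnings;

# Hypothetical helper: report how Perl treats a value used as a condition.
sub truth {
    my ($value) = @_;
    return $value ? "true" : "false";
}

print truth(0),     "\n";   # false
print truth("0"),   "\n";   # false: the string "0" is also false
print truth(""),    "\n";   # false
print truth(undef), "\n";   # false
print truth("0.0"), "\n";   # true: a non-empty string other than "0"
print truth("00"),  "\n";   # true
print truth(" "),   "\n";   # true

my @empty = ();
my @bases = ("A", "T");
print truth(scalar @empty), "\n";   # false: zero elements
print truth(scalar @bases), "\n";   # true: two elements
```

Note that "0.0" and "00" are true as strings even though they are numerically zero; only 0, "0", "" and undef are false.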
Logical Operators
$a and $b    $a if $a is false, $b otherwise
$a or $b     $a if $a is true, $b otherwise
not $a       true if $a is not true
$a xor $b    true if $a or $b is true, but not both
Logical operators "short-circuit": the second argument is evaluated only if necessary
Example:
open(GRADES, "grades") or die "Can't open file grades\n";
See Wall (p 109-110) for discussion of C-style &&, || and ! operators
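A short sketch of short-circuit evaluation follows. The names expensive, first_true and $calls are invented for the illustration; the counter only exists to show whether the second operand was evaluated.

```perl
use strict;
use warnings;

my $calls = 0;

# Hypothetical function that records each time it is called.
sub expensive {
    $calls++;
    return "computed";
}

# Return the first true argument, or "none" if both are false.
sub first_true {
    my ($first, $second) = @_;
    return $first || $second || "none";
}

my $cached = "cached";
my $empty  = "";

my $x = $cached || expensive();   # left side is true: expensive() is skipped
my $y = $empty  || expensive();   # left side is false: expensive() is called

print "$x $y $calls\n";             # cached computed 1
print first_true(0, "ATG"), "\n";   # ATG
print first_true(0, ""), "\n";      # none
```

Because || and or return the deciding operand rather than just 1 or 0, they are commonly used to supply default values as well as for the open ... or die idiom.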
String Functions: substr
substr EXPR, OFFSET, LENGTH, REPLACEMENT
Return the substring of string EXPR that starts at position OFFSET and has length LENGTH.
Positions are counted from 0. If LENGTH is omitted, the substring runs to the end of EXPR.

$s = "hello, world!";
$x = substr $s, 1, 1;    # $x = "e"
$x = substr $s, 0, 3;    # $x = "hel"
$x = substr $s, 7;       # $x = "world!"

# The substring is replaced by REPLACEMENT if used:
substr($s,0,5,"goodbye");    # $s = "goodbye, world!"

# This does the same thing:
$s = "hello, world!";
substr($s,0,5) = "goodbye";
Note: Predefined Perl functions may be used with or without parentheses around their arguments
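Two further uses of substr are sketched below: a negative OFFSET, which counts from the end of the string, and the assignment form wrapped in a subroutine. The helper names last_chars and replace_at are assumptions made for this example.

```perl
use strict;
use warnings;

# Return the last $n characters of a string.
# A negative OFFSET counts from the end of the string.
sub last_chars {
    my ($string, $n) = @_;
    return substr($string, -$n);
}

# Return a copy of $string with $length characters at $offset replaced.
sub replace_at {
    my ($string, $offset, $length, $replacement) = @_;
    substr($string, $offset, $length) = $replacement;   # assignment form
    return $string;
}

print last_chars("ATGCCGTAA", 3), "\n";                     # TAA
print replace_at("ATGCCGTAA", 3, 3, "GGG"), "\n";           # ATGGGGTAA
print replace_at("hello, world!", 0, 5, "goodbye"), "\n";   # goodbye, world!
```

Since replace_at works on its own copy of the argument, the caller's string is left unchanged.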
# From example7-4.pl

# matching_percentage
#
# Subroutine to calculate the percentage of identical bases in two
# equal length DNA sequences

sub matching_percentage {
    my($string1, $string2) = @_;

    # we assume that the strings have the same length
    my $length = length($string1);
    my $count = 0;

    for (my $position=0; $position < $length ; ++$position) {
        if(substr($string1,$position,1) eq substr($string2,$position,1)) {
            ++$count;
        }
    }
    return $count / $length;
}
#!/usr/bin/perl
use strict;
use warnings;

my $dna = "AAAAATTTTTGGGGGCCCCC";
print_sequence($dna, 8);
exit;

# A subroutine to format and print sequence data
sub print_sequence {
    my($sequence, $length) = @_;

    # Print sequence in lines of $length
    for (my $pos = 0 ; $pos < length($sequence); $pos += $length ) {
        print substr($sequence, $pos, $length), "\n";
    }
}

Output:
AAAAATTT
TTGGGGGC
CCCC
String Functions: index
index STR, SUBSTR, OFFSET
Return the position of the first occurrence of SUBSTR in STR.
If OFFSET is given, skip this many letters before looking.
Returns -1 if SUBSTR is not found.

$dna = "GATGCCATGAAATGC";
$pos = index $dna, "ATG";
print "ATG found at position $pos\n";    # answer: 1

$pos = -1;
while (($pos = index($dna, "ATG", $pos)) > -1) {
    print "ATG found at position $pos\n";
    $pos++;
}
OUTPUT:
ATG found at position 1
ATG found at position 6
ATG found at position 11
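The search loop can be packaged as a subroutine that returns every match position. This is a sketch only; the name find_all and the guard against an empty pattern are additions for this illustration.

```perl
use strict;
use warnings;

# Return the positions of all (possibly overlapping) occurrences
# of $pattern in $string, using index.
sub find_all {
    my ($string, $pattern) = @_;
    return () if $pattern eq "";   # an empty pattern would loop forever
    my @positions;
    my $pos = 0;
    while (($pos = index($string, $pattern, $pos)) > -1) {
        push @positions, $pos;
        $pos++;   # advance one character so overlapping matches are found
    }
    return @positions;
}

print join(",", find_all("GATGCCATGAAATGC", "ATG")), "\n";   # 1,6,11
print join(",", find_all("AAAA", "AA")), "\n";               # 0,1,2
```

Advancing by one character rather than by the pattern length is what allows overlapping matches, as in the second call.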
Random Number Generation
$r = rand; # $r is random between 0 and 1 (0<=$r< 1.0)
$r = rand(100); # random between 0 and 100
$d = int rand(101); # random integer from 0 to 100
srand($seed); # seeds the random number generator
If a program doesn't call srand(), Perl seeds the generator automatically, so the program generates different random numbers each time it is run.
If a program does call srand($seed), we get the same sequence of random numbers each time the program is run with the same value for $seed.
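The effect of seeding can be seen by drawing the same number of values twice from the same seed. A minimal sketch, assuming a helper named random_integers that is not part of the lecture code:

```perl
use strict;
use warnings;

# Seed the generator, then return $n random integers from 0 to 100.
sub random_integers {
    my ($seed, $n) = @_;
    srand($seed);
    return map { int rand(101) } 1 .. $n;
}

my @first  = random_integers(42, 5);
my @second = random_integers(42, 5);

# Both lines show the same five numbers, because the seed is the same.
# The particular numbers depend on the Perl version and platform.
print "@first\n";
print "@second\n";
```

A fixed seed is useful when debugging a simulation, since every run then follows the same sequence of mutations.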
An Example to Generate Random Stories - I
#!/usr/bin/perl -w
# Example 7-1 Children's game, demonstrating primitive artificial intelligence,
# using a random number generator to randomly select parts of sentences.

use strict;
use warnings;

# Declare the variables
my $count;
my $input;
my $number;
my $sentence;
my $story;

# Here are the arrays of parts of sentences:
my @nouns = (
    'Dad',
    'TV',
    'Mom',
    'Groucho',
    'Rebecca',
    'Harpo',
    'Robin Hood',
    'Joe and Moe',
);
An Example to Generate Random Stories - II
my @verbs = (
    'ran to',
    'giggled with',
    'put hot sauce into the orange juice of',
    'exploded',
    'dissolved',
    'sang stupid songs with',
    'jumped with',
);

my @prepositions = (
    'at the store',
    'over the rainbow',
    'just for the fun of it',
    'at the beach',
    'before dinner',
    'in New York City',
    'in a dream',
    'around the world',
);

# Seed the random number generator.
# time|$$ combines the current time with the current process id
# in a somewhat weak attempt to come up with a random seed.
srand(time|$$);

# This do-until loop composes six-sentence "stories"
# until the user types "quit".
do {
    # (Re)set $story to the empty string each time through the loop
    $story = '';

    # Make 6 sentences per story.
    for ($count = 0; $count < 6; $count++) {
An Example to Generate Random Stories - III
        # Notes on the following statements:
        # 1) scalar @array gives the number of elements in the array.
        # 2) rand returns a random number greater than or equal to 0 and
        #    less than scalar(@array).
        # 3) int removes the fractional part of a number.
        # 4) . joins two strings together.
        $sentence = $nouns[int(rand(scalar @nouns))]
                  . " "
                  . $verbs[int(rand(scalar @verbs))]
                  . " "
                  . $nouns[int(rand(scalar @nouns))]
                  . " "
                  . $prepositions[int(rand(scalar @prepositions))]
                  . '. ';

        $story .= $sentence;
    }

    # Print the story.
    print "\n",$story,"\n";

    # Get user input.
    print "\nType \"quit\" to quit, or press Enter to continue: ";

    $input = <STDIN>;

# Exit loop at user's request
} until($input =~ /^\s*q/i);

exit;
Randomly Selecting an Element of an Array
$verbs[int(rand(scalar @verbs))]
$verbs[int(rand(7))]    # Why??
What does rand(7) do?
How about int(rand(7))?
An Alternative Way to Randomly Select the Elements in an Array
$verbs[int rand scalar @verbs]
This will actually work:
$verbs[rand @verbs]    # How does this work?
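One way to see why the short form works: rand expects a scalar argument, so @verbs is evaluated in scalar context and yields the number of elements, and an array subscript keeps only the integer part of a fractional index. The sketch below compares the two forms; pick_explicit and pick_short are names invented for the comparison.

```perl
use strict;
use warnings;

my @verbs = ('ran to', 'giggled with', 'exploded');

# Fully spelled out: element count, random fraction, integer part.
sub pick_explicit {
    my @array = @_;
    return $array[int(rand(scalar @array))];
}

# Short form: rand puts @array in scalar context, and the
# subscript drops the fractional part of the index.
sub pick_short {
    my @array = @_;
    return $array[rand @array];
}

print pick_explicit(@verbs), "\n";   # one of the three verbs
print pick_short(@verbs), "\n";      # one of the three verbs

my @letters = ('A', 'C', 'G', 'T');
print $letters[2.9], "\n";           # G: index 2.9 is treated as 2
```

Both subroutines draw from indices 0 through the last index of the array, so neither can return an undefined element for a non-empty array.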
# From example7-2.pl
# A subroutine to perform a mutation in a string of DNA
#
sub mutate {
    my($dna) = @_;

    my(@nucleotides) = ('A', 'C', 'G', 'T');

    # Pick a random position in the DNA
    my($position) = randomposition($dna);

    # Pick a random nucleotide
    my($newbase) = randomnucleotide(@nucleotides);

    # Insert the random nucleotide into the random position in the DNA
    substr($dna,$position,1,$newbase);

    return $dna;
}
# A subroutine to randomly select a position in a string.
sub randomposition {
    my($string) = @_;

    # rand returns a decimal number between 0 and its argument.
    # int returns the integer portion of a decimal number.
    #
    # The whole expression returns a random number between 0 and
    # length-1
    return int rand length $string;
}

# Select at random one of the four nucleotides
sub randomnucleotide {
    my(@nucleotides) = ('A', 'C', 'G', 'T');
    return randomelement(@nucleotides);
}

# randomly select an element from an array
sub randomelement {
    my(@array) = @_;

    # return $array[int(rand(scalar @array))];
    return $array[rand @array];
}
# Subroutine to perform a mutation in a string of DNA-version 2, in
# which it is guaranteed that one base will change on each call.

sub mutate_better {
    my($dna) = @_;
    my(@nucleotides) = ('A', 'C', 'G', 'T');

    # Pick a random position in the DNA
    my($position) = randomposition($dna);

    # Pick a random nucleotide
    my($newbase);

    do {
        $newbase = randomnucleotide(@nucleotides);

    # Make sure it's different than the nucleotide we're mutating
    } until ( $newbase ne substr($dna, $position,1) );

    # Insert the random nucleotide into the random position in the DNA
    substr($dna,$position,1,$newbase);

    return $dna;
}
Hashes (Associative Arrays)
A hash is a collection of zero or more pairs of scalar values, called keys and values.
Hash variable names begin with a percent sign (%):
%genes = ( "gene1", "ATTCGT", "gene2", "CTGCCATGA");
The values are indexed by the keys. Given a key, the hash returns the corresponding value:
$seq = $genes{"gene2"};    # $seq = "CTGCCATGA"
Note that $genes{"gene2"} is a scalar, so it starts with $.
Hashes
Hashes can be assigned values as a plain list of pairs or with key=>value notation:
%genes = ( "gene1", "ATTCGT", "gene2", "CTGCCATGA");
%genes = ( gene1=>"ATTCGT", gene2=>"CTGCCATGA");
Hash elements can be created or altered by assignment statements:
$genes{"gene1"} = "ATTCGT";
$genes{gene2} = "CTGCCATGA"; # note: no quotes in key
Hashes (Associative Arrays)
%genomes = ( ); # creates an empty hash
# two ways to do the same thing:
%genomes = ( "virus", 31, "bacteria", 89, "plants", 5 );
%genomes = ( virus => 31, bacteria => 89, plants => 5 );
$genomes{mammals} = 2; # adds a new pair to the hash
@genome_list = keys %genomes;
# @genome_list is now, for example, ("plants", "mammals", "bacteria", "virus")
@genome_counts = values %genomes;
# @genome_counts is now (5, 2, 89, 31)
# keys and values are not guaranteed to return the data in the same order
# as it was entered, but they are guaranteed to return the data in the
# same order as each other.
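When a predictable order is needed, the usual approach is to sort the keys. The sketch below shows sorting by key and by value; the subroutine names are made up for this example, and the hash is passed as a flat list to avoid references.

```perl
use strict;
use warnings;

my %genomes = (virus => 31, bacteria => 89, plants => 5, mammals => 2);

# Keys in alphabetical order.
sub keys_by_name {
    my %hash = @_;
    return sort keys %hash;
}

# Keys ordered by their values, largest value first.
sub keys_by_count {
    my %hash = @_;
    return sort { $hash{$b} <=> $hash{$a} } keys %hash;
}

print join(" ", keys_by_name(%genomes)), "\n";    # bacteria mammals plants virus
print join(" ", keys_by_count(%genomes)), "\n";   # bacteria virus plants mammals
```

Sorting costs extra time on large hashes, so it is typically done only where the output order matters, such as in a printed report.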
Hashes
The keys function returns a list of all keys in a hash (in no guaranteed order)
%genes = (gene2=>"CTGCCATGA", gene1=>"ATTCGT");
@key_list = keys(%genes);
print "@key_list\n";    # prints: gene1 gene2 (or gene2 gene1)
# often used to loop through a hash:
foreach $key (@key_list) {
print "The value of $key is $genes{$key}\n";
}
Output:
The value of gene1 is ATTCGT
The value of gene2 is CTGCCATGA
Hashes
# exists($H{$key}) returns TRUE if $key occurs in hash %H

# print all the words in a file with their counts
my ($file) = @ARGV;
open(FH, $file) or die "Can't open file $file: $!\n";
my @lines = <FH>;
close FH;
my %count = ();
for my $line (@lines) {
    for my $word (split " ", $line) {
        if (not exists $count{$word}) {
            # initialize the count for a new word
            $count{$word} = 1;
        }
        else {
            # update the count for an existing word
            $count{$word}++;
        }
    }
}
for my $key (sort keys %count) {
    print "$key $count{$key}\n";
}

What does each line in this program do?
Hashes
% cat testfile2
a text file with lots of words
some words occur once and some
words occur more than once

% wordcount.pl testfile2
a 1
and 1
file 1
lots 1
more 1
occur 2
of 1
once 2
some 2
text 1
than 1
with 1
words 3
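The exists test in the word-count program makes each step explicit, but it is not strictly required: applying ++ to a hash element that does not exist yet creates it and sets it to 1. A shorter sketch of the same counting, with the subroutine name count_words assumed for illustration and the three test lines supplied directly instead of being read from a file:

```perl
use strict;
use warnings;

# Count how often each whitespace-separated word occurs in the given lines.
sub count_words {
    my @lines = @_;
    my %count;
    for my $line (@lines) {
        # ++ on a missing key creates it with the value 1
        $count{$_}++ for split " ", $line;
    }
    return %count;
}

my %count = count_words(
    "a text file with lots of words",
    "some words occur once and some",
    "words occur more than once",
);

for my $word (sort keys %count) {
    print "$word $count{$word}\n";
}
```

The printed counts should match the wordcount.pl output shown above for testfile2.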
Hashes (Associative Arrays)
The each function returns a two-element list containing one key from the hash and its associated value
Subsequent calls to each will return another pair, until all pairs have been returned (at which point an empty list is returned)
while ( ($genome, $count) = each %genomes ) {
    print "$genome $count\n";
}
OUTPUT: (possibly not in this order)
plants 5
virus 31
bacteria 89
mammals 2
More on Hashes (Associative Arrays)
Assigning the return value from values or keys to a scalar gives the number of pairs in a hash:
$genome_count = keys %genomes;      # $genome_count is now 4
$genome_count = values %genomes;    # $genome_count is now 4
The delete function removes a pair from a hash
delete $genomes{bacteria};
$genome_count = keys %genomes;      # $genome_count is now 3
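It can help to distinguish a key that is absent from a key that is present with a false value such as 0. The following sketch uses a hypothetical helper, key_status, and a sample hash chosen for the illustration.

```perl
use strict;
use warnings;

# Describe whether $key is in the hash, and whether its value is true.
sub key_status {
    my ($key, %hash) = @_;
    return "absent" unless exists $hash{$key};
    return $hash{$key} ? "present with a true value"
                       : "present with a false value";
}

my %genomes = (virus => 31, bacteria => 89, plants => 0);

print key_status("virus",  %genomes), "\n";   # present with a true value
print key_status("plants", %genomes), "\n";   # present with a false value
print key_status("fungi",  %genomes), "\n";   # absent

delete $genomes{virus};
print key_status("virus", %genomes), "\n";    # absent
print scalar(keys %genomes), "\n";            # 2
```

Testing if ($genomes{plants}) alone would treat the stored 0 the same as a missing key, which is why exists is the safer check for membership.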
Transcription and Translation
DNA is transcribed to mRNA, mRNA is translated to protein

Start with double stranded DNA
ATTCGAGCATGACATCATCGGTA (sense strand)
TAAGCTCGTACTGTAGTAGCCAT (complement strand)

DNA double helix separates, allowing polymerase to transcribe one strand:
ATTCGAGCATGACATCATCGGTA (sense strand)

AUUCGAGCAUGACAUCAUCGGUA (mRNA)
|||||||||||||||||||||||
TAAGCTCGTACTGTAGTAGCCAT (complement strand)

mRNA codons translated into protein:
AUU CGA GCA UGA CAU CAU CGG … (mRNA)
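In code, the mRNA shown here has the same letters as the sense strand with every T replaced by U, and the complement strand is obtained by swapping A with T and C with G. A minimal sketch using the tr operator; the subroutine names transcribe and complement are chosen for this example.

```perl
use strict;
use warnings;

# Sense-strand DNA to mRNA: replace every T with U.
sub transcribe {
    my ($dna) = @_;
    my $rna = uc $dna;
    $rna =~ tr/T/U/;
    return $rna;
}

# Base-by-base complement of a DNA strand (not reversed).
sub complement {
    my ($dna) = @_;
    my $comp = uc $dna;
    $comp =~ tr/ACGT/TGCA/;
    return $comp;
}

my $sense = "ATTCGAGCATGACATCATCGGTA";
print transcribe($sense), "\n";   # AUUCGAGCAUGACAUCAUCGGUA
print complement($sense), "\n";   # TAAGCTCGTACTGTAGTAGCCAT
```

Both results agree with the strands written out on this slide.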
Reading Frames
A genomic sequence has 6 reading frames, corresponding to the six possible ways of translating the sequence into three-letter codons.
Frame 1 treats each group of three bases as a codon, starting from the first base
Frame 2 starts at the second base
Frame 3 starts at the third base

Start with the sense strand of DNA:
ATTCGAGCATGACATCATCGGTA
Reading frame 1: ATT CGA GCA TGA CAT CAT CGG TA
Reading frame 2: TTC GAG CAT GAC ATC ATC GGT A
Reading frame 3: TCG AGC ATG ACA TCA TCG GTA

Each reading frame can be translated into a different protein sequence.
Frames 4, 5 and 6 are defined in a similar way, but refer to the opposite strand, which is the reverse complement of the first strand.
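The opposite strand used for frames 4, 5 and 6 can be computed by reversing the sequence and complementing each base. A short sketch, assuming uppercase or lowercase A, C, G, T input; the name revcom is a common convention but is an assumption here.

```perl
use strict;
use warnings;

# Return the reverse complement of a DNA string.
sub revcom {
    my ($dna) = @_;
    my $rc = reverse uc $dna;   # reverse the order of the bases
    $rc =~ tr/ACGT/TGCA/;       # then complement each base
    return $rc;
}

print revcom("ATTCGAGCATGACATCATCGGTA"), "\n";   # TACCGATGATGTCATGCTCGAAT
print revcom("AAACCC"), "\n";                    # GGGTTT
```

Applying revcom twice returns the original sequence, which is a quick sanity check.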
Reading Frames
Perl code to process first three reading frames:

$dna = ...;
for (my $r = 1; $r <= 3; $r++) {
    my $frame = substr($dna, ?, ?);    # fill in the blanks!
    # process reading frame $r here
}

# exercise: write a subroutine to return the nth reading frame for a DNA string
Using the Genetic Code
We’ll look at four versions of translating DNA using the genetic code:
Look up the codon using if-then-else
Same as above, but use patterns to reflect redundancy of genetic code
Use a hash to look up each codon
Same as above, but more efficiently
See BeginPerlBioinfo.pm
Version 1

# codon2aa
# A subroutine to translate a DNA 3-character codon to an amino acid

sub codon2aa_v1 {
    my($codon) = @_;

    if    ( $codon =~ /TCA/i ) { return "S" }    # Serine
    elsif ( $codon =~ /TCC/i ) { return "S" }    # Serine
    elsif ( $codon =~ /TCG/i ) { return "S" }    # Serine
    elsif ( $codon =~ /TCT/i ) { return "S" }    # Serine
    elsif ( $codon =~ /TTC/i ) { return "F" }    # Phenylalanine
    elsif ( $codon =~ /TTT/i ) { return "F" }    # Phenylalanine
    elsif ( $codon =~ /TTA/i ) { return "L" }    # Leucine
    elsif ( $codon =~ /TTG/i ) { return "L" }    # Leucine
    elsif ( $codon =~ /TAC/i ) { return "Y" }    # Tyrosine
    elsif ( $codon =~ /TAT/i ) { return "Y" }    # Tyrosine
    elsif ( $codon =~ /TAA/i ) { return "_" }    # Stop
    elsif ( $codon =~ /TAG/i ) { return "_" }    # Stop
    elsif ( $codon =~ /TGC/i ) { return "C" }    # Cysteine
    elsif ( $codon =~ /TGT/i ) { return "C" }    # Cysteine
    elsif ( $codon =~ /TGA/i ) { return "_" }    # Stop
    elsif ( $codon =~ /TGG/i ) { return "W" }    # Tryptophan
    .
    .
    elsif ( $codon =~ /GGT/i ) { return "G" }    # Glycine
    else {
        print STDERR "Bad codon \"$codon\"!!\n";
        exit;
    }
}

Problem: takes longer to look up codons near end of list.
When comparing this code with the translation table on slide 29, remember that each T in these DNA codons appears as U in the table.
Version 2

# codon2aa
# A subroutine to translate a DNA 3-character codon to an amino acid

sub codon2aa_v2 {
    my($codon) = @_;

    if    ( $codon =~ /GC./i)        { return "A" }    # Alanine
    elsif ( $codon =~ /TG[TC]/i)     { return "C" }    # Cysteine
    elsif ( $codon =~ /GA[TC]/i)     { return "D" }    # Aspartic Acid
    elsif ( $codon =~ /GA[AG]/i)     { return "E" }    # Glutamic Acid
    elsif ( $codon =~ /TT[TC]/i)     { return "F" }    # Phenylalanine
    elsif ( $codon =~ /GG./i)        { return "G" }    # Glycine
    elsif ( $codon =~ /CA[TC]/i)     { return "H" }    # Histidine
    elsif ( $codon =~ /AT[TCA]/i)    { return "I" }    # Isoleucine
    elsif ( $codon =~ /AA[AG]/i)     { return "K" }    # Lysine
    elsif ( $codon =~ /TT[AG]|CT./i) { return "L" }    # Leucine
    elsif ( $codon =~ /ATG/i)        { return "M" }    # Methionine
    elsif ( $codon =~ /AA[TC]/i)     { return "N" }    # Asparagine
    elsif ( $codon =~ /CC./i)        { return "P" }    # Proline
    elsif ( $codon =~ /CA[AG]/i)     { return "Q" }    # Glutamine
    elsif ( $codon =~ /CG.|AG[AG]/i) { return "R" }    # Arginine
    elsif ( $codon =~ /TC.|AG[TC]/i) { return "S" }    # Serine
    elsif ( $codon =~ /AC./i)        { return "T" }    # Threonine
    elsif ( $codon =~ /GT./i)        { return "V" }    # Valine
    elsif ( $codon =~ /TGG/i)        { return "W" }    # Tryptophan
    elsif ( $codon =~ /TA[TC]/i)     { return "Y" }    # Tyrosine
    elsif ( $codon =~ /TA[AG]|TGA/i) { return "_" }    # Stop
    else {
        print STDERR "Bad codon \"$codon\"!!\n";
        exit;
    }
}

Still takes longer to look up codons near end of list.
Version 3

# codon2aa
# Translate codon using hash lookup (pp 160-162 and BeginPerlBioinfo.pm)

sub codon2aa_v3 {
    my($codon) = @_;

    $codon = uc $codon;

    my(%genetic_code) = (
        "TCA" => "S",    # Serine
        "TCC" => "S",    # Serine
        "TCG" => "S",    # Serine
        "TCT" => "S",    # Serine
        "TTC" => "F",    # Phenylalanine
        "TTT" => "F",    # Phenylalanine
        "TTA" => "L",    # Leucine
        "TTG" => "L",    # Leucine
        "TAC" => "Y",    # Tyrosine
        "TAT" => "Y",    # Tyrosine
        "TAA" => "_",    # Stop
        "TAG" => "_",    # Stop
        . . .
        "GGT" => "G",    # Glycine
    );

    if(exists $genetic_code{$codon}) {
        return $genetic_code{$codon};
    }
    else {
        die "Bad codon: $codon";
    }
}

Efficiency Problem: the hash is redefined for every codon lookup.
Robustness Problem: should the program die if the codon is unknown?
More Efficient Solution
Version 4

Define the genetic code hash just once (outside subroutine)
Use subroutine to perform lookup

my(%genetic_code) = (
    "TCA" => "S",    # Serine
    "TCC" => "S",    # Serine
    "TCG" => "S",    # Serine
    "TCT" => "S",    # Serine
    "TTC" => "F",    # Phenylalanine
    "TTT" => "F",    # Phenylalanine
    "TTA" => "L",    # Leucine
    "TTG" => "L",    # Leucine
    "TAC" => "Y",    # Tyrosine
    "TAT" => "Y",    # Tyrosine
    "TAA" => "_",    # Stop
    "TAG" => "_",    # Stop
    . . .
    "GGA" => "G",    # Glycine
    "GGC" => "G",    # Glycine
    "GGG" => "G",    # Glycine
    "GGT" => "G",    # Glycine
);

# codon2aa
# Translate a DNA 3-character codon to an amino acid
# using the %genetic_code hash defined once, outside the subroutine

sub codon2aa {
    my($codon) = @_;

    $codon = uc $codon;

    if (exists $genetic_code{$codon}) {
        return $genetic_code{$codon};
    }
    else {
        return "X";
        # alternative error response:
        # die "Bad codon: $codon";
    }
}

The exists function returns true if an element with the given key occurs in the hash.
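To show how a lookup subroutine like this might be used, the sketch below translates a whole DNA string codon by codon. It is an illustration under stated assumptions: the codon table is deliberately partial (only ten codons), and dna2peptide is a name chosen for this example. Unknown codons give "X", following the Version 4 behavior.

```perl
use strict;
use warnings;

# Deliberately partial codon table, for illustration only.
my %genetic_code = (
    ATG => 'M',                                       # Methionine
    TTT => 'F', TTC => 'F',                           # Phenylalanine
    GGA => 'G', GGC => 'G', GGG => 'G', GGT => 'G',   # Glycine
    TAA => '_', TAG => '_', TGA => '_',               # Stop
);

# Look up one codon; codons missing from the table give "X".
sub codon2aa {
    my ($codon) = @_;
    $codon = uc $codon;
    return exists $genetic_code{$codon} ? $genetic_code{$codon} : 'X';
}

# Translate a DNA string in frame 1; leftover bases at the end are ignored.
sub dna2peptide {
    my ($dna) = @_;
    my $protein = '';
    for (my $i = 0; $i + 3 <= length($dna); $i += 3) {
        $protein .= codon2aa(substr($dna, $i, 3));
    }
    return $protein;
}

print dna2peptide("ATGTTTGGATAA"), "\n";   # MFG_
print dna2peptide("atgCCCtga"), "\n";      # MX_ (CCC is not in the partial table)
print dna2peptide("ATGTT"), "\n";          # M   (the trailing TT is ignored)
```

With the complete 64-codon table in place of the partial one, the same loop would translate any reading frame.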
Wrap up
Program #1 was due tonight. Please talk to me privately, either via email or in person, if you did not turn it in.
Program #2 is posted on the course website and will be submitted via Blackboard.
Hints:
Read about BeginPerlBioinfo.pm in chapter 8
Read the code in BeginPerlBioinfo.pm
Start NOW!
Read:
Tisdall Ch. 9
Wall Ch. 5