binf 634 fall 2013 lecture 8 modules and maps1 thanks to john grefenstette for many of these slides...
TRANSCRIPT
BINF 634 FALL 2013 LECTURE 8 Modules and Maps 1
Thanks to John Grefenstette for Many of These Slides !!
Topics Midterm Discussions Program 2 Discussions Perl Modules Module Getopt::Std Range Operator Restriction Maps Parsing Rebase File Quiz 4
BINF 634 FALL 2013 LECTURE 8 Modules and Maps 4
Perl Modules - I You can put useful subroutines into modules for future
use Only write and debug subroutines once
Creating a module: Put subroutines into a file with extension .pm, e.g. MySubs.pm
Include "1;" as last line in file
BINF 634 FALL 2013 LECTURE 8 Modules and Maps 5
Perl Modules - II Using a module:
Include statement:use Mysubs;
note: leave off the .pm from the module name in use statement
Then use any subroutine defined in MySubs.pm Modules are included at compile time
Example: BeginPerlBioinformatics.pm contains subroutines from
Tisdall textbook (see web page) http://oreilly.com/catalog/9780596000806/
BINF 634 FALL 2013 LECTURE 8 Modules and Maps 6
Perl Modules
Where does Perl find the modules? Perl looks in the list of directories in built-in array @INC
includes paths to standard modules such as strict.pm and warnings.pm
@INC automatically includes the current directory
You can prepend directories to @INC by using the -I switch% perl -I"/mypath/libdir" prog.pl
You also can tell Perl where to look as follows: use lib "/mypath";use MyFASTASubs; # file is: "/mypath/MyFASTASubs.pm"use MyRandomSubs;
% cat RAND.pm## subroutines for using random numbers#
sub random_uniform { my ($lower, $upper) = @_; return $lower + rand ($upper - $lower);};
sub random_int { my ($lower, $upper) = @_; return int($lower + rand ($upper - $lower + 1));}
sub shuffle_array { my (@a) = @_; my $length = scalar @a; for (my $i = 0; $i < $length; $i++) { my $j = random_int($i, $length-1); ($a[$i], $a[$j]) = ($a[$j], $a[$i]); } return @a; }
sub shuffle_string { my ($s) = @_; my @a = split "", $s; return join "", shuffle_array(@a);}
1;7BINF 634 FALL 2013 LECTURE 8 Modules and Maps
#!/usr/bin/perluse strict;use warnings;# file: test_myrand.pluse lib 'C:\Documents and Settings\Owner\workspace\binf634_book_examples';use RAND;my $dna = "AAAAAAAATTTTTTTTTTGGGGGGGGCCCCCCCCC";print "$dna\n"; $dna = shuffle_string($dna);print "$dna\n";
% test_myrand.plAAAAAAAATTTTTTTTTTGGGGGGGGCCCCCCCCCGCTACGCAGAGGTTAGTCATCTCGACATACTGCTT
% test_myrand.plAAAAAAAATTTTTTTTTTGGGGGGGGCCCCCCCCCTCTTCGGAACCTACGACGTTCTATACAATGCTGGG
8BINF 634 FALL 2013 LECTURE 8 Modules and Maps
BINF 634 FALL 2013 LECTURE 8 Modules and Maps 9
use Getopt::Std;Suppose prog.pl contains the statements: use Getopt::Std;my %opts = (); # a hash to hold the optionsgetopts('bnf:', \%opts);
# -b & -n are boolean flags# -f takes an argument (indicated by ":")
Hash keys will be x (where x is the switch name) with key values the value of the argument or 1 if no argument is specified.The following set %opts = (n=>1, f=>"infile")% prog.pl -f infile -n% prog.pl -nf infile% prog.pl -n -f infile
#!/usr/bin/perluse strict;# file: opt.pluse Getopt::Std;my %options=(); # empty hash of options
# "d:" means d takes an argumentgetopts("od:fF",\%options);
print "-o $options{o}\n" if defined $options{o};print "-d $options{d}\n" if defined $options{d};print "-f $options{f}\n" if defined $options{f};print "-F $options{F}\n" if defined $options{F};
print "Unprocessed by Getopt::Std:\n" if @ARGV;foreach (@ARGV) { print "$_\n";}
% opt.pl -Fd5 -o 123 foo-o 1-d 5-F 1Unprocessed by Getopt::Std:123 foo
10BINF 634 FALL 2013 LECTURE 8 Modules and Maps
processing of options stopped whenit saw 123
BINF 634 FALL 2013 LECTURE 8 Modules and Maps 11
Restriction Maps (Tisdall Ch. 9) Restriction Enzymes are “chemical scissors” that cut DNA
in specific places specified by short sequence patterns HindII binds and cuts at GTY^RAC Y = C or T (pyrimidines) R = A or G (purines) ^ indicates cleave site
A Restriction Map of a DNA sequence shows all the positions at which a given Restriction Enzyme will cut
Several hundred restriction enzymes are known Database REBASE: http://www.neb.com
REBASE version 104 bionet.104 =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= REBASE, The Restriction Enzyme Database http://rebase.neb.com Copyright (c) Dr. Richard J. Roberts, 2001. All rights reserved. =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= Rich Roberts Mar 30 2001 AaaI (XmaIII) C^GGCCGAacI (BamHI) GGATCCAaeI (BamHI) GGATCCAagI (ClaI) AT^CGATAaqI (ApaLI) GTGCACAarI CACCTGCNNNN^AarI ^NNNNNNNNGCAGGTGAatI (StuI) AGG^CCTAatII GACGT^CAauI (Bsp1407I) T^GTACAAbaI (BclI) T^GATCAAbeI (BbvCI) CC^TCAGCAbeI (BbvCI) GC^TGAGGAbrI (XhoI) C^TCGAGAcaI (AsuII) TTCGAAAcaII (BamHI) GGATCCAcaIII (MstI) TGCGCAAcaIV (HaeIII) GGCCAccI GT^MKAC
# The IUB ambiguity codesR = G or A Y = C or T M = A or CK = G or TS = G or C W = A or T B = not A (C or G or T)D = not C (A or G or T) H = not G (A or C or T)V = not T (A or C or G) N = A or C or G or T
12BINF 634 FALL 2013 LECTURE 8 Modules and Maps
BINF 634 FALL 2013 LECTURE 8 Modules and Maps 13
Restriction MapsProblem: Given a DNA sequence and the name of a restriction enzyme in
REBASE Find all location of the restriction sites on the DNA
Pseudocode:read in rebase data from file “bionet”create hash with key = enzyme name and value = regular expressionfor each user query (in the form of an enzyme name)
if query is defined in the hashget positions that match regular expr in DNA
report positions, if any}
BINF 634 FALL 2013 LECTURE 8 Modules and Maps 14
Range Operatorstart .. end
In list context, returns a slice of an array:
print "@A[10 .. 20]\n"; # print items 10 thru 20
In a scalar context, return true if $. (line number) is currently between start and end values:
# print out first 20 lines of file:
while (<FILEHANDLE>) {
print if (1 .. 20);
}
# print out up to first line containing "foo":
while (<FILEHANDLE>) {
print if (1 .. /foo/);
}
BINF 634 FALL 2013 LECTURE 8 Modules and Maps 15
Finding matches with regular expressions
Reminder: If we use the global modifier g, then pos($string) returns position after the match:
$string = "ATCGCATGGAA";
$string =~ /T.G/g;
print "$& ends at position ", pos($string)-1, "\n";
$string =~ /T.G/g;
print "$& ends at position ", pos($string)-1, "\n";
Output:
TCG ends at position 3
TGG ends at position 8
# Example 9-2 Subroutine to parse a REBASE datafile# parseREBASE-Parse REBASE bionet file## A subroutine to return a hash where# key = restriction enzyme name# value = blank-separated recognition site and regular expression
sub parseREBASE { my($rebasefile) = @_;
use BeginPerlBioinfo; # see Chapter 6 about this module
# Declare variables my @rebasefile = ( ); my %rebase_hash = ( ); my $name; my $site; my $regexp;
# Read in the REBASE file # note: incorrect on p. 191 my $rebase_filehandle = open_file($rebasefile);
16BINF 634 FALL 2013 LECTURE 8 Modules and Maps
while(<$rebase_filehandle>) { # note: incorrect on p. 191 # Discard header lines next if ( 1 .. /Rich Roberts/ );
# Discard blank lines next if /^\s*$/;
# Split the fields my @fields = split( " ", $_);
# Get and store the name and the recognition site # grab just the first and last fields $name = shift @fields;
$site = pop @fields;
# Translate the recognition sites to regular expressions $regexp = IUB_to_regexp($site);
# Store the data into the hash $rebase_hash{$name} = "$site $regexp"; }
# Return the hash containing the reformatted REBASE data return %rebase_hash;}
17BINF 634 FALL 2013 LECTURE 8 Modules and Maps
############################################################## Subroutine IUB_to_regexp## Translate IUB ambiguity codes to regular expressions## The IUB ambiguity codes# R = G or A Y = C or T M = A or C K = G or T# S = G or C W = A or T B = not A (C or G or T)# D = not C (A or G or T) H = not G (A or C or T)# V = not T (A or C or G) N = A or C or G or T
sub IUB_to_regexp { my($iub) = @_; my $regular_expression = ''; my %iub2character_class = ( A => 'A', C => 'C', G => 'G', T => 'T', R => '[GA]', Y => '[CT]', M => '[AC]', K => '[GT]', S => '[GC]', W => '[AT]', B => '[CGT]', D => '[AGT]', H => '[ACT]', V => '[ACG]', N => '[ACGT]', );
# Remove the ^ signs $iub =~ s/\^//g;
# Translate the iub sequence for (my $i = 0; $i < length($iub); ++$i ){ $regular_expression .= $iub2character_class{substr($iub,$i,1)}; }
return $regular_expression;} 18BINF 634 FALL 2013 LECTURE 8 Modules and Maps
Sample Main Program to test parseREBASE subroutine:
#!/usr/bin/perluse warnings;use strict;my %rebasehash = ();%rebasehash = parseREBASE("bionet");for my $key (sort keys %rebasehash) {
my ($site, $regexp) = split " ", $rebasehash{$key};print "enzyme = $key site = $site regexp = $regexp\n";
}
sub parseREBASE { ...}
Output:enzyme = AaaI site = C^GGCCG regexp = CGGCCGenzyme = AacI site = GGATCC regexp = GGATCCenzyme = AaeI site = GGATCC regexp = GGATCCenzyme = AagI site = AT^CGAT regexp = ATCGATenzyme = AaqI site = GTGCAC regexp = GTGCACenzyme = AarI site = ^NNNNNNNNGCAGGTG regexp =[ACGT][ACGT][ACGT][ACGT][ACGT][ACGT][ACGT][ACGT]GCAGGTG
19BINF 634 FALL 2013 LECTURE 8 Modules and Maps
BINF 634 FALL 2013 LECTURE 8 Modules and Maps 20
Restriction MapsProblem: Given a DNA sequence and the name of a restriction
enzyme in REBASE Find all location of the restriction sites on the DNA
Pseudocode:read in rebase data from file “bionet”create hash with key = enzyme name and value = regular expressionfor each user query (in the form of an enzyme name)
if query is defined in the hashget positions that match regular expr in DNAreport positions, if any
}
###################################################################### Subroutine match_positions# Find locations of a match of a regular expression in a string# Return an array of positions where the regular expression# appears in the string
sub match_positions { my($regexp, $sequence) = @_; my @positions = ( );
# Determine positions of regular expression matches while ( $sequence =~ /$regexp/ig ) { push @positions, pos($sequence) - length($&) + 1; } return @positions;}
# sample main program:
my $dna = “ATTGATAAGTCA”;
my @pos = match_positions(“T[GC]A”, $dna);
print “@pos\n”;
exit;
Output: 3 10
21BINF 634 FALL 2013 LECTURE 8 Modules and Maps
N.B. – pos returns the position relative to 0 immediately after the match. So the match gives us the position at the start of the match relative to 1.
BINF 634 FALL 2013 LECTURE 8 Modules and Maps 22
Restriction MapsProblem: Given a DNA sequence and the name of a restriction
enzyme in REBASE Find all location of the restriction sites on the DNA
Pseudocode:read in rebase data from file “bionet”create hash with key = enzyme name and value = regular expressionfor each user query (in the form of an enzyme name)
if query is defined in the hashget positions that match regular expr in
DNAreport positions, if any
}
#!/usr/bin/perl -w# Make restriction map from user queries on names of # restriction enzymes
use strict;use warnings;use BeginPerlBioinfo; # see Chapter 6 about this module
# Declare and initialize variablesmy %rebase_hash = ( );my @file_data = ( );my $query = '';my $dna = '';my $recognition_site = '';my $regexp = '';my @locations = ( );
# Read in the file "sample.dna"@file_data = get_file_data("sample.dna");
# Extract the DNA sequence data from the file "sample.dna"$dna = extract_sequence_from_fasta_data(@file_data);
# Get the REBASE data into a hash, from file "bionet"%rebase_hash = parseREBASE("bionet");
23BINF 634 FALL 2013 LECTURE 8 Modules and Maps
# Prompt user for restriction enzyme names, create restriction mapdo { print "Search for what restriction site (or quit)?: "; $query = <STDIN>; chomp $query;
# Exit if empty query if ($query =~ /^\s*$/ ) { exit;}
# Perform the search in the DNA sequence if ( exists $rebase_hash{$query} ) { ($recognition_site, $regexp) = split " ", $rebase_hash{$query};
# Create the restriction map @locations = match_positions($regexp, $dna);
# Report the restriction map to the user if (@locations) { print "Searching for $query $recognition_site $regexp\n"; print "A restriction site for $query at locations:\n"; print join(" ", @locations), "\n"; } else { print "A restriction enzyme $query is not in the DNA:\n"; } } print "\n";} until ( $query =~ /quit/ );exit; 24BINF 634 FALL 2013 LECTURE 8 Modules and Maps
BINF 634 FALL 2013 LECTURE 8 Modules and Maps 25
% rebase.plSearch for what restriction site (or quit)?: AceISearching for AceI G^CWGC GC[AT]GCA restriction site for AceI at locations:54 94 582 660 696 702 840 855 957
Search for what restriction site (or quit)?: AceIIA restriction enzyme AceII is not in the DNA:
Search for what restriction site (or quit)?: Acc36ISearching for Acc36I ^NNNNNNNNGCAGGT [ACGT][ACGT][ACGT][ACGT][ACGT][ACGT][ACGT][ACGT]GCAGGTA restriction site for Acc36I at locations:166
Search for what restriction site for (or quit)?: quit
%
The Program in Action!!