binf 634 fall 2013 lecture 8 modules and maps1 thanks to john grefenstette for many of these slides...

BINF 634 FALL 2013 LECTURE 8 Modules and Maps 1

Thanks to John Grefenstette for Many of These Slides !!

Topics Midterm Discussions Program 2 Discussions Perl Modules Module Getopt::Std Range Operator Restriction Maps Parsing Rebase File Quiz 4

Midterm Discussions Here are my solutions


Program Discussions Here are my solutions



Perl Modules - I You can put useful subroutines into modules for future

use Only write and debug subroutines once

Creating a module: Put subroutines into a file with extension .pm, e.g. MySubs.pm

Include "1;" as last line in file


Perl Modules - II Using a module:

Include statement:use Mysubs;

note: leave off the .pm from the module name in use statement

Then use any subroutine defined in MySubs.pm Modules are included at compile time

Example: BeginPerlBioinformatics.pm contains subroutines from

Tisdall textbook (see web page) http://oreilly.com/catalog/9780596000806/


Perl Modules

Where does Perl find the modules? Perl looks in the list of directories in built-in array @INC

includes paths to standard modules such as strict.pm and warnings.pm

@INC automatically includes the current directory

You can prepend directories to @INC by using the -I switch% perl -I"/mypath/libdir" prog.pl

You also can tell Perl where to look as follows: use lib "/mypath";use MyFASTASubs; # file is: "/mypath/MyFASTASubs.pm"use MyRandomSubs;

% cat RAND.pm## subroutines for using random numbers#

sub random_uniform { my ($lower, $upper) = @_; return $lower + rand ($upper - $lower);};

sub random_int { my ($lower, $upper) = @_; return int($lower + rand ($upper - $lower + 1));}

sub shuffle_array { my (@a) = @_; my $length = scalar @a; for (my $i = 0; $i < $length; $i++) { my $j = random_int($i, $length-1); ($a[$i], $a[$j]) = ($a[$j], $a[$i]); } return @a; }

sub shuffle_string { my ($s) = @_; my @a = split "", $s; return join "", shuffle_array(@a);}

1;7BINF 634 FALL 2013 LECTURE 8 Modules and Maps

#!/usr/bin/perluse strict;use warnings;# file: test_myrand.pluse lib 'C:\Documents and Settings\Owner\workspace\binf634_book_examples';use RAND;my $dna = "AAAAAAAATTTTTTTTTTGGGGGGGGCCCCCCCCC";print "$dna\n"; $dna = shuffle_string($dna);print "$dna\n";

% test_myrand.plAAAAAAAATTTTTTTTTTGGGGGGGGCCCCCCCCCGCTACGCAGAGGTTAGTCATCTCGACATACTGCTT

% test_myrand.plAAAAAAAATTTTTTTTTTGGGGGGGGCCCCCCCCCTCTTCGGAACCTACGACGTTCTATACAATGCTGGG

8BINF 634 FALL 2013 LECTURE 8 Modules and Maps


use Getopt::Std;Suppose prog.pl contains the statements: use Getopt::Std;my %opts = (); # a hash to hold the optionsgetopts('bnf:', \%opts);

# -b & -n are boolean flags# -f takes an argument (indicated by ":")

Hash keys will be x (where x is the switch name) with key values the value of the argument or 1 if no argument is specified.The following set %opts = (n=>1, f=>"infile")% prog.pl -f infile -n% prog.pl -nf infile% prog.pl -n -f infile

#!/usr/bin/perluse strict;# file: opt.pluse Getopt::Std;my %options=(); # empty hash of options

# "d:" means d takes an argumentgetopts("od:fF",\%options);

print "-o $options{o}\n" if defined $options{o};print "-d $options{d}\n" if defined $options{d};print "-f $options{f}\n" if defined $options{f};print "-F $options{F}\n" if defined $options{F};

print "Unprocessed by Getopt::Std:\n" if @ARGV;foreach (@ARGV) { print "$_\n";}

% opt.pl -Fd5 -o 123 foo-o 1-d 5-F 1Unprocessed by Getopt::Std:123 foo


processing of options stopped whenit saw 123


Restriction Maps (Tisdall Ch. 9) Restriction Enzymes are “chemical scissors” that cut DNA

in specific places specified by short sequence patterns HindII binds and cuts at GTY^RAC Y = C or T (pyrimidines) R = A or G (purines) ^ indicates cleave site

A Restriction Map of a DNA sequence shows all the positions at which a given Restriction Enzyme will cut

Several hundred restriction enzymes are known Database REBASE: http://www.neb.com

REBASE version 104 bionet.104 =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= REBASE, The Restriction Enzyme Database http://rebase.neb.com Copyright (c) Dr. Richard J. Roberts, 2001. All rights reserved. =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= Rich Roberts Mar 30 2001 AaaI (XmaIII) C^GGCCGAacI (BamHI) GGATCCAaeI (BamHI) GGATCCAagI (ClaI) AT^CGATAaqI (ApaLI) GTGCACAarI CACCTGCNNNN^AarI ^NNNNNNNNGCAGGTGAatI (StuI) AGG^CCTAatII GACGT^CAauI (Bsp1407I) T^GTACAAbaI (BclI) T^GATCAAbeI (BbvCI) CC^TCAGCAbeI (BbvCI) GC^TGAGGAbrI (XhoI) C^TCGAGAcaI (AsuII) TTCGAAAcaII (BamHI) GGATCCAcaIII (MstI) TGCGCAAcaIV (HaeIII) GGCCAccI GT^MKAC

# The IUB ambiguity codesR = G or A Y = C or T M = A or CK = G or TS = G or C W = A or T B = not A (C or G or T)D = not C (A or G or T) H = not G (A or C or T)V = not T (A or C or G) N = A or C or G or T



Restriction MapsProblem: Given a DNA sequence and the name of a restriction enzyme in

REBASE Find all location of the restriction sites on the DNA

Pseudocode:read in rebase data from file “bionet”create hash with key = enzyme name and value = regular expressionfor each user query (in the form of an enzyme name)

if query is defined in the hashget positions that match regular expr in DNA

report positions, if any}


Range Operatorstart .. end

In list context, returns a slice of an array:

print "@A[10 .. 20]\n"; # print items 10 thru 20

In a scalar context, return true if $. (line number) is currently between start and end values:

# print out first 20 lines of file:

while (<FILEHANDLE>) {

print if (1 .. 20);

}

# print out up to first line containing "foo":

while (<FILEHANDLE>) {

print if (1 .. /foo/);

}


Finding matches with regular expressions

Reminder: If we use the global modifier g, then pos($string) returns position after the match:

$string = "ATCGCATGGAA";

$string =~ /T.G/g;

print "$& ends at position ", pos($string)-1, "\n";

$string =~ /T.G/g;

print "$& ends at position ", pos($string)-1, "\n";

Output:

TCG ends at position 3

TGG ends at position 8

# Example 9-2 Subroutine to parse a REBASE datafile# parseREBASE-Parse REBASE bionet file## A subroutine to return a hash where# key = restriction enzyme name# value = blank-separated recognition site and regular expression

sub parseREBASE { my($rebasefile) = @_;

use BeginPerlBioinfo; # see Chapter 6 about this module

# Declare variables my @rebasefile = ( ); my %rebase_hash = ( ); my $name; my $site; my $regexp;

# Read in the REBASE file # note: incorrect on p. 191 my $rebase_filehandle = open_file($rebasefile);


while(<$rebase_filehandle>) { # note: incorrect on p. 191 # Discard header lines next if ( 1 .. /Rich Roberts/ );

# Discard blank lines next if /^\s*$/;

# Split the fields my @fields = split( " ", $_);

# Get and store the name and the recognition site # grab just the first and last fields $name = shift @fields;

$site = pop @fields;

# Translate the recognition sites to regular expressions $regexp = IUB_to_regexp($site);

# Store the data into the hash $rebase_hash{$name} = "$site $regexp"; }

# Return the hash containing the reformatted REBASE data return %rebase_hash;}


############################################################## Subroutine IUB_to_regexp## Translate IUB ambiguity codes to regular expressions## The IUB ambiguity codes# R = G or A Y = C or T M = A or C K = G or T# S = G or C W = A or T B = not A (C or G or T)# D = not C (A or G or T) H = not G (A or C or T)# V = not T (A or C or G) N = A or C or G or T

sub IUB_to_regexp { my($iub) = @_; my $regular_expression = ''; my %iub2character_class = ( A => 'A', C => 'C', G => 'G', T => 'T', R => '[GA]', Y => '[CT]', M => '[AC]', K => '[GT]', S => '[GC]', W => '[AT]', B => '[CGT]', D => '[AGT]', H => '[ACT]', V => '[ACG]', N => '[ACGT]', );

# Remove the ^ signs $iub =~ s/\^//g;

# Translate the iub sequence for (my $i = 0; $i < length($iub); ++$i ){ $regular_expression .= $iub2character_class{substr($iub,$i,1)}; }

return $regular_expression;} 18BINF 634 FALL 2013 LECTURE 8 Modules and Maps

Sample Main Program to test parseREBASE subroutine:

#!/usr/bin/perluse warnings;use strict;my %rebasehash = ();%rebasehash = parseREBASE("bionet");for my $key (sort keys %rebasehash) {

my ($site, $regexp) = split " ", $rebasehash{$key};print "enzyme = $key site = $site regexp = $regexp\n";

}

sub parseREBASE { ...}

Output:enzyme = AaaI site = C^GGCCG regexp = CGGCCGenzyme = AacI site = GGATCC regexp = GGATCCenzyme = AaeI site = GGATCC regexp = GGATCCenzyme = AagI site = AT^CGAT regexp = ATCGATenzyme = AaqI site = GTGCAC regexp = GTGCACenzyme = AarI site = ^NNNNNNNNGCAGGTG regexp =[ACGT][ACGT][ACGT][ACGT][ACGT][ACGT][ACGT][ACGT]GCAGGTG



Restriction MapsProblem: Given a DNA sequence and the name of a restriction

enzyme in REBASE Find all location of the restriction sites on the DNA


if query is defined in the hashget positions that match regular expr in DNAreport positions, if any

}

###################################################################### Subroutine match_positions# Find locations of a match of a regular expression in a string# Return an array of positions where the regular expression# appears in the string

sub match_positions { my($regexp, $sequence) = @_; my @positions = ( );

# Determine positions of regular expression matches while ( $sequence =~ /$regexp/ig ) { push @positions, pos($sequence) - length($&) + 1; } return @positions;}

# sample main program:

my $dna = “ATTGATAAGTCA”;

my @pos = match_positions(“T[GC]A”, $dna);

print “@pos\n”;

exit;

Output: 3 10


N.B. – pos returns the position relative to 0 immediately after the match. So the match gives us the position at the start of the match relative to 1.


Restriction MapsProblem: Given a DNA sequence and the name of a restriction

enzyme in REBASE Find all location of the restriction sites on the DNA


if query is defined in the hashget positions that match regular expr in

DNAreport positions, if any

}

#!/usr/bin/perl -w# Make restriction map from user queries on names of # restriction enzymes

use strict;use warnings;use BeginPerlBioinfo; # see Chapter 6 about this module

# Declare and initialize variablesmy %rebase_hash = ( );my @file_data = ( );my $query = '';my $dna = '';my $recognition_site = '';my $regexp = '';my @locations = ( );

# Read in the file "sample.dna"@file_data = get_file_data("sample.dna");

# Extract the DNA sequence data from the file "sample.dna"$dna = extract_sequence_from_fasta_data(@file_data);

# Get the REBASE data into a hash, from file "bionet"%rebase_hash = parseREBASE("bionet");


# Prompt user for restriction enzyme names, create restriction mapdo { print "Search for what restriction site (or quit)?: "; $query = <STDIN>; chomp $query;

# Exit if empty query if ($query =~ /^\s*$/ ) { exit;}

# Perform the search in the DNA sequence if ( exists $rebase_hash{$query} ) { ($recognition_site, $regexp) = split " ", $rebase_hash{$query};

# Create the restriction map @locations = match_positions($regexp, $dna);

# Report the restriction map to the user if (@locations) { print "Searching for $query $recognition_site $regexp\n"; print "A restriction site for $query at locations:\n"; print join(" ", @locations), "\n"; } else { print "A restriction enzyme $query is not in the DNA:\n"; } } print "\n";} until ( $query =~ /quit/ );exit; 24BINF 634 FALL 2013 LECTURE 8 Modules and Maps


% rebase.plSearch for what restriction site (or quit)?: AceISearching for AceI G^CWGC GC[AT]GCA restriction site for AceI at locations:54 94 582 660 696 702 840 855 957

Search for what restriction site (or quit)?: AceIIA restriction enzyme AceII is not in the DNA:

Search for what restriction site (or quit)?: Acc36ISearching for Acc36I ^NNNNNNNNGCAGGT [ACGT][ACGT][ACGT][ACGT][ACGT][ACGT][ACGT][ACGT]GCAGGTA restriction site for Acc36I at locations:166

Search for what restriction site for (or quit)?: quit

%

The Program in Action!!


HW Read Tisdall Appendix Chapter 9 and

Appendix B