
Upload: janice-mcdowell

Post on 18-Jan-2016



8.1 Common Errors – Exercise #3

• Making assumptions about the variable part of the input file.

When parsing a formatted file (GenBank, FASTA or any other format), rely only on the format itself and not on the variable part of the input. Parsing by features such as these is therefore wrong:

– Assuming each line in the title will start with a lowercase letter
– Assuming the title will be composed of only 2 lines

It is legitimate to rely on the presence of the words ‘TITLE’ and ‘JOURNAL’ for the parsing, as these are part of the format.

• Reading the whole file at once (@all_lines = <$fh>;).

This is risky when the file is very large, because every line is held in memory at the same time. When we do not need all the lines at once, we read one line at a time with $line = <$fh> in a ‘while’ loop.

• Performing an action on a variable without checking if it is defined.

For example, <$fh> returns an undefined value at the end of the file; operating on it produces "Use of uninitialized value" warnings (when warnings are enabled) and can give wrong results.

• Use of functions/features not taught in class.
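The points about line-by-line reading and checking for defined values can be combined in one loop. The following is only a minimal sketch: the helper name count_title_lines and the in-memory sample record are assumptions made for illustration and do not come from the exercise. It relies only on the position of the TITLE keyword, which is part of the format.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the slides): counts the lines whose keyword
# field is TITLE, reading one line at a time instead of the whole file.
sub count_title_lines {
    my ($fh) = @_;
    my $count = 0;
    my $line = <$fh>;
    while (defined $line) {      # <$fh> returns undef at the end of the file
        chomp $line;
        # check the length first, so substr stays inside short lines
        if (length($line) >= 7 && substr($line, 2, 5) eq "TITLE") {
            $count++;
        }
        $line = <$fh>;
    }
    return $count;
}

# Made-up sample record, opened as an in-memory file so the sketch
# runs without an input file.
my $text = "LOCUS       AAA12345\n"
         . "  TITLE     first title\n"
         . "  JOURNAL   J. Mol. Biol.\n"
         . "  TITLE     second title\n";
open(my $in, "<", \$text) or die "cannot open in-memory file: $!";
print count_title_lines($in), "\n";    # prints 2
close($in);
```

Only one line is held in memory at any time, so the same loop works for files of any size.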


8.2

Solution to HW3 Q#6

• For each protein record print the first line (the LOCUS line) followed by a sorted list of its reference TITLEs.

1. Read the file

2. If reached a LOCUS line, print it

3. If reached a TITLE line, start an inner loop until reaching the JOURNAL line (to take the full title)

4. Push the entire TITLE to the titles array

5. If reached a FEATURES line, print the titles array and initialize it

...

...


8.3

my $line = <$in>;   # read input lines
while (defined $line) {
    chomp($line);

    # if reached LOCUS line print it
    if (substr($line,0,5) eq "LOCUS") {
        print "\n$line\n";
    }

    # if reached TITLE start an inner loop until reaching the JOURNAL line
    if ( (length($line) > 7) && (substr($line,2,5) eq "TITLE") ) {
        while ((defined $line) && (substr($line,2,7) ne "JOURNAL")) {
            chomp $line;
            $title = $title.substr($line,12);   # concatenate the title line
            $line = <$in>;
        }
        push(@titleArray,$title);   # push entire title to title array
        $title = "";
    }

    # if reached FEATURES line - sort and print titles array
    # (the inner loop may have reached the end of the file, so check defined first)
    if ((defined $line) && (substr($line,0,8) eq "FEATURES")) {
        @titleArray = sort(@titleArray);
        foreach $title (@titleArray) {
            print "$title\n";
        }
        @titleArray = ();   # empty title array
    }

    $line = <$in>;
}


8.4

Hashes (associative arrays)


8.5

Hash Motivation

Let's say we want to create a phone book...

Enter a name that will be added to the phone book:
Dudi
Enter a phone number:
6409245
Enter a name that will be added to the phone book:
Dudu
Enter a phone number:
6407693


8.6

Hash – an associative array

An associative array (or simply, a hash) is an unordered set of pairs of keys and values. Each key is associated with a value.

A hash variable name always starts with a “%”:

my %hash;

Initialization:

%hash = ("a"=>5, "bob"=>"zzz", 50=>"John");

%hash
"a"   => 5
"bob" => "zzz"
50    => "John"

Accessing:

You can access a value by its key:

print $hash{50};     # prints: John

Tip: you can reset the hash (to an empty one) with %hash = ();

Note: a key in a hash is interpreted as a string. These are equivalent:

50 => "John"        "50" => "John"
$hash{50}           $hash{"50"}
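The initialization, access and reset steps above can be tried in a short script. This is only an illustrative sketch; the variable $pairs is an assumption added to show the number of key-value pairs and is not part of the slides.

```perl
use strict;
use warnings;

my %hash = ("a" => 5, "bob" => "zzz", 50 => "John");

print $hash{50}, "\n";      # John
print $hash{"50"}, "\n";    # John (the same key: keys are stored as strings)
print $hash{bob}, "\n";     # zzz

# In scalar context, keys() gives the number of key-value pairs
my $pairs = keys(%hash);
print "pairs: $pairs\n";    # pairs: 3

%hash = ();                 # reset to an empty hash
$pairs = keys(%hash);
print "pairs after reset: $pairs\n";    # pairs after reset: 0
```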


8.7

Hash – an associative array

Starting from:

%hash
"a"   => 5
"bob" => "zzz"
50    => "John"

Modifying:

$hash{bob} = "aaa";     (modifying an existing value)

%hash
"a"   => 5
"bob" => "aaa"
50    => "John"

Adding:

$hash{555} = "z";     (adding a new key-value pair)

%hash
"a"   => 5
"bob" => "aaa"
50    => "John"
555   => "z"

You can ask whether a certain key exists in a hash:

if (exists $hash{50} )...

You can delete a certain key-value pair in a hash:

delete($hash{50});

%hash
"a"   => 5
"bob" => "aaa"
555   => "z"

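The modify, add, exists and delete operations can be run in sequence on the example hash. The sketch below is illustrative only; the key "empty" is a made-up addition that shows the difference between a key that exists and a value that is defined.

```perl
use strict;
use warnings;

my %hash = ("a" => 5, "bob" => "zzz", 50 => "John");

$hash{bob} = "aaa";     # modify an existing value
$hash{555} = "z";       # add a new key-value pair

if (exists $hash{50}) {
    print "key 50 exists\n";
}
delete($hash{50});
if (!exists $hash{50}) {
    print "key 50 was deleted\n";
}

# exists tests the key, defined tests the value:
# a key can exist while its value is undefined.
$hash{empty} = undef;   # hypothetical key, for illustration only
if (exists $hash{empty} && !defined $hash{empty}) {
    print "key 'empty' exists but its value is not defined\n";
}
```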

8.8 Variable types in Perl

Scalar:   $number    -3.54
          $string    "hi\n"

Array:    @array     one element: $array[0]

Hash:     %hash      key => value pairs; one value: $hash{key}


8.9

Hash – an associative array

An associative array for the phone book suggested in the first slide (we will see a more elaborate version later on):

Declare

my %phoneBook;

Updating

$phoneBook{"Dudi"} = 9245;
$phoneBook{"Dudu"} = 7693;

Fetching

print $phoneBook{"Dudi"};

%phoneBook
"Dudi" => 9245
"Dudu" => 7693


8.10

Iterating over hash elements

It is possible to get a list of all the keys in %hash:

my @hashKeys = keys(%hash);

Similarly, you can get an array of the values in %hash:

my @hashVals = values(%hash);

%hash
"a"   => 5
"bob" => "zzz"
50    => "John"

@hashKeys
"bob"  "a"  50

@hashVals
"zzz"  5  "John"
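The two calls can be checked against each other. As long as the hash is not modified in between, keys and values return their elements in the same order, so the element at a given position in one array belongs to the element at the same position in the other. A minimal sketch (the variables $i and $consistent are assumptions added for the check):

```perl
use strict;
use warnings;

my %hash = ("a" => 5, "bob" => "zzz", 50 => "John");

my @hashKeys = keys(%hash);
my @hashVals = values(%hash);

# Walk both arrays by position and compare each value
# with the value stored in the hash under the matching key.
my $i = 0;
my $consistent = 1;
while ($i < scalar(@hashKeys)) {
    if ($hash{$hashKeys[$i]} ne $hashVals[$i]) {
        $consistent = 0;
    }
    print "$hashKeys[$i] => $hashVals[$i]\n";   # the order of the lines is arbitrary
    $i++;
}
print "number of keys: ", scalar(@hashKeys), "\n";   # number of keys: 3
```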


8.11

Iterating over hash elements

To iterate over all the values in %hash:

my @hashVals = values(%hash);
foreach my $value (@hashVals)...

To iterate over the keys in %hash:

my @hashKeys = keys(%hash);
foreach my $key (@hashKeys)...



8.12

Iterating over hash elements

For example, iterating over the keys in %hash:

my @hashKeys = keys(%hash);
foreach my $key (@hashKeys) {
    print "The key is $key\n";
    print "The value is $hash{$key}\n";
}

%hash
"a"   => 5
"bob" => "zzz"
50    => "John"

Output:

The key is bob
The value is zzz
The key is a
The value is 5
The key is 50
The value is John



8.13

Iterating over hash elements

Note: the elements are returned in an arbitrary order, so if you want a certain order, use sort:

my @hashKeys = keys(%hash);
my @sortedHashKeys = sort(@hashKeys);
foreach my $key (@sortedHashKeys) {
    print "The key is $key\n";
    print "The value is $hash{$key}\n";
}
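One detail worth keeping in mind: sort compares its elements as strings by default, so numeric keys need an explicit numeric comparison. The sketch below illustrates this; the hash %lengths and its contents are made up for the example.

```perl
use strict;
use warnings;

my %hash = ("a" => 5, "bob" => "zzz", 50 => "John");

# Default sort: string order, so "50" comes before "a" and "bob"
my @sortedHashKeys = sort(keys(%hash));
foreach my $key (@sortedHashKeys) {
    print "$key => $hash{$key}\n";
}
# prints:
# 50 => John
# a => 5
# bob => zzz

# Hypothetical hash with numeric keys, for illustration only
my %lengths = (9 => "x", 10 => "y", 100 => "z");

my @asStrings = sort(keys(%lengths));               # string order
my @asNumbers = sort { $a <=> $b } keys(%lengths);  # numeric order

print "@asStrings\n";   # 10 100 9
print "@asNumbers\n";   # 9 10 100
```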



8.14

Example – phoneBook.pl #1

######################################
# Purpose: Store names and phone numbers in a hash,
#          and allow the user to ask for the number of a certain name.
# Input:   Enter name-number pairs, enter "END" as a name to stop,
#          then enter a name to get his/her number
#
use strict;

my %phoneNumbers = ();
my $number;


8.15

Example – phoneBook.pl #2

# Ask user for names and numbers and store in a hash
my $name = "";
while (1==1) {
    print "Enter a name that will be added to the phone book:\n";
    $name = <STDIN>;
    chomp $name;
    if ($name eq "END") {
        last;
    }
    print "Enter a phone number: \n";
    $number = <STDIN>;
    chomp $number;
    $phoneNumbers{$name} = $number;
}


8.16

Example – phoneBook.pl #3

# Ask for a name and print the corresponding number
$name = "";
while (1==1) {
    print "Enter a name to search for in the phone book:\n";
    $name = <STDIN>;
    chomp $name;
    if (exists($phoneNumbers{$name})) {
        print "The phone number of $name is: $phoneNumbers{$name}\n";
    }
    elsif ($name eq "END") {
        last;
    }
    else {
        print "Name not found in the book\n";
    }
}


8.17

Class exercise 8

1. Write a script that reads a file with a list of protein names and lengths (proteinLengths):

AP_000081 181
AP_000174 104
AP_000138 145

and stores the names of the sequences as hash keys, with the length of the sequence as the value. Print the keys of the hash.

2. Add to Q1: Read another file, and print the names that appeared in both files with the same length. Print a warning if the name is the same but the length is different.

3. Write a script that reads a GenPept file (you may use the preproinsulin record), finds all JOURNAL lines, and stores the journal name (as key) and year of publication (as value) in a hash:

a. Store only the first year (order of appearance in the file) value for each journal name

b*. Store all years for each journal name

Then print the names and years, sorted by the journal name (no need to sort the years for the same journal in b*, unless you really want to do so…)