Perl Programming
Paul Tymann, Computer Science Department, Rochester Institute of Technology, [email protected]
2
Strings
• A collection of characters
  – This slide consists of a sequence of strings
• CS folk have been working with strings for years
• Many tools and algorithms have been developed to work with strings
3
Sequences
• Ask a biologist what a sequence is:
  – ATGCCTATGCCCCTTGAGAGA
• Show that to a CS type and ask “what is this?”
  – It is a string!!
• In a way bioinformatics is all about manipulating strings
• CS types are real good at manipulating strings!!
4
What the heck is Perl?
• Perl is a computer language designed to scan arbitrary text files, extract information from those text files, and print reports based on that information
  – “Perl” == “Practical Extraction and Report Language”
• What makes Perl powerful?
  – It has sophisticated pattern matching capabilities
  – Straightforward I/O
• It was created, written, developed, and maintained by Larry Wall ([email protected])
5
Where does Perl stand?
• Perl is an interpreted language
  – Which means it runs slower than a compiled language
  – BUT it is much easier, and quicker, to develop programs
  – Some people would call Perl a scripting language
• The language is intended to be practical (easy to use, efficient, complete) rather than beautiful (tiny, elegant, minimal)
• It is a useful tool that can get the job done
6
Lots of People Are Using Perl
• There are lots of people using Perl and as a result there are lots of libraries that you can get for free
• If you can think of an application, chances are you can find the Perl code to do it
• This means writing Perl programs to do sophisticated things is easy and does not take long to do.
7
BioPerl
• BioPerl is a collection of Perl modules that facilitate the development of Perl scripts for bioinformatics applications
• BioPerl provides a means by which large quantities of sequence data can be analyzed in ways that are typically difficult or impossible with web-based systems
• BioPerl is open source software that is still under active development
8
BioPerl Modules
• Sequence Object
• Sequence flat-file format I/O
• Sequence alignment objects
• BLAST similarity search
• Sequence database access
• Sequence file indexing
• Common Base Object
9
Is Perl THE tool?
• Probably not
• Perl is great for munging text data to a different form
  – Get a BLAST search off the web and extract info from it and place it in your database
• Perl is great if you want it done fast
• What about more complicated programming?
  – You might want to get a bigger hammer!!
  – There are many BIO.* packages out there.
10
Your First Perl Program
# Say Hello
print "Hello World\n";
Comment Ignored by Interpreter
A String – a collection of characters
Escape character - newline
Print statement
Execution Order
11
Perl - Unix Style
#!/usr/local/bin/perl -w
# Say Hello
print "Hello World\n";
Comment used by Unix to run Perl
12
How To Make It Run
Create a text file that contains a Perl program (script)
13
How To Make It Run
Invoke the interpreter to run the program
14
Sometimes we make misteaks
Create the Perl script
Should be “print”
15
Sometimes we make misteaks
Run the interpreter
16
Sometimes we make misteaks
Fix the mistake
Try again
17
Your Turn!!
• Write a Perl program that prints out your name and the name of your workshop partner on separate lines
• Sample Output:
  Paul Tymann
  Rhys Price Jones
18
Your Second Perl Program
# Convert DNA string to RNA string
$DNA = "AGGGGAGGCCTTACT";
$RNA = $DNA;
$RNA =~ s/T/U/g;
print "$RNA\n";
A scalar variable holds the characters in the string
Assignment – evaluate right side and place in left
Apply operation on right to the contents of the variable on the left
Substitute all occurrences of T with U
19
Reading from the Keyboard
• You can read information from the keyboard by using
  – <STDIN>
• For example, to read a string from the keyboard and place that string in the scalar variable $STR
  – $STR = <STDIN>;
• The line termination character will be read and stored at the end of the string
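A minimal sketch of dealing with that trailing line terminator, assuming the usual remedy of calling Perl's built-in chomp. To keep the example self-contained it reads from an in-memory filehandle, which stands in for <STDIN>; the helper name read_sequence is hypothetical.

```perl
use strict;
use warnings;

# Read one line from a filehandle and strip the trailing newline, if any.
sub read_sequence {
    my ($fh) = @_;
    my $line = <$fh>;
    chomp $line;
    return $line;
}

# An in-memory filehandle stands in for the keyboard here.
my $input = "AGGT\n";
open( my $fh, '<', \$input ) or die "Cannot open in-memory file: $!";
my $dna = read_sequence($fh);
close $fh;

print "Read ", length($dna), " bases: $dna\n";
```

Without the chomp, length would count the newline as an extra character.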
20
Modified Program
# Convert DNA string to RNA string
print "Enter DNA string: ";
$DNA = <STDIN>;
chomp $DNA;    # drop the newline that <STDIN> leaves on the end
$RNA = $DNA;
$RNA =~ s/T/U/g;
print "$RNA\n";
21
Arithmetic and Logic Operators
Symbol          Meaning
**              Exponentiation
!               Logical negation
*  /  %         Multiplication, division, remainder
+  -            Addition, subtraction
<  >  <=  >=    Less than, greater than, less or equal, greater or equal
==  !=          Equal, not equal
&&              Boolean and
||              Boolean or
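A short sketch of how these operators combine. The table appears to run from tightest-binding to loosest, which matches Perl's precedence rules. The variable names are illustrative only, and the ?: conditional operator (not in the table) is used just to print readable results.

```perl
use strict;
use warnings;

my $power     = 2 ** 10;       # exponentiation
my $remainder = 17 % 5;        # remainder after integer division
my $mixed     = 2 + 3 * 4;     # * binds tighter than +, so this is 2 + 12
my $scaled    = 2 * 3 ** 2;    # ** binds tighter than *, so this is 2 * 9

# Comparisons bind tighter than &&, which binds tighter than ||
my $in_range   = ( $remainder >= 0 && $remainder < 5 )     ? "yes" : "no";
my $odd_or_big = ( $power % 2 != 0 || $power > 1000 )      ? "yes" : "no";

print "power=$power remainder=$remainder mixed=$mixed scaled=$scaled\n";
print "in_range=$in_range odd_or_big=$odd_or_big\n";
```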
22
Flow of Control
• Conditional
  – if ( expression ) { statements }
  – if ( expression ) { statements } else { statements }
  – if ( expression ) { statements } elsif …
• Loops
  – while ( expression ) { statements }
  – for ( init; test; increment ) { statements }
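The forms above can be sketched in one small program. This is an illustration only: the describe subroutine and the range -1 to 1 are made up for the example.

```perl
use strict;
use warnings;

# Classify a number using if / elsif / else.
sub describe {
    my ($n) = @_;
    if ( $n < 0 ) {
        return "negative";
    }
    elsif ( $n == 0 ) {
        return "zero";
    }
    else {
        return "positive";
    }
}

# Walk the range with a while loop...
my $i = -1;
while ( $i <= 1 ) {
    print "$i is ", describe($i), "\n";
    $i = $i + 1;
}

# ...and again with a for loop, this time counting the positive values.
my $positives = 0;
for ( my $j = -1; $j <= 1; $j = $j + 1 ) {
    if ( describe($j) eq "positive" ) {
        $positives = $positives + 1;
    }
}
print "positives: $positives\n";
```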
23
Examples
# Print 1 through 100 twice

# First with a while loop
$i = 1;
while ( $i <= 100 ) {
    print $i, "\n";
    $i = $i + 1;
}

# Then with a for loop (the increment lives in the loop header only)
for ( $i = 1; $i <= 100; $i = $i + 1 ) {
    print $i, "\n";
}
24
Reverse Complement
# Calculate the reverse complement
$dna = <STDIN>;
$revcomm = "";
for ( $pos = 0; $pos < length($dna) - 1; $pos = $pos + 1 ) {
    $base = substr( $dna, $pos, 1 );
    if ( $base eq 'A' ) { $base = 'T'; }
    elsif ( $base eq 'T' ) { $base = 'A'; }
    elsif ( $base eq 'C' ) { $base = 'G'; }
    else { $base = 'C'; }
    # Put each complemented base in front so the result comes out reversed
    $revcomm = $base . $revcomm;
}
print $revcomm, "\n";
Don’t include the newline
String concatenation
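For comparison, a more compact sketch of the same calculation. It swaps the character-by-character loop for two Perl built-ins, reverse and the tr/// transliteration operator, so it is an alternative technique rather than the slide's method. The subroutine name reverse_complement is hypothetical, and the input is assumed to contain only the uppercase letters A, C, G and T.

```perl
use strict;
use warnings;

# Reverse the string, then swap each base for its complement.
sub reverse_complement {
    my ($dna) = @_;
    my $revcomp = reverse $dna;    # assigning to a scalar reverses the characters
    $revcomp =~ tr/ACGT/TGCA/;     # A<->T and C<->G in one pass
    return $revcomp;
}

print reverse_complement("AAGC"), "\n";
```

Characters other than A, C, G and T pass through tr/// unchanged, unlike the loop version, which turns every unrecognized character into C.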
25
Perl IS Different
while ( <> ) {
    print if /blue/;
}
Treat each argument on the command line as a file name. Open the files one at a time and step through them a line at a time
Print the current line if it contains the string “blue”
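A sketch of what the short form is doing, with the implicit pieces written out. The loop variable and the match target are named explicitly instead of relying on Perl's default variable $_. To stay self-contained, the example reads from an in-memory filehandle holding made-up text rather than from files named on the command line.

```perl
use strict;
use warnings;

# Made-up sample text standing in for the contents of an input file.
my $text = "blue sky\nred rose\nlight blue sea\n";
open( my $fh, '<', \$text ) or die "Cannot open in-memory file: $!";

my @matches;
while ( my $line = <$fh> ) {       # read one line at a time until end of file
    if ( $line =~ /blue/ ) {       # does this line contain "blue"?
        push @matches, $line;
    }
}
close $fh;

print @matches;
```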
26
Your Turn!!
• Change the reverse complement program so that
  – It reads the DNA strings from a file whose name is supplied on the command line. You may assume that each DNA string is on a separate line
  – Instead of calculating the reverse complement starting at the beginning of the string, your program must start at the end of the DNA and work towards the front
27
Lists
• A list is an object consisting of a sequence of values
  – 2, 3, 5, 7, 11, 13, 17, 19, 23
  – 1..10
  – 'a'..'z'
• A list that has been given a name is called an array
  – @small_primes = (2, 3, 5, 7, 11, 13, 17);
• The individual elements of a list must be scalars
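A brief sketch of building and inspecting arrays. The array names and contents are made up for illustration.

```perl
use strict;
use warnings;

my @bases   = ( 'A', 'C', 'G', 'T' );   # a named list is an array
my @digits  = ( 1 .. 10 );              # a range of numbers
my @letters = ( 'a' .. 'z' );           # a range of letters

print "first base: $bases[0]\n";                   # indexing starts at 0
print "number of bases: ", scalar(@bases), "\n";   # scalar() gives the item count
print "last index: ", $#bases, "\n";               # index of the final element
print "digits: @digits\n";                         # interpolation joins items with spaces
print "number of letters: ", scalar(@letters), "\n";
```

Note the difference between scalar(@bases), the number of items, and $#bases, which is always one less.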
28
Fibonacci
@fibs = ( 1, 1 );

for ( $i = 2; $i <= 10; $i = $i + 1 ) {
    $fibs[ $i ] = $fibs[ $i - 1 ] + $fibs[ $i - 2 ];
}

print "I calculated ", scalar(@fibs), " fibs\n";
print "@fibs\n";
A list with the first two Fibonacci numbers
Add the previous two numbers to get the next one
Extends the list and puts the next number there
Number of items in the list (scalar(@fibs); $#fibs would give the index of the last item instead)
29
Regular Expressions
• Provide a way of writing a compact description of a set of strings
  – Sort of like wildcards
• Single character patterns
  – A single character matches itself
  – A “.” matches any single character except newline
  – [characters] matches any one of the characters
  – A ^ at the start of the brackets negates the set: [^characters] matches any one character that is not listed
30
Examples
• G
• [0123456789]
• [0-9]
• [a-zA-Z]
• [^0-9]
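A small sketch that tries these patterns against a few made-up strings. The helper name matches is hypothetical. A pattern succeeds if it matches anywhere in the string, so [^0-9] succeeds as soon as the string contains one non-digit.

```perl
use strict;
use warnings;

# Return 1 if the pattern matches somewhere in the string, 0 otherwise.
sub matches {
    my ( $pattern, $string ) = @_;
    return ( $string =~ /$pattern/ ) ? 1 : 0;
}

print "G in ATGC: ",        matches( 'G',        'ATGC' ),  "\n";
print "[0-9] in abc7: ",    matches( '[0-9]',    'abc7' ),  "\n";
print "[a-zA-Z] in 42: ",   matches( '[a-zA-Z]', '42' ),    "\n";
print "[^0-9] in 12345: ",  matches( '[^0-9]',   '12345' ), "\n";
print "[^0-9] in 123x5: ",  matches( '[^0-9]',   '123x5' ), "\n";
```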
31
Character Class Abbreviations
Construct      Class            Negated    Negated Class
\d (digits)    [0-9]            \D         [^0-9]
\w (words)     [a-zA-Z0-9_]     \W         [^a-zA-Z0-9_]
\s (space)     [ \r\t\n\f]      \S         [^ \r\t\n\f]
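A short sketch of the abbreviations in use. The record string is made up, loosely modelled on a roster line, and split is a Perl built-in that breaks a string wherever the pattern matches.

```perl
use strict;
use warnings;

# Made-up record with a tab and a double space between fields.
my $record = "ADAM\t3976  716-555-4281";

# \s+ treats any run of whitespace as a single separator.
my @fields = split /\s+/, $record;

# \d stands for one digit, so this describes the shape of a phone number.
my $has_phone = ( $record =~ /\d\d\d-\d\d\d-\d\d\d\d/ ) ? 1 : 0;

# \W is any non-word character; a user id should contain none.
my $clean_id = ( "alb9999" !~ /\W/ ) ? 1 : 0;

print "fields: ", scalar(@fields), "\n";
print "has phone: $has_phone\n";
print "clean id: $clean_id\n";
```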
32
Grouping Patterns
• Sequence
  – abc
• Multipliers
  – * - zero or more of the previous character
    • a*b matches b, ab, aab, aaab, aaaab, …
  – + - one or more of the previous character
    • a+b matches ab, aab, aaab, …
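A sketch that checks the two multipliers against a few strings. The anchors ^ and $ are an addition not covered on the slide: they force the pattern to account for the whole string, which makes the difference between * and + visible. The subroutine names are hypothetical.

```perl
use strict;
use warnings;

# Whole-string match against a*b (zero or more a's, then b).
sub is_star_form { my ($s) = @_; return ( $s =~ /^a*b$/ ) ? 1 : 0; }

# Whole-string match against a+b (one or more a's, then b).
sub is_plus_form { my ($s) = @_; return ( $s =~ /^a+b$/ ) ? 1 : 0; }

for my $s ( 'b', 'ab', 'aaab', 'ac' ) {
    print $s, ": a*b=", is_star_form($s), " a+b=", is_plus_form($s), "\n";
}
```

Only the bare string b separates the two: it fits a*b but not a+b.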
33
My Problem
XXXX, ROBERT 4653 N VCSG-4 rma9999
XXXXXX, ADAM 3976 N VCSG-4 716-555-4281 alb9999
XXXXXXX, EDWARD 4637 N VCSG-2 716-555-4780 esb9999
XXXXXXX, JOHN 1906 N VCSG-4 716-555-4780
XXXX, DERRICK 6432 N VCSG-2 716-555-3161 dxc9999
XXXXXXXXX, JOHN 5034 N VCSG-2 716-555-3894 jak9999
XXX, JASON 9020 N VCSG-2 716-555-3145 jsl9999
XXXXXXX, SARAH 7610 N VCSG-2 716-555-3147 sem9999
XXXXXXXX, CHRISTOPHER 6309 N VCSG-2 716-555-3427 cco9999
XXXXXXX, MICHAEL 8195 N VCSG-2 716-555-3166 mpp9999
XXXXXX, SHAUN 9925 N VCSG-2 716-555-3145 sls9999
XXXXXX, WILLIAM 2568 N VCSG-2 716-555-3144 wjw9999
XXXXXX, PATRICK 2335 N EECC-2 716-555-3144 psw9999
34
Roster to CSV
while (<>) {
    ($last, $first, $id, $ntid, $gradeType, $program, $phone, $email) =
        /([^,]+), (\S+) (\d{4}) (\S*) (\S*) (\S+) (\S*) (\S*).*/;

    print "\"$last,$first\",$id,$program,$email\@cs.rit.edu\n";
}
Match 1 or more non-comma characters
Match 1 or more non-whitespace characters
Match 4 digits
Match 0 or more non-whitespace characters (the fields may not be in the input)
Match anything!!
XXXXXXX, EDWARD 4637 N VCSG-2 716-555-4780 esb9999
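A simplified sketch of the capture-and-reformat idea applied to the single record above. It assumes the fields are separated by exactly one space and that none are missing, so it uses a stricter seven-field pattern than the slide's. The slide's pattern, with its \S* groups, looks designed for column-aligned input where a field may be empty. The variable names are illustrative.

```perl
use strict;
use warnings;

my $line = 'XXXXXXX, EDWARD 4637 N VCSG-2 716-555-4780 esb9999';

# Each pair of parentheses captures one field; a match in list context
# returns the captured pieces in order.
my ( $last, $first, $id, $grade_type, $program, $phone, $email ) =
    $line =~ /^([^,]+), (\S+) (\d{4}) (\S+) (\S+) (\S+) (\S+)$/;

# Quote the name field because it contains a comma.
my $csv = "\"$last,$first\",$id,$program,$email\@cs.rit.edu";
print "$csv\n";
```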
35
The Result
"XXXX,ROBERT",4653,VCSG-4,[email protected]
"XXXXXX,ADAM",3976,VCSG-4,[email protected]
"XXXXXXX,EDWARD",4637,VCSG-2,[email protected]
"XXXXXXX,JOHN",1906,VCSG-4,@cs.rit.edu
"XXXX,DERRICK",6432,VCSG-2,[email protected]
"XXXXXXXXX,JOHN",5034,VCSG-2,[email protected]
"XXX,JASON",9020,VCSG-2,[email protected]
"XXXXXXX,SARAH",7610,VCSG-2,[email protected]
"XXXXXXXX,CHRISTOPHER",6309,VCSG-2,[email protected]
"XXXXXXX,MICHAEL",8195,VCSG-2,[email protected]
"XXXXXX,SHAUN",9925,VCSG-2,[email protected]
"XXXXXX,WILLIAM",2568,VCSG-2,[email protected]
"XXXXXX,PATRICK",2335,EECC-2,[email protected]