Introduction to Perl for Bioinformatics (Thursday, April 7)
Post on 21-Dec-2015
Programming languages
• Self-contained language
– Platform-independent
– Used to write O/S
– C (imperative, procedural)
– C++, Java (object-oriented)
– Lisp, Haskell (functional), Prolog (logic)
• Scripting language
– Closely tied to O/S
– Perl, Python, Ruby
• Domain-specific language
– R (statistics)
– MATLAB (numerics)
– SQL (databases)
• An O/S typically manages…
– Devices (see above)
– Files & directories
– Users & permissions
– Processes & signals
Perl is the most-used bioinformatics language
[Figure: Most popular bioinformatics programming languages. Source: Bioinformatics career survey, 2008 (Michael Barton)]
Perl overview
• Interpreted, not compiled
– Fast edit-run-revise cycle
• Procedural & imperative
– Sequence of instructions (“control flow”)
– Variables, subroutines
• Syntax close to C (the de facto standard minimal language)
– Weakly typed (unlike C)
– Redundant, not minimal (“there’s more than one way to do it”)
– High-level data structures & algorithms
– Hashes, arrays
– Operating System support (files, processes, signals)
– String manipulation
Perl basics
• Basic syntax of a Perl program:
#!/usr/local/bin/perl
# Elementary Perl program
print "Hello World\n";
Output:
Hello World
• Lines beginning with "#" are comments, and are ignored by Perl
• The print statement tells Perl to print the following text to the screen
• "\n" means new line
• Single or double quotes enclose a "string literal" (double-quoted strings are "interpolated")
• All statements end with a semicolon
Variables
• We can tell Perl to "remember" a particular value, using the assignment operator "=":
$x = 3;
print $x;
Output:
3
$x = "ACGCGT";    # Binding site for yeast transcription factor MCB
print $x;
Output:
ACGCGT
• The $x is referred to as a "scalar variable".
– Variable names can contain alphabetic characters, numbers (but not at the start of the name), and underscore symbols "_"
– Scalar variable names are all prefixed with the dollar symbol.
Arithmetic operations
• Basic operators are + - / * %
• Can also use += -= /= *= ++ --
$x = 14;
$y = 3;
print "Sum: ", $x + $y, "\n";
print "Product: ", $x * $y, "\n";
print "Remainder: ", $x % $y, "\n";
Output:
Sum: 17
Product: 42
Remainder: 2
$x = 5;
print "x started as $x\n";
$x = $x * 2;     # Could write $x *= 2;
print "Then x was $x\n";
$x = $x + 1;     # Could write $x += 1; or even ++$x;
print "Finally x was $x\n";
Output:
x started as 5
Then x was 10
Finally x was 11
String operations
• Concatenation . and .=
$a = "pan";
$b = "cake";
$a = $a . $b;
print $a;
Output:
pancake
$a = "soap";
$b = "dish";
$a .= $b;
print $a;
Output:
soapdish
• Can find the length of a string using the function length($x)
$mcb = "ACGCGT";
print "Length of $mcb is ", length($mcb);
Output:
Length of ACGCGT is 6
More string operations
$x = "A simple sentence";
print $x, "\n";
print uc($x), "\n";        # Convert to upper case
print lc($x), "\n";        # Convert to lower case
$y = reverse($x);          # Reverse the string
print $y, "\n";
$x =~ tr/i/a/;             # Transliterate "i"'s into "a"'s
print $x, "\n";
print length($x), "\n";    # Calculate the length of the string
Output:
A simple sentence
A SIMPLE SENTENCE
a simple sentence
ecnetnes elpmis A
A sample sentence
17
Concatenating DNA fragments
$dna1 = "accacgt";
$dna2 = "taggtct";
print $dna1 . $dna2;
Output:
accacgttaggtct
"Transcribing" DNA to RNA
$dna = "accACgttAGGTct";    # DNA string is a mixture of upper & lower case
$rna = lc($dna);            # Make it all lower case
$rna =~ tr/t/u/;            # Transliterate "t" to "u"
print $rna;
Output:
accacguuaggucu
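The transcription step can also be wrapped in a subroutine that keeps the original case rather than lower-casing first. This is only a sketch: the name transcribe is hypothetical and not from the slides, and it assumes the input contains plain DNA letters.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the slides): transcribe DNA to RNA,
# keeping the upper/lower case of each base.
sub transcribe {
    my ($dna) = @_;
    (my $rna = $dna) =~ tr/tT/uU/;   # copy, then t -> u and T -> U in one pass
    return $rna;
}

print transcribe("accACgttAGGTct"), "\n";   # prints accACguuAGGUcu
```

Listing both cases in tr/// avoids the separate lc() call when the case of the input carries meaning (for example, soft-masked sequence).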
Conditional blocks
• The ability to execute an action contingent on some condition is what distinguishes a computer from a calculator. In Perl, it looks like this:
if (condition) { action } else { alternative }
$x = 149;
$y = 100;
if ($x > $y)
{
    print "$x is greater than $y\n";
}
else
{
    print "$x is not greater than $y\n";
}
Output:
149 is greater than 100
• These braces { } tell Perl which piece of code is contingent on the condition.
Conditional operators
• Numeric: > >= < <= != ==
– != "does not equal"
• String: eq ne gt lt ge le
– eq "equals"
– ne "does not equal"
– gt "is alphabetically greater than"
– lt "is alphabetically less than"
– ge "is alphabetically greater-or-equal"
– le "is alphabetically less-or-equal"
$x = 5 * 4;
$y = 17 + 3;
if ($x == $y) { print "$x equals $y"; }
Output:
20 equals 20
• Note that the test for "$x equals $y" is $x==$y, not $x=$y
($x, $y) = ("Apple", "Banana");    # Shorthand syntax for assigning more than one variable at a time
if ($y gt $x) { print "$y after $x "; }
Output:
Banana after Apple
Logical operators
• Logical operators: && means "and", || means "or"
• An exclamation mark ! is used to negate what follows. Thus !($x < $y) means the same as ($x >= $y)
• In computers, the value zero is often used to represent falsehood, while any non-zero value (e.g. 1) represents truth. Thus:
if (1) { print "1 is true\n"; }
if (0) { print "0 is true\n"; }
if (-99) { print "-99 is true\n"; }
Output:
1 is true
-99 is true
• The words "and" and "or" can be used in place of && and ||; they mean the same but have lower precedence:
$x = 222;
if ($x % 2 == 0 and $x % 3 == 0)
{ print "$x is an even multiple of 3\n"; }
Output:
222 is an even multiple of 3
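Perl's notion of falsehood is slightly wider than "zero": the number 0, the string "0", the empty string and undef are all false, and everything else is true. A minimal sketch that checks a few values follows; the helper name truth is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: report whether Perl treats a value as true or false.
sub truth {
    my ($value) = @_;
    return $value ? "true" : "false";
}

print "0 is ",     truth(0),     "\n";   # false
print '"" is ',    truth(""),    "\n";   # false
print '"0" is ',   truth("0"),   "\n";   # false
print '"0.0" is ', truth("0.0"), "\n";   # true: only the exact string "0" is false
print "-99 is ",   truth(-99),   "\n";   # true
```

The "0.0" case matters when reading numbers from a file, because each line arrives as a string.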
Loops
• Here's how to print out the numbers 1 to 10:
$x = 1;
while ($x <= 10) {
    print $x, " ";
    ++$x;            # Equivalent to $x = $x + 1;
}
Output:
1 2 3 4 5 6 7 8 9 10
• This is a while loop. The code is executed while the condition is true.
– The code inside the braces is repeatedly executed as long as the condition $x<=10 remains true
A common kind of loop
• Let's dissect the code of the while loop again:
$x = 1;                  # Initialization
while ($x <= 10) {       # Test for completion
    print $x, " ";
    ++$x;                # Continuation
}
• This form of while loop is common enough to have its own shorthand: the for loop.
# Initialization; test for completion; continuation
for ($x = 1; $x <= 10; ++$x) {
    print $x, " ";
}
defined and undef
• The function defined($x) is true if $x has been assigned a value:
if (defined($newvar)) {
    print "newvar is defined\n";
} else {
    print "newvar is not defined\n";
}
Output:
newvar is not defined
• A variable that has not yet been assigned a value has the special value undef
• Often, if you try to do something "illegal" (like reading from a nonexistent file), you end up with undef as a result
Reading a line of data
• To read from a file, we first need to open the file and give it a filehandle.
open FILE, "sequence.txt";
– This code snippet opens a file called "sequence.txt", and associates it with a filehandle called FILE
• Once the file is opened, we can read a single line from it into the scalar $x:
$x = <FILE>;
– This reads the next line from the file, including the newline at the end, "\n". If the end of the file is reached, $x is assigned the special value undef
Reading an entire file
• The following piece of code reads every line in a file and prints it out to the screen:
open FILE, "sequence.txt";
while (defined ($x = <FILE>)) {
    print $x;
}
close FILE;
– This reads a line of data into $x, then checks if $x is defined. If $x is undef, then the file must have ended.
• A shorter version of this is as follows:
open FILE, "sequence.txt";
while ($x = <FILE>) {
    print $x;
}
close FILE;
– In a while condition, $x = <FILE> is equivalent to defined($x = <FILE>)
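The open statements above do not check whether the file could actually be opened. One common way to add that check is the three-argument form of open with a lexical filehandle and "or die". The sketch below assumes a hypothetical helper name, print_file, which is not part of the slides.

```perl
use strict;
use warnings;

# Hypothetical helper: print every line of a file and return the line count.
# Dies with the system error message ($!) if the file cannot be opened.
sub print_file {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    my $count = 0;
    while (defined(my $line = <$fh>)) {   # same test as the longer loop above
        print $line;
        ++$count;
    }
    close $fh;
    return $count;
}
```

A call such as print_file("sequence.txt") then either prints the file or stops with a message naming the file that was missing, rather than silently reading nothing.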
The default variable, $_
• Many operations that take a scalar argument, such as length($x), are assumed to work on $_ if the $x is omitted:
$_ = "Hello";
print;
print length;
Output:
Hello5
• So we can also read a whole file like this:
open FILE, "sequence.txt";
while (<FILE>) {     # Equivalent to while (defined($_=<FILE>)) {
    print;
}
close FILE;
Summary: scalars and loops
• Assignment operator: $x = 5;
• Arithmetic operations: $y = $x * 3;
• String operations: $s = "Value of y is " . $y;
• Conditional tests: if ($y > 10) { print $s; }
• Logical operators: if ($y>10 && $s eq "") { exit; }
• Loops: for ($x=1; $x<10; ++$x) { print $x, "\n"; }
• defined and undef
• Reading a file
Pattern-matching
• A more sophisticated kind of logical test is to ask whether a string contains a pattern
• e.g. does a yeast promoter sequence contain the MCB binding site, ACGCGT?
$name = "YBR007C";
$dna="TAATAAAAAACGCGTTGTCG";     # 20 bases upstream of the yeast gene YBR007C
if ($dna =~ /ACGCGT/)            # =~ is the pattern binding operator; /ACGCGT/ is the pattern for the MCB binding site
{ print "$name has MCB!\n"; }
Output:
YBR007C has MCB!
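Beyond a yes/no test, it is often useful to know where the site occurs. One way to sketch this is to match repeatedly with the /g modifier and read the match position from pos(). The subroutine name mcb_positions is an assumption made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the 0-based start position of every
# occurrence of the MCB site ACGCGT in a sequence.
sub mcb_positions {
    my ($dna) = @_;
    my @positions;
    while ($dna =~ /ACGCGT/g) {
        # pos() is the offset just past the match, so subtract the site length (6)
        push @positions, pos($dna) - 6;
    }
    return @positions;
}

my @hits = mcb_positions("TAATAAAAAACGCGTTGTCG");
print "MCB found at: @hits\n";   # prints MCB found at: 9
```

Positions are counted from zero, matching the way Perl indexes strings and arrays.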
FASTA format
• A format for storing multiple named sequences in a single file
• This file contains 3' UTRs for Drosophila genes CG11604, CG11455 and CG11488
>CG11604
TAGTTATAGCGTGAGTTAGTTGTAAAGGAACGTGAAAGATAAATACATTTTCAATACC
>CG11455
TAGACGGAGACCCGTTTTTCTTGGTTAGTTTCACATTGTAAAACTGCAAATTGTGTAAAAATAAAATGAGAAACAATTCTGGT
>CG11488
TAGAAGTCAAAAAAGTCAAGTTTGTTATATAACAAGAAATCAAAAATTATATAATTGTTTTTCACTCT
• Name of sequence is preceded by > symbol
• NB sequences can span multiple lines
• Call this file fly3utr.txt
Printing all sequence names in a FASTA database
open FILE, "fly3utr.txt";
while ($x = <FILE>) {
    if ($x =~ />/) { print $x; }
}
close FILE;
Output:
>CG11604
>CG11455
>CG11488
• The key to this program is this block:
if ($x =~ />/) { print $x; }
– The pattern matches (and returns TRUE) if the variable $x contains the FASTA sequence-name symbol >
– The print statement prints $x if the pattern matched
Pattern replacement
open FILE, "fly3utr.txt";
while (<FILE>) {
    if (/>/) {
        s/>//;       # New statement removes the ">"
        print;
    }
}
close FILE;
Output:
CG11604
CG11455
CG11488
• The new statement s/>// is an example of a replacement.
• General form: s/OLD/NEW/ replaces OLD with NEW
• Thus s/>// replaces ">" with "" (the empty string)
• $_ is the default variable for these operations
Finding all sequence lengths (flowchart)
• Start: open file
• Read line
• End of file?
– yes: print last sequence length, then stop
– no: remove "\n" newline character at end of line
• Line starts with ">"?
– yes (sequence name): unless this is the first sequence, print last sequence length; record the name; reset running total of current sequence length
– no (sequence data): add length of line to running total
• Go back to "Read line"
Finding all sequence lengths
open FILE, "fly3utr.txt";
while (<FILE>) {
    chomp;
    if (/>/) {
        if (defined $len) {
            print "$name $len\n";
        }
        $name = $_;
        $len = 0;
    } else {
        $len += length;
    }
}
print "$name $len\n";
close FILE;
Output:
>CG11604 58
>CG11455 83
>CG11488 68
• The chomp statement trims the newline character "\n" off the end of the default variable, $_. Try it without this and see what happens, and whether you can work out why
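The same logic can be sketched as a subroutine that works on a list of lines, which makes it easy to try on small inputs. This version makes a few assumptions that go beyond the slide: the name fasta_lengths is hypothetical, the ">" is only recognised at the start of a line, the name is captured without the ">" using parentheses in the pattern, and an empty input produces no output instead of a blank record.

```perl
use strict;
use warnings;

# Hypothetical helper: given the lines of a FASTA file, return one
# "name length" string per sequence.
sub fasta_lengths {
    my @lines = @_;
    my @report;
    my ($name, $len);
    foreach my $line (@lines) {
        chomp $line;
        if ($line =~ /^>(.*)/) {                 # header: ">" at the start of the line
            push @report, "$name $len" if defined $name;
            ($name, $len) = ($1, 0);             # record the name, reset the total
        } elsif (defined $name) {
            $len += length $line;                # sequence data
        }
    }
    push @report, "$name $len" if defined $name; # last sequence
    return @report;
}

print "$_\n" foreach fasta_lengths(">seq1\n", "ACGT\n", "AC\n", ">seq2\n", "GGG\n");
```

With the toy input shown, the sequence seq1 spans two lines (4 + 2 bases) and seq2 spans one.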
Reverse complementing DNA
• A common operation due to double-helix symmetry of DNA
$dna = "accACgttAGgtct";
$revcomp = lc($dna);               # Start by making string lower case again. This is generally good practice
$revcomp = reverse($revcomp);      # Reverse the string
$revcomp =~ tr/acgt/tgca/;         # Replace 'a' with 't', 'c' with 'g', 'g' with 'c' and 't' with 'a'
print $revcomp;
Output:
agacctaacgtggt
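As a sketch, the reverse-complement steps can be packaged as a subroutine that handles both upper and lower case, so no lc() call is needed. The name revcomp is chosen here for illustration; letters other than a, c, g, t are left unchanged.

```perl
use strict;
use warnings;

# Hypothetical helper: reverse-complement a DNA string, preserving case.
sub revcomp {
    my ($dna) = @_;
    # Assigning to a scalar puts reverse() in scalar context, so it reverses
    # the characters of the string; in list context (e.g. "print reverse $dna")
    # it would reverse a one-element list and leave the string unchanged.
    my $rc = reverse $dna;
    $rc =~ tr/acgtACGT/tgcaTGCA/;   # complement each base
    return $rc;
}

print revcomp("accACgttAGgtct"), "\n";   # prints agacCTaacGTggt
```

Applying revcomp twice returns the original string, which is a quick sanity check for any reverse-complement routine.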
Arrays
• An array is a variable holding a list of items
@nucleotides = ('a', 'c', 'g', 't');
print "Nucleotides: @nucleotides\n";
Output:
Nucleotides: a c g t
• We can think of this as a list with 4 entries
– element 0 is 'a', element 1 is 'c', element 2 is 'g', element 3 is 't'
– The array is the set of all four elements
– Note that the element indices start at zero.
Array literals
• There are several, equally valid ways to assign an entire array at once.
@a = (1,2,3,4,5);            # This is the most common: a comma-separated list, delimited by parentheses
print "a = @a\n";
@b = ('a','c','g','t');
print "b = @b\n";
@c = 1..5;
print "c = @c\n";
@d = qw(a c g t);
print "d = @d\n";
Output:
a = 1 2 3 4 5
b = a c g t
c = 1 2 3 4 5
d = a c g t
Accessing arrays
• To access array elements, use square brackets; e.g. $x[0] means "element zero of array @x"
@x = ('a', 'c', 'g', 't');
print $x[0], "\n";
$i = 2;
print $x[$i], "\n";
Output:
a
g
• Remember, element indices start at zero!
• If you use an array @x in a scalar context, such as @x+0, then Perl assumes that you wanted the length of the array.
@x = ('a', 'c', 'g', 't');
print @x + 0;
Output:
4
Array operations
• You can sort and reverse arrays...
@x = ('a', 't', 'g', 'c');
@y = sort @x;
@z = reverse @y;
print "x = @x\n";
print "y = @y\n";
print "z = @z\n";
Output:
x = a t g c
y = a c g t
z = t g c a
• You can read the entire contents of a file into an array (each line of the file becomes an element of the array)
open FILE, "sequence.txt";
@x = <FILE>;
push, pop, shift, unshift
@x = ('A', 'T', 'W');
print "I started with @x\n";
$y = pop @x;          # pop removes the last element of an array
push @x, 'G';         # push adds an element to the end of an array
print "Then I had @x\n";
$z = shift @x;        # shift removes the first element of an array
unshift @x, 'C';      # unshift adds an element to the start of an array
print "Now I have @x\n";
print "I lost $y and $z\n";
Output:
I started with A T W
Then I had A T G
Now I have C T G
I lost W and A
foreach
• Finding the total of a list of numbers:
@val = (4, 19, 1, 100, 125, 10);
$total = 0;
foreach $x (@val) {      # foreach statement loops through each entry in an array
    $total += $x;
}
print $total;
Output:
259
• Equivalent to:
@val = (4, 19, 1, 100, 125, 10);
$total = 0;
for ($i = 0; $i < @val; ++$i) {
    $total += $val[$i];
}
print $total;
Output:
259
The @ARGV array
• A special array is @ARGV
• This contains the command-line arguments when the program is invoked at the Unix prompt
• It's a way for the user to pass information into the program
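A minimal sketch of reading @ARGV: the script takes a file name from the first argument and falls back to a default when none is given. The default name fly3utr.txt and the variable names are choices made for this illustration.

```perl
use strict;
use warnings;

# Use the first command-line argument as the file name, or a default
# (an assumption for this sketch) when the program is run without arguments.
my $filename = @ARGV ? $ARGV[0] : "fly3utr.txt";

# An array in scalar context gives its length: the number of arguments.
my $argc = @ARGV;

print "Got $argc argument(s); will read $filename\n";
```

If the script were saved as, say, args.pl, then running "perl args.pl myseqs.txt" would place the string myseqs.txt in $ARGV[0].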
Exploding a sequence into an array
• The programming language C treats all strings as arrays. In Perl a string is a single scalar, but it can be converted into an array of characters:
$dna = "accggtgtgcg";
print "String: $dna\n";
@array = split( //, $dna);
print "Array: @array\n";
Output:
String: accggtgtgcg
Array: a c c g g t g t g c g
• The split statement turns a string into an array. Here, it splits after every character, but we can also split at specific points, like a restriction enzyme
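To illustrate the restriction-enzyme idea, the sketch below cuts a sequence the way EcoRI does, between the G and the AATTC of each GAATTC site. It relies on zero-width lookbehind and lookahead assertions, which these slides do not cover, and the name ecori_digest and the test sequence are made up for the example. Only upper-case sites are recognised.

```perl
use strict;
use warnings;

# Hypothetical helper: split a DNA string at every EcoRI site (G^AATTC).
# (?<=G) requires a G just before the cut point and (?=AATTC) requires
# AATTC just after it; neither consumes any characters, so no bases are lost.
sub ecori_digest {
    my ($dna) = @_;
    return split /(?<=G)(?=AATTC)/, $dna;
}

my @fragments = ecori_digest("CCGAATTCGGGAATTCAA");
print "Fragments: @fragments\n";   # prints Fragments: CCG AATTCGGG AATTCAA
```

A plain split /GAATTC/ would also break the string at each site, but it would delete the six bases of the site from the fragments.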
Taking a slice of an array
• The syntax @x[i,j,k...] returns an array containing elements i,j,k... of array @x (one element per index listed)
@nucleotides = ('a', 'c', 'g', 't');
@purines = @nucleotides[0,2];
@pyrimidines = @nucleotides[1,3];
print "Nucleotides: @nucleotides\n";
print "Purines: @purines\n";
print "Pyrimidines: @pyrimidines\n";
Output:
Nucleotides: a c g t
Purines: a g
Pyrimidines: c t
Finding elements in an array
• The grep command is used to select some elements from an array
• The statement grep(EXPR,LIST) returns all elements of LIST for which EXPR evaluates to true (when $_ is set to the appropriate element)
• e.g. select all numbers over 100:
@numbers = (101, 235, 10, 50, 100, 66, 1005);
@numbersOver100 = grep ($_ > 100, @numbers);
print "Numbers: @numbers\n";
print "Numbers over 100: @numbersOver100\n";
Output:
Numbers: 101 235 10 50 100 66 1005
Numbers over 100: 101 235 1005
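Because a pattern match with no bound variable tests $_, grep combines naturally with pattern-matching. The sketch below selects the sequences that contain the MCB site, using the block form grep { ... } LIST. The second and third sequences are invented for the example.

```perl
use strict;
use warnings;

# Three candidate promoter sequences; only the first comes from the slides.
my @promoters = ("TAATAAAAAACGCGTTGTCG", "TTTTGGGGCCCCAAAA", "GGACGCGTAA");

# Keep the elements for which the pattern matches $_.
my @with_mcb = grep { /ACGCGT/ } @promoters;

# In scalar context, grep returns the number of matching elements.
my $count = grep { /ACGCGT/ } @promoters;

print "With MCB: @with_mcb\n";
print "Count: $count\n";   # prints Count: 2
```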
Applying a function to an array
• The map command applies a function to every element in an array
• Similar syntax to grep: map(EXPR,LIST) applies EXPR to every element in LIST
• Example: multiply every number by 3
@numbers = (101, 235, 10, 50, 100, 66, 1005);
@numbersTimes3 = map ($_ * 3, @numbers);
print "Numbers: @numbers\n";
print "Numbers times 3: @numbersTimes3\n";
Output:
Numbers: 101 235 10 50 100 66 1005
Numbers times 3: 303 705 30 150 300 198 3015
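The same idea applies to sequences. As a small sketch, map can turn a list of DNA strings into a list of their lengths or into upper-case copies. The variable names are illustrative, and the block form map { ... } LIST is used here instead of map(EXPR,LIST).

```perl
use strict;
use warnings;

my @seqs = ("accacgt", "taggtct", "ACGCGT");

# One output element per input element; $_ is the current element.
my @lengths = map { length($_) } @seqs;
my @upper   = map { uc($_) } @seqs;

print "Lengths: @lengths\n";   # prints Lengths: 7 7 6
print "Upper: @upper\n";       # prints Upper: ACCACGT TAGGTCT ACGCGT
```

The original array @seqs is left unchanged because the block only reads $_.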
Review: pattern-matching
• The following code:
if (/ACGCGT/) { print "Found MCB binding site!\n"; }
prints the string "Found MCB binding site!" if the pattern "ACGCGT" is present in the default variable, $_
• Instead of using $_ we can "bind" the pattern to another variable (e.g. $dna) using this syntax:
if ($dna =~ /ACGCGT/) { print "Found MCB binding site!\n"; }
• We can replace the first occurrence of ACGCGT with the string _MCB_ using the following syntax:
$dna =~ s/ACGCGT/_MCB_/;
• We can replace all occurrences by appending a 'g':
$dna =~ s/ACGCGT/_MCB_/g;
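The s///g operator also returns the number of replacements it made, which gives a quick way to count sites while marking them. The sketch below uses a hypothetical subroutine name, mark_mcb, and an invented test sequence.

```perl
use strict;
use warnings;

# Hypothetical helper: mark every MCB site and report how many were found.
sub mark_mcb {
    my ($dna) = @_;
    # s///g returns the number of substitutions, or an empty string if none
    my $count = ($dna =~ s/ACGCGT/_MCB_/g);
    return ($dna, $count || 0);
}

my ($marked, $n) = mark_mcb("TTACGCGTAAACGCGTCC");
print "$marked ($n sites)\n";   # prints TT_MCB_AA_MCB_CC (2 sites)
```

Because the subroutine works on its own copy of the string, the caller's original sequence is not modified.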