beginning perl for bioinformatics-rvs
DESCRIPTION
Perl is an interpreted language optimized for scanning arbitrary text files, extracting information from those text files, and printing reports based on that information. It's also a good language for many system management tasks.
PRACTICAL EXTRACTION & REPORT LANGUAGE
By Raghvendra Sachan
CONTENTS
SL.NO. TOPIC
1.0 INTRODUCTION TO PERL
1.1 PERL FACTS
1.2 WHY PERL?
2.0 HISTORY OF PERL
3.0 BIOINFORMATICS (GENERAL VIEW)
4.0 BIOINFORMATICS USING PERL
4.1 PROGRAMMING CONCEPTS
4.2 VARIABLE
4.3 STRING OPERATION
5.0 PERL PROGRAMS
5.1 TO FIND OUT THE FIRST ORF IN THE GIVEN mRNA SEQUENCE
5.2 TO FIND OUT 6 ORFs IN THE GIVEN DNA SEQUENCE
5.3 TO DETERMINE THE INFORMATION ABOUT 20 AMINO ACIDS
5.4 TO DETERMINE THE INFORMATION ABOUT NUCLEOTIDES
5.5 TO DETERMINE THE MOLECULAR WEIGHT OF THE AMINO ACIDS SEQUENCE
5.6 TO DETERMINE MOLECULAR FORMULA OF THE AMINO ACIDS SEQUENCE
5.7 TO FIND THE REVERSE, COMPLEMENTARY SEQUENCE
5.8 TO IDENTIFY THE NUMBER OF NUCLEOTIDES IN THE SEQUENCE
5.9 TO IDENTIFY THE NUMBER OF NUCLEOTIDES AND LENGTH IN THE SEQUENCE
5.10 TO DETERMINE MOL. WT. OF THE DNA SEQ. USING FILE HANDLING
6.0 APPENDIX
6.1 WHAT IS PERL?
6.2 VARIABLE & DATA TYPES
6.3 QUOTES AND STRINGS
6.4 OPERATORS
6.5 TESTING
6.6 BOOLEAN EXPRESSIONS
6.7 IMPORTANT PERL FUNCTIONS
7.0 CONCLUSION
1.0 Introduction to Perl
Perl is an interpreted language optimized for scanning arbitrary text files, extracting
information from those text files, and printing reports based on that information. It's
also a good language for many system management tasks. The language is intended to
be practical (easy to use, efficient, complete) rather than beautiful (tiny, elegant,
minimal). It combines (in the author's opinion, anyway) some of the best features of
C, sed, awk, and sh, so people familiar with those languages should have little difficulty
with it. (Language historians will also note some vestiges of csh, Pascal, and even
BASIC-PLUS.) Expression syntax corresponds quite closely to C expression syntax.
http://www.activestate.com/Products/ActivePerl/
This is the officially blessed version of Perl for Windows. It is released by ActiveState.
ActivePerl can be downloaded for free, or we can order the ActiveCD from them. It
comes with a wealth of widely used third-party libraries such as Tk, LWP, and the XML
bundle.
Whatever operating system we are on, Perl is a valid choice. This holds especially on a
UNIX-based operating system such as Linux, FreeBSD, or Mac OS X, and Perl runs on Windows as well.
The official documentation system for Perl is POD, or "Plain Old Documentation". It is
powerful and widely used.
1.1 Perl Facts
Perl is a stable, cross platform programming language.
It is used for mission critical projects in the public and private sectors.
Perl is Open Source software, licensed under its Artistic License, or the GNU
General Public License (GPL).
Perl was created by Larry Wall.
Perl 1.0 was released to Usenet's alt.comp.sources in 1987.
PC Magazine named Perl a finalist for its 1998 Technical Excellence Award in the
Development Tool category.
1.2 Why Perl?
Perl takes the best features from other languages, such as C, awk, sed, sh, and
BASIC, among others.
Perl database integration interface (DBI) supports third-party databases including
Oracle, Sybase, Postgres, MySQL and others.
Perl works with HTML, XML, and other mark-up languages.
Perl supports Unicode.
Perl is Y2K compliant.
Perl supports both procedural and object-oriented programming.
Perl interfaces with external C/C++ libraries through XS or SWIG.
Perl is extensible. There are over 500 third party modules available from the
Comprehensive Perl Archive Network (CPAN).
The Perl interpreter can be embedded into other systems.
2.0 HISTORY OF PERL
-- Larry Wall when asked if he learned Perl from the perl source
PERL 1.000
Perl 1.000 is unleashed upon the world. Some people take Perl's birthday seriously.
Behold as Randal sings Happy Birthday to Larry's answering machine. The description
from the original man page sums up this new language well. (18 December)
PERL 2.000
Perl 2.000 released. (5 June) Some of the enhancements from Perl1 included:
New regexp routines derived from Henry Spencer's.
Support for /(foo|bar)/.
Support for /(foo)*/ and /(foo)+/.
\s for whitespace, \S for non-, \d for digit, \D nondigit
PERL 3.000
Perl 3.000 is released and is distributed by Larry for the first time under the terms of the
GNU Public License. A few of the new features: (18 Oct)
Perl can now handle binary data correctly and has functions to pack and unpack
binary structures into arrays or lists. You can now do arbitrary ioctl functions.
You can now pass things to subroutines by reference.
Debugger enhancements.
PERL 4.000
Perl 4.000 is released and includes an artistic license as well as the GPL. (21 March)
Linus Torvalds releases the first version of Linux. Linus had wanted to name it Freax
(free + freak + unix) but the site administrator liked Linux better. It was distributed under
the GNU Public License. (July).
PERL 5.000
The much anticipated Perl 5.000 is unveiled. It was a complete rewrite of Perl.
A few of the features and pitfalls are: (18 October)
Objects.
The documentation is much more extensive and perldoc along with pod is
introduced.
Lexical scoping available via my. eval can see the current lexical variables.
The preferred package delimiter is now :: rather than '.
New functions include: abs(), chr(), uc(), ucfirst(), lc(), lcfirst(),
chomp(), glob()
There is now an English module that provides human readable translations for
cryptic variable names.
Several previously added features have been subsumed under the new keywords use
and no.
Pattern matches may now be followed by an m or s modifier to explicitly request
multiline or singleline semantics. An s modifier makes . match newline.
@ now always interpolates an array in double-quotish strings. Some programs may
now need to use backslash to protect any @ that shouldn't interpolate.
It is no longer syntactically legal to use whitespace as the name of a variable, or as a
delimiter for any kind of quote construct.
The -w switch is much more informative.
=> is now a synonym for comma. This is useful as documentation for arguments that
come in pairs, such as initializers for associative arrays, or named arguments to a
subroutine.
Perl 5.001 is released. (13 March)
Perl 5.002 announced which introduced, among other things, subroutine prototypes and
sysopen(). (29 February)
3.0 Bioinformatics Definition -General view
Bioinformatics derives knowledge from computer analysis of biological data. These can
consist of the information stored in the genetic code, but also experimental results from
various sources, patient statistics, and scientific literature. Research in bioinformatics
includes method development for storage, retrieval, and analysis of the data.
Bioinformatics is a rapidly developing branch of biology and is highly interdisciplinary,
using techniques and concepts from informatics, statistics, mathematics, chemistry,
biochemistry, physics, and linguistics. It has many practical applications in different
areas of biology and medicine.
4.0 Bioinformatics using Perl
Bioinformatics, the use of computers in biology research, has been increasing in
importance during the past decade as the Human Genome Project went from its
beginning to the announcement last year of a "draft" of the complete sequence of human
DNA.
The importance of programming in biology stretches back before the previous decade.
And it certainly has a significant future now that it is a recognized part of research into
many areas of medicine and basic biological research. This may not be news to
biologists. But Perl programmers may be surprised to find that their handsome language
has become one of the most popular, if not the most popular, of the computer languages used in
bioinformatics.
4.1 Programming Concepts
Program = a text file that contains instructions for the computer to follow
Programming Language = a set of commands that the computer understands
(via a “command interpreter”)
Input = data that is given to the program
Output = something that is produced by the program
Programming
Write the program (with a text editor)
Run the program
Look at the output
Correct the errors (debugging)
Repeat
(computers are VERY dumb - they do exactly what you tell them to do, so be
careful what you ask for…)
String
Text is handled in Perl as a string
This basically means that you have to put quotes around any piece of text that is
not an actual Perl instruction.
Perl has two kinds of quotes - single ' '
and double " "
(they are different - more about this later)
Perl uses the term "print" to create output
Without a print statement, you won't know what your program has done
You need to tell Perl to put a newline at the end of a printed line
o Use the "\n" (newline) escape sequence
o Include the quotes
o The "\" character is called an escape - Perl uses it a lot
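The points above can be sketched as follows. This is only an illustration, assuming a standard Perl 5 interpreter; the variable name $line is made up for the example.

```perl
use strict;
use warnings;

# Double quotes turn the \n escape into a real newline character.
my $line = "ATGC\n";
print $line;

# Without \n the next print continues on the same output line.
print "length: ";
print length("ATGC"), "\n";
```

The first print ends its line because the string carries its own newline; the last two prints share one line because only the second one supplies "\n".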
Numbers and Functions
Perl handles numbers in most common formats:
456
5.6743
6.3E-26
Mathematical functions work pretty much as you would expect:
4+7, 6*4, 43-27, 256/12, 2/(3-5)
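As a quick check, the expressions listed above can be evaluated directly. This is a minimal sketch; the array name @results is an assumption made for the example.

```perl
use strict;
use warnings;

# Each expression from the list, evaluated in order.
my @results = (4 + 7, 6 * 4, 43 - 27, 256 / 12, 2 / (3 - 5));

# Print one result per line.
print "$_\n" for @results;

# Scientific notation is accepted as well.
my $tiny = 6.3E-26;
print "$tiny\n";
```

Note that division is not integer division: 256/12 yields a fractional value, and parentheses control the order of evaluation in 2/(3-5).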
4.2 Variable
To be useful at all, a program needs to be able to store information from one
line to the next
Perl stores information in variables
A variable name starts with the "$" symbol, and it can store strings or
numbers
o Variables are case sensitive
o Give them sensible names
Use the "=" sign to assign values to variables
$a = 100;
$s = "ttattagcc";
4.3 String operation
Strings (text) in variables can be used for some math-like operations
Concatenate (join) - use the dot . operator
$seq1 = "ACTG";
$seq2 = "GGCTA";
$seq3 = $seq1 . $seq2;
print $seq3;
ACTGGGCTA
String comparison (are they the same, > or <)
eq (equal )
ne (not equal )
ge (greater or equal )
gt (greater than )
lt (less than )
le (less or equal )
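A short sketch of the string comparison operators listed above, reusing the two example sequences; the names $same and $before are illustrative only.

```perl
use strict;
use warnings;

my $seq1 = "ACTG";
my $seq2 = "GGCTA";

# eq / ne test whether two strings are identical.
my $same = ($seq1 eq $seq2) ? "yes" : "no";

# lt / gt compare in dictionary (character code) order: "A" sorts before "G".
my $before = ($seq1 lt $seq2) ? "yes" : "no";

print "same: $same\n";
print "seq1 sorts first: $before\n";
```

These operators should be used for text such as sequences. The numeric operators (==, <, >) would treat both strings as the number 0 and so cannot tell them apart.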
5.0 PERL PROGRAMS
5.1 PROGRAM NO.1
print "ENTER THE m-RNA SEQUENCE\n";
$a=<stdin>;
chomp($a);
$len=length($a);
print" THE LENGTH OF mRNA SEQUENCE IS $len\n";
$c=0;
$g='';
while ($c<$len){
$b=substr($a,$c,3);
if ($b=~ /AUG/)
{
$g=$g.'M';
}
if ($b=~ /(UUA)|(UUG)|(CUU)|(CUC)|(CUA)|(CUG)/)
{
$g=$g.'L';
}
if ($b=~ /(UCU)|(UCC)|(UCA)|(UCG)|(AGU)|(AGC)/)
{
$g=$g.'S';
}
if ($b=~ /(AUU)|(AUC)|(AUA)/)
{
$g=$g.'I';
}
if ($b=~ /(UUU)|(UUC)/)
{
$g=$g.'F';
}
if ($b=~ /(GUU)|(GUC)|(GUA)|(GUG)/)
{
$g=$g.'V';
}
if ($b=~ /(CCU)|(CCC)|(CCA)|(CCG)/)
{
$g=$g.'P';
}
if ($b=~ /(ACU)|(ACC)|(ACA)|(ACG)/)
{
$g=$g.'T';
}
if ($b=~ /(GCU)|(GCC)|(GCA)|(GCG)/)
{
$g=$g.'A';
}
if ($b=~ /(UAU)|(UAC)/)
{
$g=$g.'Y';
}
if ($b=~ /(UGU)|(UGC)/)
{
$g=$g.'C';
}
if ($b=~ /UGG/)
{
$g=$g.'W';
}
if ($b=~ /(CAU)|(CAC)/)
{
$g=$g.'H';
}
if ($b=~ /(CAA)|(CAG)/)
{
$g=$g.'Q';
}
if ($b=~ /(CGU)|(CGC)|(CGA)|(CGG)|(AGA)|(AGG)/)
{
$g=$g.'R';
}
if ($b=~ /(AAU)|(AAC)/)
{
$g=$g.'N';
}
if ($b=~ /(AAA)|(AAG)/)
{
$g=$g.'K';
}
if ($b=~ /(GGU)|(GGC)|(GGA)|(GGG)/)
{
$g=$g.'G';
}
if ($b=~ /(GAA)|(GAG)/)
{
$g=$g.'E';
}
if ($b=~ /(GAU)|(GAC)/)
{
$g=$g.'D';
}
if ($b=~ /(UAA)|(UAG)|(UGA)/)
{
$g=$g.'#';
}
$c=$c+3;
}
print"THE AMINO ACID IN THE SEQUENCE IN THE 1ST ORF IS $g\n";
RESULT
ENTER THE m-RNA SEQUENCE
AUCGAUCGAUGC
THE LENGTH OF mRNA SEQUENCE IS 12
THE AMINO ACID IN THE SEQUENCE IN THE 1ST ORF IS IDRC
COMMENT
AIM: TO FIND OUT THE FIRST ORF IN THE GIVEN mRNA SEQUENCE.
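The chain of if tests in Program No. 1 can also be written as a lookup table. The sketch below is an alternative, not the author's program: it assumes the standard genetic code, marks stop codons with '*', and the names %codons_for, %amino_acid_of and translate are hypothetical.

```perl
use strict;
use warnings;

# Standard genetic code, grouped by one-letter amino acid code ('*' = stop).
my %codons_for = (
    A   => [qw(GCU GCC GCA GCG)],
    R   => [qw(CGU CGC CGA CGG AGA AGG)],
    N   => [qw(AAU AAC)],
    D   => [qw(GAU GAC)],
    C   => [qw(UGU UGC)],
    Q   => [qw(CAA CAG)],
    E   => [qw(GAA GAG)],
    G   => [qw(GGU GGC GGA GGG)],
    H   => [qw(CAU CAC)],
    I   => [qw(AUU AUC AUA)],
    L   => [qw(UUA UUG CUU CUC CUA CUG)],
    K   => [qw(AAA AAG)],
    M   => [qw(AUG)],
    F   => [qw(UUU UUC)],
    P   => [qw(CCU CCC CCA CCG)],
    S   => [qw(UCU UCC UCA UCG AGU AGC)],
    T   => [qw(ACU ACC ACA ACG)],
    W   => [qw(UGG)],
    Y   => [qw(UAU UAC)],
    V   => [qw(GUU GUC GUA GUG)],
    '*' => [qw(UAA UAG UGA)],
);

# Invert the table so that each codon maps to its amino acid.
my %amino_acid_of;
for my $aa (keys %codons_for) {
    for my $codon (@{ $codons_for{$aa} }) {
        $amino_acid_of{$codon} = $aa;
    }
}

# Translate an mRNA string codon by codon, starting at the first base.
# Incomplete trailing codons are skipped; unknown codons become '?'.
sub translate {
    my ($mrna) = @_;
    my $protein = '';
    for (my $i = 0; $i + 3 <= length($mrna); $i += 3) {
        my $codon = uc(substr($mrna, $i, 3));
        $protein .= exists $amino_acid_of{$codon} ? $amino_acid_of{$codon} : '?';
    }
    return $protein;
}

my $protein = translate("AUCGAUCGAUGC");
print "$protein\n";
```

With a table, each codon is checked exactly once and a mistyped codon shows up as a missing or duplicated entry rather than as a silently wrong translation.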
5.2 PROGRAM NO.2.
print "ENTER THE DNA SEQUENCE\n";
$dna=<stdin>;
chomp($dna);
$dna1=$dna;
$len=length($dna);
$dna=~tr/ATGC/UACG/;
print"\nmRNA: $dna\n";
print "\nLENGTH: $len\n";
sub dna
{
$i=0;
$b=3;
$p='';
while($i<$len)
{
$seq=substr($dna,$i,$b);
if ($seq=~/GC./i) {$p.='A';}
if ($seq=~/UG[UC]/i) {$p.='C';}
if ($seq=~/GA[UC]/i) {$p.='D';}
if ($seq=~/GA[AG]/i) {$p.='E';}
if ($seq=~/UU[UC]/i) {$p.='F';}
if ($seq=~/GG./i) {$p.='G';}
if ($seq=~/CA[UC]/i) {$p.='H';}
if ($seq=~/AU[UCA]/i) {$p.='I';}
if ($seq=~/AA[AG]/i) {$p.='K';}
if ($seq=~/UU[AG]/i) {$p.='L';}
if ($seq=~/AUG/i) {$p.='M';}
if ($seq=~/AA[UC]/i) {$p.='N';}
if ($seq=~/CC./i) {$p.='P';}
if ($seq=~/CA[AG]/i) {$p.='Q';}
if ($seq=~/CG.|AG[AG]/i){$p.='R';}
if ($seq=~/UC.|AG[UC]/i){$p.='S';}
if ($seq=~/AC./i) {$p.='T';}
if ($seq=~/GU./i) {$p.='V';}
if ($seq=~/UGG/i) {$p.='W';}
if ($seq=~/UA[UC]/i) {$p.='Y';}
if ($seq=~/CU./i) {$p.='L';}
if ($seq=~/UA[AG]|UGA/i){$p.='*';}
$i=$i+3;
}
return $p;
}
print"\nFIRST READING FRAME ";
$q=dna();
print": $q\n";
print"\nSECOND READING FRAME ";
$dna=substr($dna,1,$len);
$p=dna();
print": $p\n";
print"\nTHIRD READING FRAME ";
$dna=substr($dna,1,$len);
$x=dna();
print": $x\n";
$rev=reverse($dna1);
$rev=~ tr/ACTG/UGAC/;
print "\nREVERSE mRNA : $rev\n ";
# dna() reads the global $dna, so switch it to the reverse strand first
$dna=$rev;
print"\nFOURTH READING FRAME ";
$q1=dna();
print": $q1\n";
print"\nFIFTH READING FRAME ";
$dna=substr($dna,1,$len);
$p1=dna();
print": $p1\n";
print"\nSIXTH READING FRAME ";
$dna=substr($dna,1,$len);
$x1=dna();
print": $x1\n";
RESULT
ENTER THE DNA SEQUENCE
ATGCGTGACATG
mRNA : UACGCACUGUAC
LENGTH 12
FIRST READING FRAME : YALY
SECOND READING FRAME : THC
THIRD READING FRAME : RTV
REVERSE mRNA : CAUGUCACGCAU
FOURTH READING FRAME : HVTH
FIFTH READING FRAME : MSR
SIXTH READING FRAME : CHA
COMMENT
AIM: TO FIND OUT 6 ORFs IN THE GIVEN DNA SEQUENCE.
5.3 PROGRAM NO.3
do{
print"*" x 50;
print "\nEnter E for ESSENTIAL AMINO ACIDS\n";
print "Enter N for NONESSENTIALS\n";
print"*" x 50;
$a=<stdin>;
chomp($a);
if ($a eq 'E')
{
print "Isoleucine(I)\n
Leucine(L)\n
Lysine(K)\n
Methionine(M)\n
Phenylalanine(F)\n
Threonine(T)\n
Tryptophan(W)\n
Valine(V)\n
Arginine(R)\n
Histidine(H)\n";
}
if($a eq 'N')
{
print "Alanine(A)\n
Asparagine(N)\n
Aspartate(D)\n
Cysteine(C)\n
Glutamate(E)\n
Glutamine(Q)\n
Glycine(G)\n
Proline(P)\n
Serine(S)\n
Tyrosine(Y)\n";
}
$b= <stdin>;
chomp($b);
if ($b eq 'I')
{
print "Isoleucine\n
Chemical formula: C6H13NO2\n
Molecular mass: 131.18 [1] g•mol-1\n
Systematic name:\n
(2S,3S)-2-amino-3-methylpentanoic acid\n
Abbreviations: I, Ile\n
Synonyms:\n
{2/α}-amino-{3/β}-methylvaleric acid\n
3-methyl-{/erythro-}norvaline\n
Amino-sec-butyl-acetic acid\n
Amino(1-methylpropyl)-acetic acid\n";
}
if ($b eq 'L')
{
print"Leucine\n
Chemical formula: C6H13NO2\n
Molecular mass: 131.18 g•mol-1\n
Systematic name:\n
(S)-2-amino-4-methyl-pentanoic acid\n
Abbreviations: L, Leu\n
Synonyms:\n
{(S)-/L-}2-amino-4-methylvaleric acid\n
4-methyl-norvaline\n
α-aminoisocaproic acid\n";
}
if ($b eq 'K')
{
print"Lysine\n
Systematic name (S)-2,6-Diaminohexanoic acid\n
Abbreviations Lys,k\n
Chemical formula C6H14N2O2\n
Molecular mass 146.19 g/mol\n
PubChem 876\n
Melting point 224 °C\n";
}
if ($b eq 'M')
{
print"Methionine\n
Systematic name (S)-2-amino-4-(methylsulfanyl)-\n
butanoic acid\n
Abbreviations Met,m\n
Chemical formula C5H11NO2S\n
Molecular mass 149.21 g mol-1\n
Melting point 281 °C\n";
}
if ($b eq 'F')
{
print "Phenylalanine\n
Systematic name 2-Amino-3-phenyl-propanoic acid\n
Abbreviations Phe,F\n
Chemical formula C9H11NO2\n
Molecular mass 165.19 g mol-1\n
Melting point 283 °C\n";
}
if ($b eq 'T')
{
print" Threonine\n
Systematic name (2S,3R)-2-Amino-3-hydroxybutanoic acid\n
Abbreviations Thr,T\n
Chemical formula C4H9NO3\n
Molecular mass 119.12 g mol-1\n
Melting point 256 °C\n";
}
if ($b eq 'W')
{
print" Tryptophan\n
Systematic name (S)-2-Amino-3-(1H-indol-3-yl)-propionic acid\n
Abbreviations Trp,W\n
Chemical formula C11H12N2O2\n
Molecular mass 204.23 g mol−1\n
Melting point 289 °C\n";
}
if ($b eq 'V')
{
print" Valine\n
Systematic name (S)-2-amino-3-methyl-butanoic acid\n
Abbreviations Val,V\n
Chemical formula C5H11NO2\n
Molecular mass 117.15 g mol-1\n
Melting point 315 °C\n";
}
if ($b eq 'R')
{
print"Arginine\n
Systematic (IUPAC) name\n
2-amino-5-(diaminomethylideneamino)pentanoic acid\n
Chemical data\n
Formula C6H14N4O2\n
Mol. weight 174.2\n";
}
if ($b eq 'H')
{
print" Histidine\n
Systematic (IUPAC) name\n
2-amino-3-(3H-imidazol-4-yl)propanoic acid\n
Chemical data\n
Formula C6H9N3O2\n
Mol. weight 155.16\n";
}
if ($b eq 'A')
{
print" Alanine\n
Systematic (IUPAC) name\n
(S)-2-aminopropanoic acid\n
Chemical data\n
Formula C3H7NO2\n
Mol. weight 89.1\n";
}
if ($b eq 'N')
{
print"Asparagine\n
Systematic (IUPAC) name\n
(2S)-2-amino-3-carbamoyl-propanoic acid\n
Chemical data\n
Formula C4H8N2O3\n
Mol. weight 132.118\n";
}
if ($b eq 'C')
{
print "Cysteine\n
Systematic (IUPAC) name\n
(2R)-2-amino-3-sulfanyl-propanoic acid\n
Chemical data\n
Formula C3H7NO2S\n
Mol. weight 121.16\n";
}
if ($b eq 'D')
{
print"Aspartic acid\n
Systematic (IUPAC) name\n
(2S)-2-aminobutanedioic acid\n
Chemical data\n
Formula C4H7NO4\n
Mol. weight 133.10\n";
}
if ($b eq 'E')
{
print"Glutamic acid\n
Systematic (IUPAC) name\n
(2S)-2-aminopentanedioic acid\n
Chemical data\n
Formula C5H9NO4\n
Mol. weight 147.13\n";
}
if ($b eq 'Q')
{
print" Glutamine\n
Systematic (IUPAC) name\n
(2S)-2-amino-4-carbamoyl-butanoic acid\n
Chemical data\n
Formula C5H10N2O3\n
Mol. weight 146.15\n";
}
if ($b eq 'G')
{
print" Glycine\n
Systematic (IUPAC) name\n
aminoethanoic acid\n
Chemical data\n
Formula C2H5NO2\n
Mol. weight 75.07\n";
}
if ($b eq 'P')
{
print" Proline\n
Systematic name (S)-Pyrrolidine-2-carboxylic acid\n
Abbreviations Pro,P\n
Chemical formula C5H9NO2\n
Molecular mass 115.13 g mol-1\n
Melting point 221 °C\n";
}
if ($b eq 'S')
{
print" Serine\n
Systematic name (S)-2-amino-3-hydroxypropanoic acid\n
Abbreviations Ser,S\n
Chemical formula C3H7NO3\n
Molecular mass 105.09 g mol-1\n
Melting point 228 °C \n";
}
if ($b eq 'Y')
{
print"Tyrosine\n
Systematic name (S)-2-Amino-3-(4-hydroxy-phenyl)-propanoic acid\n
Abbreviations Tyr,Y\n
Chemical formula C9H11NO3\n
Molecular mass 181.19 g mol-1\n
Melting point 343 °C\n";
}
print "\nEnter Again press Y";
$y=<stdin>;
chomp($y);
}
while($y eq 'Y');
RESULT
ENTER E FOR ESSENTIAL AMINO ACIDS
ENTER N FOR NON ESSENTIALS
E
LIST OF ESSENTIAL AMINO ACIDS
I
Isoleucine
Chemical formula: C6H13NO2
Molecular mass: 131.18 [1] g·mol-1
Systematic name:
(2S,3S)-2-amino-3-methylpentanoic acid
Abbreviations: I, Ile
Synonyms:
{2/α}-amino-{3/β}-methylvaleric acid
3-methyl-{/erythro-}norvaline
Amino-sec-butyl-acetic acid
Amino(1-methylpropyl)-acetic acid
To Start Again Press Y
COMMENT
AIM:TO DETERMINE THE INFORMATION ABOUT 20 AMINO ACIDS.
5.4 PROGRAM NO.4
do{
print"*" x 30;
print "\nEnter 1 for ADENINE\n";
print "Enter 2 for GUANINE\n";
print "Enter 3 for THYMINE\n";
print "Enter 4 for CYTOSINE\n";
print "ENTER 5 for URACIL\n";
print "ENTER YOUR CHOICE\n";
$a =<stdin>;
if($a==1)
{
print "ADENINE\n
Systematic (IUPAC) name 7H-purin-6-amine\n
Synonyms 6-aminopurine\n
Identifiers CAS number 73-24-5 PubChem 190\n
Chemical data\n
Formula C5H5N5\n
Mol. weight 135.127\n
SMILES NC1=NC=NC2=C1N=CN2\n
Physical data\n
Melt. point\n
360 - 365 °C (680 - 689 °F)\n";
}
if ($a==2)
{
print "GUANINE\n
Systematic name 2-amino-1H-purin-6(9H)-one\n
Other names 2-amino-6-oxo-purine,2-aminohypoxanthine\n
Molecular formula C5H5N5O\n
SMILES NC(NC1=O)=NC2=C1N=CN2\n
Molar mass 151.1261 g/mol\n
Appearance White amorphous solid\n
CAS number [73-40-5]\n
Melting point 360°C (633.15 K) deco.\n
Boiling point Sublimes\n";
}
if ($a==3)
{
print "THYMINE\n
Chemical name 5-Methylpyrimidine-2,4(1H,3H)-dione\n
Chemical formula C5H6N2O2\n
Molecular mass 126.11334 g/mol\n
Melting point 316 - 317 °C\n
CAS number 65-71-4\n
SMILES CC1=CNC(NC1=O)=O\n";
}
if ($a==4)
{
print "CYTOSINE\n
Chemical name 4-Aminopyrimidin-2(1H)-one\n
Chemical formula C4H5N3O\n
Molecular mass 111.102 g/mol\n
Melting point 320 - 325°C (decomp)\n
CAS number 71-30-7\n
SMILES NC1=NC(NC=C1)=O\n";
}
if ($a==5)
{
print "URACIL\n
Systematic name Pyrimidine-2,4(1H,3H)-dione\n
Other names Uracil, 2-oxy-4-oxy pyrimidine\n
Molecular formula C4H4N2O2\n
Molar mass 112.08676 g/mol\n
Appearance Solid\n
CAS number [66-22-8]\n
Melting point 335 °C (608 K)\n
Boiling point N/A\n
Acidity (pKa) basic pKa = -3.4\n
acidic pKa = 9.389\n";
}
print "\nTo Start Again press Y";
$y=<stdin>;
chomp($y);
}
while($y eq 'Y');
RESULT
Enter 1 for ADENINE
Enter 2 for GUANINE
Enter 3 for THYMINE
Enter 4 for CYTOSINE
ENTER 5 for URACIL
ENTER YOUR CHOICE
1
ADENINE
Systematic (IUPAC) name 7H-purin-6-amine
Synonyms 6-aminopurine
Identifiers CAS number 73-24-5 PubChem 190
Chemical data
Formula C5H5N5
Mol. weight 135.127
SMILES NC1=NC=NC2=C1N=CN2
Physical data
Melt. point
360 - 365 °C (680 - 689 °F)
To Start Again Press Y
COMMENT
AIM: TO DETERMINE THE INFORMATION ABOUT NUCLEOTIDES.
5.5 PROGRAM NO.5
print "ENTER THE AMINO ACID SEQUENCE\n";
$a=<stdin>;
chomp($a);
$x=length($a);
print "LENGTH:$x ";
@a=split('',$a);
$b= 0;
foreach $i(@a){
if($i eq 'G'){
$b = $b+75.07;
}
if($i eq 'A'){
$b = $b+89.09;
}
if($i eq 'V'){
$b = $b+117.15;
}
if($i eq 'L'){
$b = $b+131.18;
}
if($i eq 'I'){
$b = $b+131.18;
}
if($i eq 'S'){
$b = $b+105.09;
}
if($i eq 'T'){
$b = $b+119.12;
}
if($i eq 'C'){
$b = $b+121.15;
}
if($i eq 'M'){
$b = $b+149.21;
}
if($i eq 'F'){
$b = $b+165.19;
}
if($i eq 'Y'){
$b = $b+181.19;
}
if($i eq 'W'){
$b = $b+204.23;
}
if($i eq 'P'){
$b = $b+115.13;
}
if($i eq 'N'){
$b = $b+132.12;
}
if($i eq 'Q'){
$b = $b+146.15;
}
if($i eq 'D'){
$b = $b+133.10;
}
if($i eq 'E'){
$b = $b+147.13;
}
if($i eq 'K'){
$b = $b+146.19;
}
if($i eq 'H'){
$b = $b+155.16;
}
if($i eq 'R'){
$b = $b+174.20;
}
}
$c=$b-(18*($x-1));
print "The MOLECULAR WEIGHT of the sequence is $c";
RESULT
ENTER THE AMINO ACID SEQUENCE
AVLIST
LENGTH:6
THE MOLECULAR WEIGHT OF THE SEQUENCE IS 602.81
COMMENT
AIM:TO DETERMINE THE MOLECULAR WEIGHT OF THE AMINO ACIDS
SEQUENCE.
5.6 PROGRAM NO.6
$b= <stdin>;
chomp($b);
if ($b eq 'G')
{
print " GLYCINE=C2H5NO2";
}
if ($b eq 'A')
{
print " ALANINE=C3H7NO2";
}
if ($b eq 'V')
{
print " VALINE=C5H11NO2";
}
if ($b eq 'L')
{
print " LEUCINE = C6H13NO2";
}
if ($b eq 'I')
{
print " ISOLEUCINE=C6H13NO2";
}
if ($b eq 'S')
{
print " SERINE = C3H7NO3";
}
if ($b eq 'T')
{
print " THREONINE = C4H9NO3";
}
if ($b eq 'C')
{
print " CYSTEINE = C3H7NO2S";
}
if ($b eq 'M')
{
print " METHIONINE = C5H11NO2S";
}
if ($b eq 'F')
{
print " PHENYLALANINE = C9H11NO2";
}
if ($b eq 'Y')
{
print " TYROSINE = C9H11NO3";
}
if ($b eq 'W')
{
print " TRYPTOPHAN = C11H12N2O2";
}
if ($b eq 'P')
{
print " PROLINE = C5H9NO2";
}
if ($b eq 'N')
{
print " ASPARAGINE = C4H8N2O3";
}
if ($b eq 'Q')
{
print " GLUTAMINE = C5H10N2O3";
}
if ($b eq 'D')
{
print " ASPARTIC ACID = C4H7NO4";
}
if ($b eq 'E')
{
print " GLUTAMIC ACID = C5H9NO4";
}
if ($b eq 'K')
{
print " LYSINE = C6H14N2O2";
}
if ($b eq 'H')
{
print " HISTIDINE = C6H9N3O2";
}
if ($b eq 'R')
{
print " ARGININE = C6H14N4O2";
}
RESULT
A
ALANINE= C3H7NO2
COMMENT
AIM: TO DETERMINE THE MOLECULAR FORMULA OF A GIVEN AMINO ACID.
5.7 Program no. 7.
$a=<stdin>;
chomp($a);
print "original seq $a\n";
$rev= reverse $a;
print "reverse seq $rev\n";
# complement a copy of the original strand, so that $a is unchanged
($comp=$a)=~ tr/ATGC/TACG/;
print "complementary seq $comp\n";
$revcomp= reverse $comp;
print "reverse complementary seq $revcomp\n";
RESULT
ATGC
ORIGINAL SEQUENCE ATGC
REVERSE SEQUENCE CGTA
COMPLEMENTARY SEQUENCE TACG
REVERSE COMPLEMENTARY SEQUENCE GCAT
COMMENT
AIM: TO FIND THE REVERSE SEQUENCE, COMPLEMENTARY SEQUENCE,
REVERSE COMPLEMENTARY SEQUENCE.
5.8 PROGRAM NO.8
$a=<stdin>;
chomp($a);
$l= length($a);
@a= split('',$a);
$A =0;
$T =0;
$C =0;
$G =0;
foreach $i(@a){
if ($i eq 'A')
{
$A=$A+1;
}
if ($i eq 'T')
{
$T=$T+1;
}
if ($i eq 'C')
{
$C=$C+1;
}
if ($i eq 'G')
{
$G=$G+1;
}
}
print "Adenine = $A\n";
print "Cytosine= $C\n";
print "Guanine = $G\n";
print "Thymine = $T\n";
print "length= $l\n";
RESULT
ATCG
Adenine = 1
Cytosine= 1
Guanine = 1
Thymine = 1
length= 4
COMMENT
AIM: TO IDENTIFY THE NUMBER OF NUCLEOTIDES IN THE SEQUENCE.
5.9 PROGRAM NO 9
$a=<stdin>;
chomp($a);
$l= length($a);
@a= split('',$a);
$A =0;
$T =0;
$C =0;
$G =0;
foreach $i(@a){
if ($i eq 'A')
{
$A=$A+1;
}
if ($i eq 'T')
{
$T=$T+1;
}
if ($i eq 'C')
{
$C=$C+1;
}
if ($i eq 'G')
{
$G=$G+1;
}
}
print "Adenine = $A\n";
print "Cytosine= $C\n";
print "Guanine = $G\n";
print "Thymine = $T\n";
print "length= $l\n";
RESULT
ATCG
Adenine = 1
Cytosine= 1
Guanine = 1
Thymine = 1
length= 4
COMMENT
AIM: TO IDENTIFY THE NUMBER OF NUCLEOTIDES AND THE LENGTH IN
THE SEQUENCE.
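The character-by-character loop used for counting can also be shortened. The following is a sketch of two alternatives, not the author's method: a hash keyed by base, and the tr operator, which returns the number of characters it matched. The names %count and $gc are assumptions for the example.

```perl
use strict;
use warnings;

my $seq = "ATCG";

# One counter per base, kept in a hash instead of four separate scalars.
my %count = (A => 0, T => 0, C => 0, G => 0);
for my $base (split //, $seq) {
    $count{$base}++ if exists $count{$base};
}

# tr with an empty replacement list counts matches without changing $seq.
my $gc = ($seq =~ tr/GC//);

print "Adenine = $count{A}\n";
print "Cytosine= $count{C}\n";
print "Guanine = $count{G}\n";
print "Thymine = $count{T}\n";
print "G+C = $gc\n";
print "length= ", length($seq), "\n";
```

The hash version ignores any character that is not A, T, C or G, so stray symbols in the input do not create extra counters.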
5.10 PROGRAM NO.10
$dna="D:/orf/as.txt";
# stop with a message if the file cannot be opened
open(DNA,$dna) or die "Cannot open $dna: $!";
@dna=<DNA>;
close(DNA);
$dna=join('',@dna);
$dna=~s/\s//g;
print "$dna";
do{
sub A
{
$l= length($dna);
print"$l\n";
}
sub B
{
$dna= reverse ($dna);
print" reverse seq $dna\n";
}
sub C
{
# work on a copy so that $dna itself is not changed
($comp=$dna)=~ tr/ATGC/TACG/;
print"complementary seq $comp\n";
}
sub D
{
$dn= reverse ($dna);
$dn=~ tr/ATGC/TACG/;
print "reverse complementary $dn\n";
}
sub E
{
@a= split('',$dna);
$A =0;
$T =0;
$C =0;
$G =0;
foreach $i(@a){
if ($i eq 'A')
{
$A=$A+1;
}
if ($i eq 'T')
{
$T=$T+1;
}
if ($i eq 'C')
{
$C=$C+1;
}
if ($i eq 'G')
{
$G=$G+1;
}
}
print "Adenine = $A\n";
print "Cytosine= $C\n";
print "Guanine = $G\n";
print "Thymine = $T\n";
$e= ($A* 313.21) + ($T* 288.20) + ($G * 329.21) + ($C* 289.19) - (18.02);
print "$e";
}
print"\nenter 1 for length";
print"\nenter 2 for reverse\n";
print"enter 3 for complementary\n";
print"enter 4 for reverse complementary\n";
print"enter 5 for molecular weight of the sequence\n";
$a =<stdin>;
if($a==1)
{
A;
}
if($a==2)
{
B;
}
if($a==3)
{
C;
}
if($a==4)
{
D;
}
if($a==5)
{
E;
}
print "\nTo Start Again Press Y";
$y=<stdin>;
chomp($y);
}
while($y eq 'Y');
RESULT
enter 1 for length
enter 2 for reverse
enter 3 for complementary
enter 4 for reverse complementary
enter 5 for molecular weight of the sequence
5
ADENINE=2
THYMINE=2
GUANINE=2
CYTOSINE=2
MOL. WT.= 2421.6
To Start Again Press Y
COMMENT
AIM: TO DETERMINE THE LENGTH, REVERSE, COMPLEMENTARY, REVERSE COMPLEMENTARY AND MOLECULAR WEIGHT OF THE GIVEN DNA SEQUENCE USING FILE HANDLING.
6.0 APPENDIX
6.1 What is Perl?
Perl is a high-level programming language with an eclectic heritage written by Larry
Wall and a cast of thousands. It derives from the ubiquitous C programming language
and to a lesser extent from sed, awk, the Unix shell, and at least a dozen other tools and
languages. Perl's process, file, and text manipulation facilities make it particularly well-
suited for tasks involving quick prototyping, system utilities, software tools, system
management tasks, database access, graphical programming, networking, and world wide
web programming. These strengths make it especially popular with system administrators
and CGI script authors, but mathematicians, geneticists, journalists, and even managers
also use Perl.
6.2 Variables & Data Types
A variable is a named location in memory that is used to hold data that may be modified
by the program. Perl has three schemes for keeping data during program execution: scalars,
arrays of scalars (also known as lists), and hashes. Arrays are grouped scalars indexed by
number, while hashes are indexed by strings.
Scalars
The most basic kind of data structure in Perl is the scalar variable. Scalar variables can
hold both strings and numbers.
$bodytemp = 98.6;
sets the scalar variable $bodytemp to 98.6, but you can also assign a
string to exactly the same variable:
$bodytemp = 'normal';
Perl will also accept numbers as strings,
$bodytemp = '098.6';
and still performs arithmetic and other operations on them.
Arrays
An array variable is a list of scalars, hence in Perl they are often referred to as lists. They
have the same format as scalar variables except that they are prefixed by an @ symbol.
The following statements:
@valine = ("gtg", "gtt", "gta", "gtc");
@hydrophobics = ("valine", "leucine","isoleucine");
@weights = (117.15, 131.17, 131.17);
assign a four element list to the array variable @valine and a three element list to the
array variables @hydrophobics and @weights.
Hashes
Basically hashes are arrays which are accessed by a string. They are also referred to as
associative arrays. To define a hash we can use the usual parentheses notation, but the
variable itself is prefixed by a % sign. Suppose we want to store all the hydrophobic amino
acids with their molecular weights in a single data structure. It would look like this:
%molyweights = ("valine", 117.15,
"leucine", 131.17,
"isoleucine", 131.17);
Assigning the hash to an array, as in @data = %molyweights;, gives a list array @data that has
an element for every string and scalar in the hash %molyweights.
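A small sketch of how such a hash might be used, assuming the same three entries; the => form is equivalent to a comma and is used here only for readability.

```perl
use strict;
use warnings;

my %molyweights = (
    "valine"     => 117.15,
    "leucine"    => 131.17,
    "isoleucine" => 131.17,
);

# Look up one value by its string key.
print "valine: $molyweights{valine}\n";

# Walk through all keys in alphabetical order.
for my $name (sort keys %molyweights) {
    printf "%-10s %.2f\n", $name, $molyweights{$name};
}

# Flattening the hash gives alternating keys and values (order not guaranteed).
my @data = %molyweights;
print "elements in \@data: ", scalar(@data), "\n";
```

A single element is written with $ and curly braces, because one value of a hash is a scalar.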
6.3 Quotes & Strings
\t tab
\n newline
\b backspace
\a alarm (bell)
\$ literal $
\@ literal @
\\ literal \
(special characters)
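The difference between the two kinds of quotes, and the escapes in the table above, can be sketched as follows. The variable names and the address user@example.org are made up for the illustration.

```perl
use strict;
use warnings;

my $seq = "ttattagcc";

# Double quotes: variables are interpolated and escapes are interpreted.
my $double = "seq: $seq\n";

# Single quotes: the text is taken literally, including $seq and \n.
my $single = 'seq: $seq\n';

# A literal @ or $ inside double quotes needs a backslash.
my $email = "user\@example.org";
my $price = "cost: \$5";

print $double;
print $single, "\n";
print "$email\n$price\n";
```

Double quotes are the usual choice for output that contains variables; single quotes are safer for fixed text that happens to contain $ or @.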
6.4 Operators
Precedence
Use parentheses when in doubt.
Arithmetic Operators
Math in Perl
x**y exponentiation
-x negation
x/y division
x*y multiplication
x+y addition
x-y subtraction
Auto-increment and Auto-decrement: ++ and -- work as increment and decrement
Assignment Operators: = is the ordinary assignment operator
String Operators: . concatenates two strings. For example,
$a = 'winter'.'green'; # $a is wintergreen
6.5 Testing
Primarily for Numeric Comparison
== TRUE if the left argument is numerically equal to the
right argument; otherwise FALSE.
!= TRUE if the left argument is numerically not equal to
the right argument; otherwise FALSE
< TRUE if the left argument is numerically less than the
right argument; otherwise FALSE
> TRUE if the left argument is numerically greater than
the right argument; otherwise FALSE
<= TRUE if the left argument is numerically less than or
equal to the right argument; otherwise FALSE
>= TRUE if the left argument is numerically greater than
or equal to the right argument; otherwise FALSE
<=> returns -1, 0, or 1 depending on whether the left
argument is numerically less than, equal to, or greater
than the right argument
Primarily for String and Character Comparison
eq returns TRUE if the left argument is stringwise equal to
the right argument; otherwise FALSE
ne returns TRUE if the left argument is stringwise not equal
to the right argument; otherwise FALSE
lt returns TRUE if the left argument is stringwise less than
the right argument; otherwise FALSE
gt returns TRUE if the left argument is stringwise greater
than the right argument; otherwise FALSE
le returns TRUE if the left argument is stringwise less than
or equal to the right argument; otherwise FALSE
ge returns TRUE if the left argument is stringwise greater
than or equal to the right argument; otherwise FALSE
cmp returns -1, 0, or 1 depending on whether the left
argument is stringwise less than, equal to, or greater than
the right argument
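The three-way operators <=> and cmp are most often seen inside sort. A minimal sketch, with example data chosen only for illustration:

```perl
use strict;
use warnings;

my @weights = (131.18, 75.07, 117.15);
my @names   = ("valine", "glycine", "leucine");

# <=> orders numerically, cmp orders as strings.
my @by_number = sort { $a <=> $b } @weights;
my @by_name   = sort { $a cmp $b } @names;

# The same two values can compare differently as numbers and as strings.
my $num = (10 <=> 9);       # 10 is numerically greater than 9
my $str = ("10" cmp "9");   # but "10" sorts before "9" as a string

print "@by_number\n";
print "@by_name\n";
print "numeric: $num, stringwise: $str\n";
```

Choosing the wrong family of operators is a common source of wrongly sorted numbers, which is why the two sets are listed separately above.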
6.6 Boolean Expressions
You can also use logical AND, OR and NOT to create more complex expressions:
($a && $b) Are $a AND $b TRUE ?
($a || $b) Is either $a OR $b TRUE ?
!($a) Is $a FALSE ?
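A brief sketch of these logical operators applied to a sequence check. The variable names are hypothetical, and the ten-base limit is an arbitrary choice for the example.

```perl
use strict;
use warnings;

my $seq = "ATGCGT";

# Two simple tests, stored as 1 (TRUE) or 0 (FALSE).
my $is_dna  = ($seq =~ /^[ATGC]+$/) ? 1 : 0;   # only A, T, G, C present
my $is_long = (length($seq) > 10)   ? 1 : 0;   # more than ten bases

# Combine them with AND, OR and NOT.
my $short_dna = ($is_dna && !$is_long) ? 1 : 0;
my $either    = ($is_dna || $is_long)  ? 1 : 0;

# Perl treats 0, "0" and the empty string as FALSE; other strings are TRUE.
my @false_values = grep { !$_ } (0, "0", "", "ATG", 1);

print "short DNA: $short_dna\n";
print "either:    $either\n";
print "false values found: ", scalar(@false_values), "\n";
```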
6.7 Important Perl Functions
Any function in the list below may be used either with or without parentheses around its
arguments.
Input and Output
print - output a list to the screen or a file
SYNOPSIS
print FILEHANDLE LIST
print LIST
open - open a file
SYNOPSIS
open FILEHANDLE, FILENAME
close - close a file
SYNOPSIS
close FILEHANDLE
close
String Functions
length - return the number of characters in a string
SYNOPSIS
length EXPR
reverse - reverse a string or a list
SYNOPSIS
reverse STRING
reverse LIST
substr - get or alter a portion of a string
SYNOPSIS
substr EXPR,OFFSET,LEN,REPLACEMENT
substr EXPR,OFFSET,LEN
substr EXPR,OFFSET
index - left-to-right substring search
SYNOPSIS
index STR, SUBSTR, POSITION
index STR, SUBSTR
rindex - right-to-left substring search
SYNOPSIS
rindex STR,SUBSTR,POSITION
rindex STR,SUBSTR
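The string functions above, applied to a short DNA string. This is an illustrative sketch; the sequence and variable names are assumptions.

```perl
use strict;
use warnings;

my $dna = "ATGCGTGACATG";

my $len         = length($dna);           # number of characters
my $first_codon = substr($dna, 0, 3);     # 3 characters from offset 0
my $tail        = substr($dna, 9);        # from offset 9 to the end
my $rev         = reverse $dna;           # scalar context reverses the string

my $first = index($dna, "ATG");           # first match, searching left to right
my $next  = index($dna, "ATG", 1);        # first match at or after position 1
my $last  = rindex($dna, "ATG");          # last match, searching right to left

# The four-argument form of substr replaces part of the string in place.
my $copy = $dna;
substr($copy, 0, 3, "atg");

print "$len $first_codon $tail\n";
print "$rev\n";
print "$first $next $last\n";
print "$copy\n";
```

Positions are counted from 0, and index and rindex return -1 when the substring is not found.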
Numeric Functions
abs - absolute value function
cos - cosine function
exp - raise e to a power
int - get the integer portion of a number
log - retrieve the natural logarithm for a number
sin - return the sine of a number
sqrt - square root function
6.7.1 Metacharacters
Metacharacters are used to broaden the capabilities of a pattern to match multiple strings
or in specific locations. The following are recognized:
. Match any character (except newline)
^ Match the beginning of the line
$ Match the end of the line (or before newline at the end)
| Alternation
( ) Grouping
[ ] Character class
metacharacters
6.7.2 Character Classes: Perl also provides some predefined character classes. The
following can be used in place of their bracketed alternatives:
\w Match a "word" character [a-zA-Z_0-9]
\W Match a non-word character [^a-zA-Z_0-9]
\s Match a whitespace character [ \t\n\r\f]
\S Match a non-whitespace character [^ \t\n\r\f]
\d Match a digit character [0-9]
\D Match a non-digit character [^0-9]
predefined character classes
6.7.3 Quantifiers
* Match 0 or more times, same as {0,}
+ Match 1 or more times, same as {1,}
? Match 1 or 0 times, same as {0,1}
{n} Match exactly n times
{n,} Match at least n times
{n,m} Match at least n but not more than m times
quantifiers
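A few of the metacharacters, character classes and quantifiers from the tables above, tried on an example mRNA string. The sequence and the variable names are made up for this sketch.

```perl
use strict;
use warnings;

my $mrna = "AUGGCCUUUUAA";

# . with the {3} quantifier and /g: split the string into codons.
my @codons = $mrna =~ /(.{3})/g;

# ^ and $ anchor a pattern to the start or the end of the string.
my $starts_with_aug = ($mrna =~ /^AUG/)            ? 1 : 0;
my $ends_with_stop  = ($mrna =~ /(UAA|UAG|UGA)$/)  ? 1 : 0;

# {2,} matches at least two repeats; the capture holds the whole run.
my ($u_run) = $mrna =~ /(U{2,})/;

# A character class with + checks that only valid RNA letters occur.
my $valid = ($mrna =~ /^[AUGC]+$/) ? 1 : 0;

# \d+ matches one or more digits, here pulled out of a label.
my ($number) = "seq42" =~ /(\d+)/;

print "@codons\n";
print "$starts_with_aug $ends_with_stop $u_run $valid $number\n";
```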
if - conditional branching
SYNTAX
if (EXPR) {BLOCK}
if (EXPR) {BLOCK} else {BLOCK}
if (EXPR) {BLOCK} elsif (EXPR) {BLOCK} ... else {BLOCK}
for - C-style looping structure
SYNTAX
for (INITIALIZE; TEST; INCREMENT) {BLOCK}
foreach - iterates over a list
SYNTAX
foreach VAR (LIST) {BLOCK}
while - loop structure
SYNTAX
while (EXPR) {BLOCK}
do {BLOCK} while (EXPR)
until - loop structure
SYNTAX
until (EXPR) {BLOCK}
do {BLOCK} until (EXPR)
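The control structures above can be seen together in one small sketch. The sequence, the labels and the variable names are assumptions chosen for the illustration.

```perl
use strict;
use warnings;

my @bases = split //, "ATCGGA";

# foreach: tally each base in a hash.
my %count;
foreach my $base (@bases) {
    $count{$base}++;
}

# for: C-style loop that counts G and C by index.
my $gc = 0;
for (my $i = 0; $i < @bases; $i++) {
    $gc++ if $bases[$i] eq 'G' or $bases[$i] eq 'C';
}

# if / elsif / else: classify the sequence by its G+C share.
my $label;
if ($gc * 2 > @bases) {
    $label = "GC-rich";
} elsif ($gc * 2 == @bases) {
    $label = "balanced";
} else {
    $label = "AT-rich";
}

# while: step through the sequence three bases at a time.
my ($pos, $codons) = (0, 0);
while ($pos + 3 <= @bases) {
    $codons++;
    $pos += 3;
}

# do ... until: the block runs at least once before the test.
my $countdown = 3;
do {
    $countdown--;
} until ($countdown == 0);

print "A=$count{A} G=$count{G} GC=$gc ($label), codons=$codons\n";
```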
7.0 CONCLUSION
We have only touched the tip of the iceberg here. Beyond pure Perl projects, we could also manage joint C and Perl projects under this infrastructure. The infrastructure is built in Perl, which means that it is extremely portable, running on platforms ranging from Linux to Windows to S/390. Once we get used to this infrastructure, we will find it invaluable for all the projects we work on. We will never have to write an install script again, and through the use of well-formed test cases, we can have a far higher level of confidence that our program is performing the way it was intended.
Perl scripts which build dynamic data for a web site, and are already coded to return HTML data, can benefit from offering PDF output options to users by relying on the external program HTMLDOC, which already does all the hard work of transforming HTML into PDF.
We're the first to admit that calling HTMLDOC externally is not the most elegant solution in the world; sometimes, though, sheer functionality and the smile on our users' faces is worth more than any elegance!