beginning perl for bioinformatics-rvs

PRACTICAL EXTRACTION & REPORT LANGUAGE

By Raghvendra Sachan

CONTENTS

1.0 INTRODUCTION TO PERL
1.1 PERL FACTS
1.2 WHY PERL?
2.0 HISTORY OF PERL
3.0 BIOINFORMATICS (GENERAL VIEW)
4.0 BIOINFORMATICS USING PERL
4.1 PROGRAMMING CONCEPTS
4.2 VARIABLE
4.3 STRING OPERATION
5.0 PERL PROGRAMS
5.1 TO FIND OUT THE FIRST ORF IN THE GIVEN mRNA SEQUENCE
5.2 TO FIND OUT 6 ORFs IN THE GIVEN DNA SEQUENCE
5.3 TO DETERMINE THE INFORMATION ABOUT 20 AMINO ACIDS
5.4 TO DETERMINE THE INFORMATION ABOUT NUCLEOTIDES
5.5 TO DETERMINE THE MOLECULAR WEIGHT OF THE AMINO ACID SEQUENCE
5.6 TO DETERMINE THE MOLECULAR FORMULA OF THE AMINO ACIDS
5.7 TO FIND THE REVERSE, COMPLEMENTARY AND REVERSE COMPLEMENTARY SEQUENCE
5.8 TO IDENTIFY THE NUMBER OF NUCLEOTIDES IN THE SEQUENCE
5.9 TO IDENTIFY THE NUMBER OF NUCLEOTIDES AND LENGTH IN THE SEQUENCE
5.10 TO DETERMINE MOL. WT. OF THE DNA SEQ. USING FILE HANDLING
6.0 APPENDIX
6.1 WHAT IS PERL?
6.2 VARIABLES & DATA TYPES
6.3 QUOTES AND STRINGS
6.4 OPERATORS
6.5 TESTING
6.6 BOOLEAN EXPRESSIONS
6.7 IMPORTANT PERL FUNCTIONS
7.0 CONCLUSION

1.0 Introduction to Perl

Perl is an interpreted language optimized for scanning arbitrary text files, extracting information from those text files, and printing reports based on that information. It's also a good language for many system management tasks. The language is intended to be practical (easy to use, efficient, complete) rather than beautiful (tiny, elegant, minimal). It combines (in the author's opinion, anyway) some of the best features of C, sed, awk, and sh, so people familiar with those languages should have little difficulty with it. (Language historians will also note some vestiges of csh, Pascal, and even BASIC-PLUS.) Expression syntax corresponds quite closely to C expression syntax.

http://www.activestate.com/Products/ActivePerl/

This is the officially blessed version of Perl for Windows. It is released by ActiveState. ActivePerl can be downloaded for free, or we can order the ActiveCD from them. It comes with a wealth of widely used third-party libraries such as Tk, LWP, and the XML bundle.

Whatever operating system we are on, this is a valid choice, whether that is Windows or a UNIX-based operating system such as Linux, FreeBSD, or Mac OS X.

The official documentation system for Perl is POD, or "Plain Old Documentation". It is

powerful and widely used.

1.1 Perl Facts

Perl is a stable, cross platform programming language.

It is used for mission critical projects in the public and private sectors.

Perl is Open Source software, licensed under its Artistic License, or the GNU

General Public License (GPL).

Perl was created by Larry Wall.

Perl 1.0 was released to usenet's alt.comp.sources in 1987

PC Magazine named Perl a finalist for its 1998 Technical Excellence Award in the

Development Tool category.

1.2 Why Perl?

Perl takes the best features from other languages, such as C, awk, sed, sh, and

BASIC, among others.

Perl database integration interface (DBI) supports third-party databases including

Oracle, Sybase, Postgres, MySQL and others.

Perl works with HTML, XML, and other mark-up languages.

Perl supports Unicode.

Perl is Y2K compliant.

Perl supports both procedural and object-oriented programming.

Perl interfaces with external C/C++ libraries through XS or SWIG.

Perl is extensible. There are over 500 third party modules available from the

Comprehensive Perl Archive Network (CPAN).

The Perl interpreter can be embedded into other systems.

2.0 HISTORY OF PERL

-- Larry Wall when asked if he learned Perl from the perl source

PERL 1.000

Perl 1.000 is unleashed upon the world. Some people take Perl's birthday seriously: behold as Randal sings Happy Birthday to Larry's answering machine. The description from the original man page sums up this new language well. (18 December)

PERL 2.000

Perl 2.000 released. (5 June) Some of the enhancements from Perl1 included:

New regexp routines derived from Henry Spencer's.

Support for /(foo|bar)/.

Support for /(foo)*/ and /(foo)+/.

\s for whitespace, \S for non-, \d for digit, \D nondigit

PERL 3.000

Perl 3.000 is released and is distributed by Larry for the first time under the terms of the

GNU Public License. A few of the new features: (18 Oct)

Perl can now handle binary data correctly and has functions to pack and unpack

binary structures into arrays or lists. You can now do arbitrary ioctl functions.

You can now pass things to subroutines by reference.

Debugger enhancements.

PERL 4.000

Perl 4.000 is released and includes an artistic license as well as the GPL. (21 March)

Linus Torvalds releases the first version of Linux. Linus had wanted to name it Freax

(free + freak + unix) but the site administrator liked Linux better. It was distributed under

the GNU Public License. (July).

PERL 5.000

The much anticipated Perl 5.000 is unveiled. It was a complete rewrite of Perl.

A few of the features and pitfalls are: (18 October)

Objects.

The documentation is much more extensive and perldoc along with pod is

introduced.

Lexical scoping available via my. eval can see the current lexical variables.

The preferred package delimiter is now :: rather than '.

New functions include: abs(), chr(), uc(), ucfirst(), lc(), lcfirst(),

chomp(), glob()

There is now an English module that provides human readable translations for

cryptic variable names.

Several previously added features have been subsumed under the new keywords use

and no.

Pattern matches may now be followed by an m or s modifier to explicitly request

multiline or singleline semantics. An s modifier makes . match newline.

@ now always interpolates an array in double-quotish strings. Some programs may

now need to use backslash to protect any @ that shouldn't interpolate.

It is no longer syntactically legal to use whitespace as the name of a variable, or as a

delimiter for any kind of quote construct.

The -w switch is much more informative.

=> is now a synonym for comma. This is useful as documentation for arguments that

come in pairs, such as initializers for associative arrays, or named arguments to a

subroutine.

Perl 5.001 is released. (13 March)

Perl 5.002 announced which introduced, among other things, subroutine prototypes and

sysopen(). (29 February)

3.0 Bioinformatics Definition -General view

Bioinformatics derives knowledge from computer analysis of biological data. These can

consist of the information stored in the genetic code, but also experimental results from

various sources, patient statistics, and scientific literature. Research in bioinformatics

includes method development for storage, retrieval, and analysis of the data.

Bioinformatics is a rapidly developing branch of biology and is highly interdisciplinary,

using techniques and concepts from informatics, statistics, mathematics, chemistry,

biochemistry, physics, and linguistics. It has many practical applications in different

areas of biology and medicine.

4.0 Bioinformatics using Perl

Bioinformatics, the use of computers in biology research, has been increasing in

importance during the past decade as the Human Genome Project went from its

beginning to the announcement last year of a "draft" of the complete sequence of human

DNA.

The importance of programming in biology stretches back before the previous decade.

And it certainly has a significant future now that it is a recognized part of research into

many areas of medicine and basic biological research. This may not be news to

biologists. But Perl programmers may be surprised to find that their handsome language

has become one of the most popular, if not the most popular, of the computer languages used in

bioinformatics.

4.1 Programming Concepts

Program = a text file that contains instructions for the computer to follow

Programming Language = a set of commands that the computer understands

(via a “command interpreter”)

Input = data that is given to the program

Output = something that is produced by the program

Programming

Write the program (with a text editor)

Run the program

Look at the output

Correct the errors (debugging)

Repeat

(computers are VERY dumb -they do exactly what you tell them to do, so be

careful what you ask for…)

String

Text is handled in Perl as a string

This basically means that you have to put quotes around any piece of text that is

not an actual Perl instruction.

Perl has two kinds of quotes - single ' '

and double " "

(they are different - more about this later)

Print

Perl uses the term “print” to create output

Without a print statement, you won’t know what your program has done

You need to tell Perl to put a line break at the end of a printed line

o Use the "\n" (newline) escape sequence

o Include the quotes (double quotes, so that "\n" is interpreted)

o The "\" character is called an escape - Perl uses it a lot
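
A small sketch of how print, the newline escape and the two kinds of quotes behave. The variable $seq is only an illustration and is not part of the original text.

```perl
use strict;
use warnings;

# $seq is a hypothetical variable used only for this illustration.
my $seq = "ATGC";

# Double quotes interpolate variables and escapes such as \n.
my $double = "Sequence: $seq\n";

# Single quotes keep the text exactly as typed.
my $single = 'Sequence: $seq\n';

print $double;          # prints: Sequence: ATGC (followed by a newline)
print $single, "\n";    # prints: Sequence: $seq\n (literally)
```

Without the "\n" the next output would continue on the same line.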

Numbers and Functions

Perl handles numbers in most common formats:

456

5.6743

6.3E-26

Mathematical functions work pretty much as you would expect:

4+7,6*4 ,43-27, 256/12,2/(3-5)
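
The expressions above can be tried directly. This is a minimal sketch; the variable names are chosen here for illustration only.

```perl
use strict;
use warnings;

my $sum        = 4 + 7;        # 11
my $product    = 6 * 4;        # 24
my $difference = 43 - 27;      # 16
my $quotient   = 256 / 12;     # 21.333... (division is floating point)
my $grouped    = 2 / (3 - 5);  # -1
my $small      = 6.3E-26;      # scientific notation is accepted directly

print "$sum $product $difference $grouped\n";
printf "%.2f\n", $quotient;    # two decimal places: 21.33
```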

4.2 Variable

To be useful at all, a program needs to be able to store information from one

line to the next

Perl stores information in variables

A variable name starts with the “$” symbol, and it can store strings or

numbers

o Variables are case sensitive

o Give them sensible names

Use the “=”sign to assign values to variables

$a = 100;

$s = "ttattagcc";

4.3 String operation

Strings (text) in variables can be used for some math-like operations

Concatenate (join) use the dot . operator

$seq1 = "ACTG";

$seq2 = "GGCTA";

$seq3 = $seq1 . $seq2;

print $seq3;

ACTGGGCTA

String comparison (are they the same, > or <)

eq (equal )

ne (not equal )

ge (greater or equal )

gt (greater than )

lt (less than )

le (less or equal )
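
A short sketch of the string comparison operators in use, assuming the two example sequences from the concatenation example. Strings are compared character by character in dictionary order.

```perl
use strict;
use warnings;

my $seq1 = "ACTG";
my $seq2 = "GGCTA";

print "same\n"       if $seq1 eq $seq2;   # not printed: the strings differ
print "different\n"  if $seq1 ne $seq2;   # printed
print "seq1 first\n" if $seq1 lt $seq2;   # printed: "A" sorts before "G"
```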

5.0 PERL PROGRAMS

5.1 PROGRAM NO.1

print "ENTER THE m-RNA SEQUENCE\n";

$a=<stdin>;

chomp($a);

$len=length($a);

print" THE LENGTH OF DNA SEQUENCE IS $len\n";

$c=0;

$g='';

while ($c<$len){

$b=substr($a,$c,3);

if ($b=~ /AUG/)

{

$g=$g.'M';

}

if ($b=~ /(UUA)|(UUG)|(CUU)|(CUC)|(CUA)|(CUG)/)

{

$g=$g.'L';

}

if ($b=~ /(UCU)|(UCC)|(UCA)|(UCG)|(AGU)|(AGC)/)

{

$g=$g.'S';

}

if ($b=~ /(AUU)|(AUC)|(AUA)/)

{

$g=$g.'I';

}

if ($b=~ /(UUU)|(UUC)/)

{

$g=$g.'F';

}

if ($b=~ /(GUU)|(GUC)|(GUA)|(GUG)/)

{

$g=$g.'V';

}

if ($b=~ /(CCU)|(CCC)|(CCA)|(CCG)/)

{

$g=$g.'P';

}

if ($b=~ /(ACU)|(ACC)|(ACA)|(ACG)/)

{

$g=$g.'T';

}

if ($b=~ /(GCU)|(GCC)|(GCA)|(GCG)/)

{

$g=$g.'A';

}

if ($b=~ /(UAU)|(UAC)/)

{

$g=$g.'Y';

}

if ($b=~ /(UGU)|(UGC)/)

{

$g=$g.'C';

}

if ($b=~ /UGG/)

{

$g=$g.'W';

}

if ($b=~ /(CAU)|(CAC)/)

{

$g=$g.'H';

}

if ($b=~ /(CAA)|(CAG)/)

{

$g=$g.'Q';

}

if ($b=~ /(CGU)|(CGC)|(CGA)|(CGG)|(AGA)|(AGG)/)

{

$g=$g.'R';

}

if ($b=~ /(AAU)|(AAC)/)

{

$g=$g.'N';

}

if ($b=~ /(AAA)|(AAG)/)

{

$g=$g.'K';

}

if ($b=~ /(GGU)|(GGC)|(GGA)|(GGG)/)

{

$g=$g.'G';

}

if ($b=~ /(GAA)|(GAG)/)

{

$g=$g.'E';

}

if ($b=~ /(GAU)|(GAC)/)

{

$g=$g.'D';

}

if ($b=~ /(UAA)|(UAG)|(UGA)/)

{

$g=$g.'#';

}

$c=$c+3;

}

print"THE AMINO ACID IN THE SEQUENCE IN 1ST ORF IS $g";

RESULT

ENTER THE m-RNA SEQUENCE

AUCGAUCGAUGC

THE LENGTH OF DNA SEQUENCE IS 12

THE AMINO ACID IN THE SEQUENCE IN THE 1ST ORF IS IDRC

COMMENT

AIM: TO TRANSLATE THE FIRST ORF (READING FRAME) OF THE GIVEN mRNA SEQUENCE INTO AMINO ACIDS.
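
The same translation can be sketched with a hash lookup instead of a chain of pattern matches. This is an alternative under stated assumptions, not the method of the program above: the names %codons_for, %amino_acid_for and translate() are hypothetical, the table is the standard genetic code, and '#' marks a stop codon as in the program.

```perl
use strict;
use warnings;

# Standard genetic code grouped by amino acid (one-letter codes, '#' = stop).
my %codons_for = (
    M   => [qw(AUG)],
    L   => [qw(UUA UUG CUU CUC CUA CUG)],
    S   => [qw(UCU UCC UCA UCG AGU AGC)],
    I   => [qw(AUU AUC AUA)],
    F   => [qw(UUU UUC)],
    V   => [qw(GUU GUC GUA GUG)],
    P   => [qw(CCU CCC CCA CCG)],
    T   => [qw(ACU ACC ACA ACG)],
    A   => [qw(GCU GCC GCA GCG)],
    Y   => [qw(UAU UAC)],
    C   => [qw(UGU UGC)],
    W   => [qw(UGG)],
    H   => [qw(CAU CAC)],
    Q   => [qw(CAA CAG)],
    R   => [qw(CGU CGC CGA CGG AGA AGG)],
    N   => [qw(AAU AAC)],
    K   => [qw(AAA AAG)],
    G   => [qw(GGU GGC GGA GGG)],
    E   => [qw(GAA GAG)],
    D   => [qw(GAU GAC)],
    '#' => [qw(UAA UAG UGA)],
);

# Invert the table so that each codon maps to its amino acid.
my %amino_acid_for;
for my $aa (keys %codons_for) {
    $amino_acid_for{$_} = $aa for @{ $codons_for{$aa} };
}

# translate() is a hypothetical helper name, not part of the original program.
sub translate {
    my ($mrna) = @_;
    my $protein = '';
    # Step through complete codons only; a trailing 1-2 bases are ignored.
    for (my $i = 0; $i + 3 <= length($mrna); $i += 3) {
        my $codon = uc substr($mrna, $i, 3);
        # '?' marks a codon that is not in the table (e.g. contains N).
        $protein .= exists $amino_acid_for{$codon} ? $amino_acid_for{$codon} : '?';
    }
    return $protein;
}

print translate("AUCGAUCGAUGC"), "\n";   # IDRC
```

Because every codon is looked up exactly once, a mistyped codon in one pattern cannot silently shadow or duplicate another amino acid.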

5.2 PROGRAM NO.2.

print "ENTER THE DNA SEQUENCE\n";

$dna=<stdin>;

chomp($dna);

$dna1=$dna;

$len=length($dna);

$dna=~tr/ATGC/UACG/;

print"\nmRNA: $dna\n";

print "\nLENGTH: $len\n";

sub dna

{

$i=0;

$b=3;

$p='';

while($i<$len)

{

$seq=substr($dna,$i,$b);

if ($seq=~/GC./i) {$p.='A';}

if ($seq=~/UG[UC]/i) {$p.='C';}

if ($seq=~/GA[UC]/i) {$p.='D';}

if ($seq=~/GA[AG]/i) {$p.='E';}

if ($seq=~/UU[UC]/i) {$p.='F';}

if ($seq=~/GG./i) {$p.='G';}

if ($seq=~/CA[UC]/i) {$p.='H';}

if ($seq=~/AU[UCA]/i) {$p.='I';}

if ($seq=~/AA[AG]/i) {$p.='K';}

if ($seq=~/UU[AG]/i) {$p.='L';}

if ($seq=~/AUG/i) {$p.='M';}

if ($seq=~/AA[UC]/i) {$p.='N';}

if ($seq=~/CC./i) {$p.='P';}

if ($seq=~/CA[AG]/i) {$p.='Q';}

if ($seq=~/CG.|AG[AG]/i){$p.='R';}

if ($seq=~/UC.|AG[UC]/i){$p.='S';}

if ($seq=~/AC./i) {$p.='T';}

if ($seq=~/GU./i) {$p.='V';}

if ($seq=~/UGG/i) {$p.='W';}

if ($seq=~/UA[UC]/i) {$p.='Y';}

if ($seq=~/CU./i) {$p.='L';}

if ($seq=~/UA[AG]|UGA/i){$p.='*';}

$i=$i+3;

}

return $p;

}

print"\nFIRST READING FRAME ";

$q=dna();

print": $q\n";

print"\nSECOND READING FRAME ";

$dna=substr($dna,1,$len);

$p=dna();

print": $p\n";

print"\nTHIRD READING FRAME ";

$dna=substr($dna,1,$len);

$x=dna();

print": $x\n";

$rev=reverse($dna1);

$rev=~ tr/ACTG/UGAC/;

print "\nREVERSE mRNA : $rev\n ";

# translate the reverse strand from here on

$dna=$rev;

print"\nFOURTH READING FRAME ";

$q1=dna();

print": $q1\n";

print"\nFIFTH READING FRAME ";

$dna=substr($dna,1,$len);

$p1=dna();

print": $p1\n";

print"\nSIXTH READING FRAME ";

$dna=substr($dna,1,$len);

$x1=dna();

print": $x1\n";

RESULT

ENTER THE DNA SEQUENCE

ATGCGTGACATG

mRNA : UACGCACUGUAC

LENGTH: 12

FIRST READING FRAME : YALY

SECOND READING FRAME : THC

THIRD READING FRAME : RTV

REVERSE mRNA : CAUGUCACGCAU

FOURTH READING FRAME : HVTH

FIFTH READING FRAME : MSR

SIXTH READING FRAME : CHA

COMMENT

AIM: TO FIND OUT THE 6 READING FRAMES (ORFs) IN THE GIVEN DNA SEQUENCE.

5.3 PROGRAM NO.3

do{

print"*" x 50;

print "\nEnter E for ESSENTIAL AMINO ACIDS\n";

print "Enter N for NONESSENTIALS\n";

print"*" x 50;

$a=<stdin>;

chomp($a);

if ($a eq 'E')

{

print "Isoleucine(I)\n

Leucine(L)\n

Lysine(K)\n

Methionine(M)\n

Phenylalanine(F)\n

Threonine(T)\n

Tryptophan(W)\n

Valine(V)\n

Arginine(R)\n

Histidine(H)\n";

}

if($a eq 'N')

{

print "Alanine(A)\n

Asparagine(N)\n

Aspartate(D)\n

Cysteine(C)\n

Glutamate(E)\n

Glutamine(Q)\n

Glycine(G)\n

Proline(P)\n

Serine(S)\n

Tyrosine(Y)\n";

}

$b= <stdin>;

chomp($b);

if ($b eq 'I')

{

print "Isoleucine\n

Chemical formula: C6H13NO2\n

Molecular mass: 131.18 [1] g•mol-1\n

Systematic name:\n

(2S,3S)-2-amino-3-methylpentanoic acid\n

Abbreviations: I, Ile\n

Synonyms:\n

{2/α}-amino-{3/β}-methylvaleric acid\n

3-methyl-{/erythro-}norvaline\n

Amino-sec-butyl-acetic acid\n

Amino(1-methylpropyl)-acetic acid\n";

}

if ($b eq 'L')

{

print"Leucine\n

Chemical formula: C6H13NO2\n

Molecular mass: 131.18 g•mol-1\n

Systematic name:\n

(S)-2-amino-4-methyl-pentanoic acid\n

Abbreviations: L, Leu\n

Synonyms:\n

{(S)-/L-}2-amino-4-methylvaleric acid\n

4-methyl-norvaline\n

α-aminoisocaproic acid\n";

}

if ($b eq 'K')

{

print"Lysine\n

Systematic name (S)-2,6-Diaminohexanoic acid\n

Abbreviations Lys,k\n

Chemical formula C6H14N2O2\n

Molecular mass 146.19 g/mol\n

PubChem 876\n

Melting point 224 °C\n";

}

if ($b eq 'M')

{

print"Methionine\n

Systematic name (S)-2-amino-4-(methylsulfanyl)-\n

butanoic acid\n

Abbreviations Met,m\n

Chemical formula C5H11NO2S\n

Molecular mass 149.21 g mol-1\n";


}

if ($b eq 'F')

{

print "Phenylalanine\n

Systematic name 2-Amino-3-phenyl-propanoic acid\n

Abbreviations Phe,F\n

Chemical formula C9H11NO2\n";



}

if ($b eq 'T')

{

print" Threonine\n

Systematic name (2S,3R)-2-Amino-3-hydroxybutanoic acid\n

Abbreviations Thr,T\n

Chemical formula C4H9NO3\n";



}

if ($b eq 'W')

{

print" Tryptophan\n

Systematic name (S)-2-Amino-3-(1H-indol-3-yl)-propionic acid\n

Abbreviations Trp,W\n


Molecular mass 204.23 g mol−1\n

Melting point 289 °C\n";

}

if ($b eq 'V')

{

print" Valine\n

Systematic name (S)-2-amino-3-methyl-butanoic acid\n

Abbreviations Val,V\n";




}

if ($b eq 'R')

{

print"Arginine\n

Systematic (IUPAC) name\n

2-amino-5-(diaminomethylideneamino)pentanoic acid\n

Chemical data\n

Formula C6H14N4O2\n

Mol. weight 174.2\n";

}

if ($b eq 'H')

{

print" Histidine\n

Systematic (IUPAC) name\n

2-amino-3-(3H-imidazol-4-yl)propanoic acid\n

Chemical data\n

Formula C6H9N3O2\n";


}

if ($b eq 'A')

{

print" Alanine\n


(S)-2-aminopropanoic acid\n

Chemical data\n

Formula C3H7NO2\n";


}

if ($b eq 'N')

{

print"Asparagine\n


(2S)-2-amino-3-carbamoyl-propanoic acid\n

Chemical data\n

Formula C4H8N2O3\n";


}

if ($b eq 'C')

{

print "Cysteine\n


(2R)-2-amino-3-sulfanyl-propanoic acid\n

Chemical data\n

Formula C3H7NO2S\n";


}

if ($b eq 'D')

{

print"Aspartic acid\n


(2S)-2-aminobutanedioic acid\n

Chemical data\n

Formula C4H7NO4\n";


}

if ($b eq 'E')

{

print"Glutamic acid\n


(2S)-2-aminopentanedioic acid\n

Chemical data\n

Formula C5H9NO4\n";


}

if ($b eq 'Q')

{

print" Glutamine\n


(2S)-2-amino-4-carbamoyl-butanoic acid\n

Chemical data\n

Formula C5H10N2O3\n";


}

if ($b eq 'G')

{

print" Glycine\n


aminoethanoic acid\n

Chemical data\n

Formula C2H5NO2\n";


}

if ($b eq 'P')

{

print" Proline\n

Systematic name (S)-Pyrrolidine-2-carboxylic acid\n

Abbreviations Pro,P\n";




}

if ($b eq 'S')

{

print" Serine\n

Systematic name (S)-2-amino-3-hydroxypropanoic acid\n

Abbreviations Ser,S\n



Melting point 228 °C \n";

}

if ($b eq 'Y')

{

print"Tyrosine\n

Systematic name (S)-2-Amino-3-(4-hydroxy-phenyl)-propanoic acid\n

Abbreviations Tyr,Y\n";




}

print "\nEnter Again press Y";

$y=<stdin>;

chomp($y);

}

while($y eq 'Y');

RESULT

ENTER E FOR ESSENTIAL AMINO ACIDS

ENTER N FOR NON ESSENTIALS

E

LIST OF ESSENTIAL AMINO ACIDS

I

Isoleucine

Chemical formula: C6H13NO2

Molecular mass: 131.18 [1] g·mol-1

Systematic name:

(2S,3S)-2-amino-3-methylpentanoic acid

Abbreviations: I, Ile

Synonyms:

{2/α}-amino-{3/β}-methylvaleric acid

3-methyl-{/erythro-}norvaline

Amino-sec-butyl-acetic acid

Amino(1-methylpropyl)-acetic acid

To Start Again Press Y

COMMENT

AIM:TO DETERMINE THE INFORMATION ABOUT 20 AMINO ACIDS.

5.4 PROGRAM NO.4

do{

print"*" x 30;

print "\nEnter 1 for ADENINE\n";

print "Enter 2 for GUANINE\n";

print "Enter 3 for THYMINE\n";

print "Enter 4 for CYTOSINE\n";

print "ENTER 5 for URACIL\n";

print "ENTER YOUR CHOICE\n";

$a =<stdin>;

if($a==1)

{

print "ADENINE\n

Systematic (IUPAC) name 7H-purin-6-amine\n

Synonyms 6-aminopurine\n

Identifiers CAS number 73-24-5 PubChem 190\n

Chemical data\n

Formula C5H5N5\n

Mol. weight 135.127\n

SMILES NC1=NC=NC2=C1N=CN2\n

Physical data\n

Melt. point\n

360 - 365 °C (-265 °F)\n";

}

if ($a==2)

{

print "GUANINE\n

Systematic name 2-amino-1H-purin-6(9H)-one\n

Other names 2-amino-6-oxo-purine,2-aminohypoxanthine\n

Molecular formula C5H5N5O\n

SMILES NC(NC1=O)=NC2=C1N=CN2\n

Molar mass 151.1261 g/mol\n

Appearance White amorphous solid\n

CAS number [73-40-5]\n

Melting point 360°C (633.15 K) deco.\n

Boiling point Sublimes\n";

}

if ($a==3)

{

print "THYMINE\n

Chemical name 5-Methylpyrimidine-2,4(1H,3H)-dione\n



Melting point 316 - 317 °C\n

CAS number 65-71-4\n

SMILES CC1=CNC(NC1=O)=O\n";

}

if ($a==4)

{

print "CYTOSINE\n

Chemical name 4-Aminopyrimidin-2(1H)-one\n

Chemical formula C4H5N3O\n


Melting point 320 - 325°C (decomp)\n

CAS number 71-30-7\n

SMILES NC1=NC(NC=C1)=O\n";

}

if ($a==5)

{

print "URACIL\n

Systematic name Pyrimidine-2,4(1H,3H)-dione\n

Other names Uracil, 2-oxy-4-oxy pyrimidine\n

Molecular formula C4H4N2O2\n

Molar mass 112.08676 g/mol\n

Appearance Solid\n

CAS number [66-22-8]\n

Melting point 335 °C (608 K)\n

Boiling point N/A\n

Acidity (pKa) basic pKa = -3.4\n

acidic pKa = 9.389\n";

}

print "\nTo Start Again press Y";

$y=<stdin>;

chomp($y);

}

while($y eq 'Y')

RESULT

Enter 1 for ADENINE

Enter 2 for GUANINE

Enter 3 for THYMINE

Enter 4 for CYTOSINE

ENTER 5 for URACIL

ENTER YOUR CHOICE

1

ADENINE

Systematic (IUPAC) name 7H-purin-6-amine

Synonyms 6-aminopurine

Identifiers CAS number 73-24-5 PubChem 190

Chemical data

Formula C5H5N5

Mol. weight 135.127

SMILES NC1=NC=NC2=C1N=CN2

Physical data

Melt. point

360 - 365 °C (-265 °F)


COMMENT

AIM: TO DETERMINE THE INFORMATION ABOUT NUCLEOTIDES.

5.5 PROGRAM NO.5

print "ENTER THE AMINO ACID SEQUENCE\n";

$a=<stdin>;

chomp($a);

$x=length($a);

print "LENGTH:$x ";

@a=split('',$a);

$b= 0;

foreach $i(@a){

if($i eq 'G'){

$b = $b+75.07;

}

if($i eq 'A'){

$b = $b+89.09;

}

if($i eq 'V'){

$b = $b+117.15;

}

if($i eq 'L'){

$b = $b+131.18;

}

if($i eq 'I'){

$b = $b+131.18;

}

if($i eq 'S'){

$b = $b+105.09;

}

if($i eq 'T'){

$b = $b+119.12;

}

if($i eq 'C'){

$b = $b+121.15;

}

if($i eq 'M'){

$b = $b+149.21;

}

if($i eq 'F'){

$b = $b+165.19;

}

if($i eq 'Y'){

$b = $b+181.19;

}

if($i eq 'W'){

$b = $b+204.23;

}

if($i eq 'P'){

$b = $b+115.13;

}

if($i eq 'N'){

$b = $b+132.12;

}

if($i eq 'Q'){

$b = $b+146.15;

}

if($i eq 'D'){

$b = $b+133.10;

}

if($i eq 'E'){

$b = $b+147.13;

}

if($i eq 'K'){

$b = $b+146.19;

}

if($i eq 'H'){

$b = $b+155.16;

}

if($i eq 'R'){

$b = $b+174.20;

}

}

$c=$b-(18*($x-1));

print "The MOLECULAR WEIGHT of the sequence is $c";

RESULT

ENTER THE AMINO ACID SEQUENCE

AVLI

LENGTH:4

THE MOLECULAR WEIGHT OF THE SEQUENCE IS 414.6

COMMENT

AIM:TO DETERMINE THE MOLECULAR WEIGHT OF THE AMINO ACIDS

SEQUENCE.
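
The chain of if statements can also be written as a hash lookup. The following is a minimal sketch, assuming the same residue masses and the same water correction of 18 per peptide bond as the program above; %mass and peptide_weight() are names chosen here for illustration.

```perl
use strict;
use warnings;

# Free amino acid masses in g/mol, copied from the program above.
my %mass = (
    G => 75.07,  A => 89.09,  V => 117.15, L => 131.18, I => 131.18,
    S => 105.09, T => 119.12, C => 121.15, M => 149.21, F => 165.19,
    Y => 181.19, W => 204.23, P => 115.13, N => 132.12, Q => 146.15,
    D => 133.10, E => 147.13, K => 146.19, H => 155.16, R => 174.20,
);

# peptide_weight() is a hypothetical helper name used for this sketch.
sub peptide_weight {
    my ($seq) = @_;
    my $total = 0;
    for my $residue (split //, uc $seq) {
        die "Unknown residue '$residue'\n" unless exists $mass{$residue};
        $total += $mass{$residue};
    }
    # One water (taken as 18 g/mol, as in the program above) is lost per peptide bond.
    my $bonds = length($seq) > 0 ? length($seq) - 1 : 0;
    return $total - 18 * $bonds;
}

printf "%.2f\n", peptide_weight("AVLI");   # 414.60
```

Unlike the if chain, this version stops with an error on a letter that is not one of the 20 amino acids instead of silently skipping it.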

5.6 PROGRAM NO.6

$b= <stdin>;

chomp($b);

if ($b eq 'G')

{

print " GLYCINE=C2H5NO2";

}

if ($b eq 'A')

{

print " ALANINE=C3H7NO2";

}

if ($b eq 'V')

{

print " VALINE=C5H11NO2";

}

if ($b eq 'L')

{

print " LEUCINE = C6H13NO2";

}

if ($b eq 'I')

{

print " ISOLEUCINE=C6H13NO2";

}

if ($b eq 'S')

{

print " SERINE = C3H7NO3";

}

if ($b eq 'T')

{

print " THREONINE = C4H9NO3";

}

if ($b eq 'C')

{

print " CYSTEINE = C3H7NO2S";

}

if ($b eq 'M')

{

print " METHIONINE = C5H11NO2S";

}

if ($b eq 'F')

{

print " PHENYLALANINE = C9H11NO2";

}

if ($b eq 'Y')

{

print " TYROSINE = C9H11NO3";

}

if ($b eq 'W')

{

print " TRYPTOPHAN = C11H12N2O2";

}

if ($b eq 'P')

{

print " PROLINE = C5H9NO2";

}

if ($b eq 'N')

{

print " ASPARAGINE = C4H8N2O3";

}

if ($b eq 'Q')

{

print " GLUTAMINE = C5H10N2O3";

}

if ($b eq 'D')

{

print " ASPARTIC ACID = C4H7NO4";

}

if ($b eq 'E')

{

print " GLUTAMIC ACID = C5H9NO4";

}

if ($b eq 'K')

{

print " LYSINE = C6H14N2O2";

}

if ($b eq 'H')

{

print " HISTIDINE = C6H9N3O2";

}

if ($b eq 'R')

{

print " ARGININE = C6H14N4O2";

}

RESULT

A

ALANINE= C3H7NO2

COMMENT

AIM: TO DETERMINE THE MOLECULAR FORMULA OF THE GIVEN AMINO ACID.

5.7 Program no. 7.

$a=<stdin>;

chomp($a);

print "original seq $a\n";

$r= reverse $a;

print "reverse seq $r\n";

($c=$a)=~ tr/ATGC/TACG/;

print "complementary seq $c\n";

$rc= reverse $c;

print "reverse complementary seq $rc\n";

RESULT

ATGC

original seq ATGC

reverse seq CGTA

complementary seq TACG

reverse complementary seq GCAT

COMMENT

AIM: TO FIND THE REVERSE SEQUENCE, COMPLEMENTARY SEQUENCE,

REVERSE COMPLEMENTARY SEQUENCE.

5.8 PROGRAM NO.8

$a=<stdin>;

chomp($a);

$l= length($a);

@a= split('',$a);

$A =0;

$T =0;

$C =0;

$G =0;

foreach $i(@a){

if ($i eq 'A')

{

$A=$A+1;

}

if ($i eq 'T')

{

$T=$T+1;

}

if ($i eq 'C')

{

$C=$C+1;

}

if ($i eq 'G')

{

$G=$G+1;

}

}

print "Adenine = $A";

print "Cytosine= $C";

print "Guanine = $G";

print "Thymine = $T";

print "length= $l";

RESULT

ATCG

Adenine = 1Cytosine= 1Guanine = 1Thymine = 1length= 4

COMMENT

AIM: TO IDENTIFY THE NUMBER OF NUCLEOTIDES IN THE SEQUENCE.

5.9 PROGRAM NO 9

$a=<stdin>;

chomp($a);

$l= length($a);

@a= split('',$a);

$A =0;

$T =0;

$C =0;

$G =0;

foreach $i(@a){

if ($i eq 'A')

{

$A=$A+1;

}

if ($i eq 'T')

{

$T=$T+1;

}

if ($i eq 'C')

{

$C=$C+1;

}

if ($i eq 'G')

{

$G=$G+1;

}

}

print "Adenine = $A\n";

print "Cytosine= $C\n";

print "Guanine = $G\n";

print "Thymine = $T\n";

print "length= $l\n";

RESULT

ATCG

Adenine = 1

Cytosine= 1

Guanine = 1

Thymine = 1

length= 4

COMMENT

AIM: TO IDENTIFY THE NUMBER OF NUCLEOTIDES AND THE LENGTH IN

THE SEQUENCE.
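
The four counters can be kept in a single hash, so one line does the counting. This is a sketch under stated assumptions: count_bases() is a hypothetical helper name, and letters other than A, C, G and T are simply ignored.

```perl
use strict;
use warnings;

# count_bases() is a hypothetical helper name used for this sketch.
sub count_bases {
    my ($seq) = @_;
    my %count = (A => 0, C => 0, G => 0, T => 0);
    for my $base (split //, uc $seq) {
        $count{$base}++ if exists $count{$base};   # ignore anything else
    }
    return %count;
}

my %count = count_bases("ATCG");
print "$_ = $count{$_}\n" for sort keys %count;
print "length = ", length("ATCG"), "\n";
```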

5.10 PROGRAM NO.10

$dna="D:/orf/as.txt";

open(DNA,$dna) or die "Cannot open $dna: $!";

@dna=<DNA>;

$dna=join('',@dna);

$dna=~s/\s//g;

print "$dna";

do{

sub A

{

$l= length($dna);

print"$l\n";

}

sub B

{

$dna= reverse ($dna);

print" reverse seq $dna\n";

}

sub C

{

$dna=~ tr/ATGC/TACG/;

print"COMPLIMENTARY seq $dna\n";

}

sub D

{

$dna=~ tr/ATGC/TACG/;

$dn= reverse ($dna);

print "reverse complimentary $dn\n";

}

sub E

{

@a= split('',$dna);

$A =0;

$T =0;

$C =0;

$G =0;

foreach $i(@a){

if ($i eq 'A')

{

$A=$A+1;

}

if ($i eq 'T')

{

$T=$T+1;

}

if ($i eq 'C')

{

$C=$C+1;

}

if ($i eq 'G')

{

$G=$G+1;

}

}

print "Adenine = $A\n";

print "Cytosine= $C\n";

print "Guanine = $G\n";

print "Thymine = $T\n";

$e= ($A* 313.21) + ($T* 288.20) + ($G * 329.21) + ($C* 289.19) - (18.02);

print "$e";

}

print"\nenter 1 for length";

print"\nenter 2 for reverse\n";

print"enter 3 for complimentary\n";

print"enter 4 for reverse complimentary\n";

print"enter 5 for molecular weight of the sequence\n";

$a =<stdin>;

if($a==1)

{

A;

}

if($a==2)

{

B;

}

if($a==3)

{

C;

}

if($a==4)

{

D;

}

if($a==5)

{

E;

}

print "\nTo Start Again Press Y";

$y=<stdin>;

chomp($y);

}

while($y eq 'Y')

RESULT

enter 1 for length

enter 2 for reverse

enter 3 for complimentary

enter 4 for reverse complimentary

enter 5 for molecular weight of the sequence

5

ADENINE=2

THYMINE=2

GUANINE=2

CYTOSINE=2

MOL. WT.= 2421.6


COMMENT

AIM: TO DETERMINE THE LENGTH, REVERSE, COMPLEMENTARY, REVERSE COMPLEMENTARY AND MOLECULAR WEIGHT OF THE GIVEN DNA SEQUENCE USING FILE HANDLING.
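
The file handling step can be sketched on its own with a lexical filehandle and the three-argument form of open. The temporary file and the name read_sequence() are assumptions made so that the example is self-contained; in practice the path would point at an existing sequence file.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small example file so the sketch can run anywhere.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
print {$tmp_fh} "ATGC\nGGTA\n";
close $tmp_fh or die "Cannot close $path: $!";

# read_sequence() is a hypothetical helper name used for this sketch.
sub read_sequence {
    my ($file) = @_;
    open(my $fh, '<', $file) or die "Cannot open $file: $!";
    my $seq = join '', <$fh>;   # read all lines and join them
    close $fh;
    $seq =~ s/\s//g;            # remove newlines and other whitespace
    return $seq;
}

my $dna = read_sequence($path);
print "$dna\n";                        # ATGCGGTA
print "length ", length($dna), "\n";   # length 8
```

Checking the result of open matters here: without it, a wrong path gives an empty sequence rather than an error message.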

6.0 APPENDIX

6.1 What is Perl?

Perl is a high-level programming language with an eclectic heritage written by Larry

Wall and a cast of thousands. It derives from the ubiquitous C programming language

and to a lesser extent from sed, awk, the Unix shell, and at least a dozen other tools and

languages. Perl's process, file, and text manipulation facilities make it particularly well-

suited for tasks involving quick prototyping, system utilities, software tools, system

management tasks, database access, graphical programming, networking, and world wide

web programming. These strengths make it especially popular with system administrators

and CGI script authors, but mathematicians, geneticists, journalists, and even managers

also use Perl.

6.2 Variables & Data Types

A variable is a named location in memory that is used to hold data that may be modified

by the program. Perl has three schemes for keeping data during program execution: scalars,

arrays of scalars (also known as lists), and hashes. Arrays are grouped scalars indexed by

number, while hashes are indexed by strings.

Scalars

The most basic kind of data structure in Perl is the scalar variable. Scalar variables can

hold both strings and numbers.

$bodytemp = 98.6;

sets the scalar variable $bodytemp to 98.6, but you can also assign a

string to exactly the same variable:

$bodytemp = 'normal';

Perl will also accept numbers as strings,

$bodytemp = '098.6';

and still performs arithmetic and other operations on them.

Arrays

An array variable is a list of scalars, hence in Perl they are often referred to as lists. They

have the same format as scalar variables except that they are prefixed by an @ symbol.

The following statements:

@valine = ("gtg", "gtt", "gta", "gtc");

@hydrophobics = ("valine", "leucine","isoleucine");

@weights = (117.15, 131.17, 131.17);

assign a four element list to the array variable @valine and a three element list to the

array variables @hydrophobics and @weights.

Hashes

Basically hashes are arrays which are accessed by a string. They are also referred to as

associative arrays. To define a hash we can use the usual parenthesis notation, but the

array itself is prefixed by a % sign. Suppose we want to store all the hydrophobic amino

acids with their molecular weights in a single data structure. It would look like this:

%molyweights = ("valine", 117.15,

"leucine", 131.17,

"isoleucine", 131.17);

After the assignment @data = %molyweights, @data is an array that has an element for every key (string) and value (scalar) in the hash

%molyweights.
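
A short sketch of how arrays and hashes are read back, assuming the data from the examples above. The hash is named %weights here only for brevity.

```perl
use strict;
use warnings;

my @valine  = ("gtg", "gtt", "gta", "gtc");
my %weights = ("valine", 117.15, "leucine", 131.17, "isoleucine", 131.17);

print "first codon: $valine[0]\n";          # arrays are indexed from 0
print "codons: ", scalar(@valine), "\n";    # number of elements: 4
print "leucine: $weights{leucine}\n";       # hashes are indexed by string

# Flattening a hash gives its keys and values as one list (order not fixed).
my @data = %weights;
print "elements: ", scalar(@data), "\n";    # 6
```

Note that a single element is written with $ (for example $valine[0]) because the element itself is a scalar.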

6.3 Quotes & Strings

\t tab

\n newline

\b backspace

\a alarm (bell)

\$ literal $

\@ literal @

\\ literal \

(special characters)

6.4 Operators

Precedence

Use parentheses when in doubt.

Arithmetic Operators

Math in Perl

x**y exponentiation

-x negation

x/y division

x*y multiplication

x+y addition

x-y subtraction

Auto-increment and Auto-decrement ++ and -- work as increment and decrement

Assignment Operators = is the ordinary assignment operator

String Operators . Concatenates two strings. For example,

$a = 'winter'.'green'; # $a is wintergreen
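
The operators listed above can be combined as in this minimal sketch. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $power = 2 ** 10;            # 1024
my $mixed = 2 + 3 * 4;          # 14: * binds tighter than +
my $paren = (2 + 3) * 4;        # 20: parentheses override precedence

my $count = 5;
$count++;                       # now 6
$count--;                       # back to 5

my $word = 'winter' . 'green';  # wintergreen
$word .= '!';                   # .= appends in place

print "$power $mixed $paren $count $word\n";
```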

6.5 Testing

Primarily for Numeric Comparison

== TRUE if the left argument is numerically equal to the

right argument; otherwise FALSE.

!= TRUE if the left argument is numerically not equal to

the right argument; otherwise FALSE

< TRUE if the left argument is numerically less than the

right argument; otherwise FALSE

> TRUE if the left argument is numerically greater than

the right argument; otherwise FALSE

<= TRUE if the left argument is numerically less than or

equal to the right argument; otherwise FALSE

>= TRUE if the left argument is numerically greater than

or equal to the right argument; otherwise FALSE

<=> returns -1, 0, or 1 depending on whether the left

argument is numerically less than, equal to, or greater

than the right argument

Primarily for String and Character Comparison

eq returns TRUE if the left argument is stringwise equal to

the right argument; otherwise FALSE

ne returns TRUE if the left argument is stringwise not equal

to the right argument; otherwise FALSE

lt returns TRUE if the left argument is stringwise less than

the right argument; otherwise FALSE

gt returns TRUE if the left argument is stringwise greater

than the right argument; otherwise FALSE

le returns TRUE if the left argument is stringwise less than

or equal to the right argument; otherwise FALSE

ge returns TRUE if the left argument is stringwise greater

than or equal to the right argument; otherwise FALSE

cmp returns -1, 0, or 1 depending on whether the left

argument is stringwise less than, equal to, or greater than

the right argument
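
The difference between the numeric and the string operators shows up most clearly when sorting. A small sketch, with values chosen only for illustration:

```perl
use strict;
use warnings;

print "numeric equal\n" if 10 == 10.0;      # printed: same number
print "string equal\n"  if "10" eq "10.0";  # not printed: different text

my @lengths = (100, 9, 25);
my @as_numbers = sort { $a <=> $b } @lengths;   # 9 25 100
my @as_strings = sort { $a cmp $b } @lengths;   # 100 25 9

print "@as_numbers\n";
print "@as_strings\n";
```

With cmp the values are compared as text, so "100" comes before "25" because "1" sorts before "2".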

6.6 Boolean Expressions

You can also use logical AND, OR and NOT to create more complex expressions:

($a && $b) Are $a AND $b TRUE ?

($a || $b) Is either $a OR $b TRUE ?

!($a) Is $a FALSE ?

6.7 Important Perl Functions

Any function in the list below may be used either with or without parentheses around its

arguments.

Input and Output

print - output a list to the screen or a file

SYNOPSIS

print FILEHANDLE LIST

print LIST

open - open a file

SYNOPSIS

open FILEHANDLE, FILENAME

close - close a file

SYNOPSIS

close FILEHANDLE

close

String Functions

length - return the number of characters in a string

SYNOPSIS

length EXPR

reverse - reverse a string or a list

SYNOPSIS

reverse STRING

reverse LIST

substr - get or alter a portion of a string

SYNOPSIS

substr EXPR,OFFSET,LEN,REPLACEMENT

substr EXPR,OFFSET,LEN

substr EXPR,OFFSET

index - left-to-right substring search

SYNOPSIS

index STR, SUBSTR, POSITION

index STR, SUBSTR

rindex - right-to-left substring search

SYNOPSIS

rindex STR,SUBSTR,POSITION

rindex STR,SUBSTR
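
A brief sketch of the string functions above applied to a DNA string. The sequence is made up for the example.

```perl
use strict;
use warnings;

my $dna = "GGATGCCATGA";

print length($dna), "\n";              # 11
print substr($dna, 2, 3), "\n";        # ATG (offsets start at 0)
print index($dna, "ATG"), "\n";        # 2: first match from the left
print rindex($dna, "ATG"), "\n";       # 7: first match from the right
print index($dna, "TTT"), "\n";        # -1: not found
print scalar reverse($dna), "\n";      # AGTACCGTAGG
```

reverse needs scalar context to reverse a string; in list context it reverses the order of the list elements instead.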

Numeric Functions

abs - absolute value function

cos - cosine function

exp - raise e to a power

int - get the integer portion of a number

log - retrieve the natural logarithm for a number

sin - return the sine of a number

sqrt - square root function

6.7.1 Metacharacters

Metacharacters are used to broaden the capabilities of a pattern to match multiple strings

or in specific locations. The following are recognized:

. Match any character (except newline)

^ Match the beginning of the line

$ Match the end of the line (or before newline at the end)

| Alternation

( ) Grouping

[ ] Character class

metacharacters

6.7.2 Character Classes: Perl also provides some predefined character classes. The

following can be used in place of their bracketed alternatives:

\w Match a "word" character [a-zA-Z_0-9]

\W Match a non-word character [^a-zA-Z_0-9]

\s Match a whitespace character [ \t\n\r\f]

\S Match a non-whitespace character [^ \t\n\r\f]

\d Match a digit character [0-9]

\D Match a non-digit character [^0-9]

predefined character classes

6.7.3 Quantifiers

* Match 0 or more times, same as {0,}

+ Match 1 or more times, same as {1,}

? Match 1 or 0 times, same as {0,1}

{n} Match exactly n times

{n,} Match at least n times

{n,m} Match at least n but not more than m times

quantifiers
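
The metacharacters, character classes and quantifiers above can be combined as in the following sketch. The input line and its layout (a name, whitespace, then a DNA sequence) are assumptions made for the example.

```perl
use strict;
use warnings;

my $line = "seq1  ATGCCGTAA";

# \w+ is the name, \s+ the gap, [ACGT]+ one or more DNA letters up to the line end.
if ($line =~ /^(\w+)\s+([ACGT]+)$/) {
    print "name: $1, sequence: $2\n";
}

# Alternation with grouping: does the line end in a stop codon?
print "ends with stop codon\n" if $line =~ /(TAA|TAG|TGA)$/;

# Quantifier {3}: collect chunks of exactly three DNA letters.
my @codons = $line =~ /[ACGT]{3}/g;
print scalar(@codons), " codons\n";
```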

if - conditional branching

SYNTAX

if (EXPR) {BLOCK}

if (EXPR) {BLOCK} else {BLOCK}

if (EXPR) {BLOCK} elsif (EXPR) {BLOCK} ... else {BLOCK}

for - C-style looping structure

SYNTAX

for (INITIALIZE; TEST; INCREMENT) {BLOCK}

foreach - iterates over a list

SYNTAX

foreach VAR (LIST) {BLOCK}

while - loop structure

SYNTAX

while (EXPR) {BLOCK}

do {BLOCK} while (EXPR)

until - loop structure

SYNTAX

until (EXPR) {BLOCK}

do {BLOCK} until (EXPR)

7.0 CONCLUSION

We have only touched the tip of the iceberg here. Beyond just pure Perl projects, we could also manage C & Perl joint projects under this infrastructure. The infrastructure is built in Perl, which means that it is extremely portable, running on platforms ranging from Linux to Windows to S/390. Once we get used to this infrastructure, we will find it invaluable for all the projects we work on. We will never have to write an install script again, and through the use of well-formed test cases, we can have a far higher level of confidence that our program is performing the way it was intended.

Perl scripts which build dynamic data for a web site, and are already coded to return HTML data, can benefit from offering PDF output options to users by relying on the external program HTMLDOC, which already does all the hard work of transforming HTML into PDF.

We're the first to admit that calling HTMLDOC externally is not the most elegant solution in the world -- sometimes, though, sheer functionality and the smile on our users' faces are worth more than any elegance!
