Beginning Perl for Bioinformatics-RVS

PRACTICAL EXTRACTION & REPORT LANGUAGE
By Raghvendra Sachan

Upload: raghvendra-sachan

Post on 10-Apr-2015



CONTENTS

1.0 INTRODUCTION TO PERL
1.1 PERL FACTS
1.2 WHY PERL?
2.0 HISTORY OF PERL
3.0 BIOINFORMATICS (GENERAL VIEW)
4.0 BIOINFORMATICS USING PERL
4.1 PROGRAMMING CONCEPTS
4.2 VARIABLE
4.3 STRING OPERATION
5.0 PERL PROGRAMS
5.1 TO FIND OUT THE FIRST ORF IN THE GIVEN mRNA SEQUENCE
5.2 TO FIND OUT THE 6 ORFs IN THE GIVEN DNA SEQUENCE
5.3 TO DETERMINE THE INFORMATION ABOUT 20 AMINO ACIDS
5.4 TO DETERMINE THE INFORMATION ABOUT NUCLEOTIDES
5.5 TO DETERMINE THE MOLECULAR WEIGHT OF THE AMINO ACID SEQUENCE
5.6 TO DETERMINE THE MOLECULAR FORMULA OF THE GIVEN AMINO ACID
5.7 TO FIND THE REVERSE, COMPLEMENTARY AND REVERSE COMPLEMENTARY SEQUENCE
5.8 TO IDENTIFY THE NUMBER OF NUCLEOTIDES IN THE SEQUENCE
5.9 TO IDENTIFY THE NUMBER OF NUCLEOTIDES AND THE LENGTH OF THE SEQUENCE
5.10 TO DETERMINE THE MOL. WT. OF THE DNA SEQUENCE USING FILE HANDLING
6.0 APPENDIX
6.1 WHAT IS PERL?
6.2 VARIABLES & DATA TYPES
6.3 QUOTES AND STRINGS
6.4 OPERATORS
6.5 TESTING
6.6 BOOLEAN EXPRESSIONS
6.7 IMPORTANT PERL FUNCTIONS
7.0 CONCLUSION

1.0 Introduction to Perl

Perl is an interpreted language optimized for scanning arbitrary text files, extracting information from those text files, and printing reports based on that information. It's also a good language for many system management tasks. The language is intended to be practical (easy to use, efficient, complete) rather than beautiful (tiny, elegant, minimal). It combines (in the author's opinion, anyway) some of the best features of C, sed, awk, and sh, so people familiar with those languages should have little difficulty with it. (Language historians will also note some vestiges of csh, Pascal, and even BASIC-PLUS.) Expression syntax corresponds quite closely to C expression syntax.

http://www.activestate.com/Products/ActivePerl/

This is the officially blessed version of Perl for Windows. It is released by ActiveState. ActivePerl can be downloaded for free, or we can order the ActiveCD from them. It comes with a wealth of widely used third-party libraries such as Tk, LWP, and the XML bundle.

Whatever operating system we are on, Perl is a valid choice, especially if we happen to be on a UNIX-based operating system such as Linux, FreeBSD, or Mac OS X; it also runs on Windows.

The official documentation system for Perl is POD, or "Plain Old Documentation". It is powerful and widely used.

1.1 Perl Facts

Perl is a stable, cross-platform programming language.

It is used for mission-critical projects in the public and private sectors.

Perl is Open Source software, licensed under its Artistic License, or the GNU General Public License (GPL).

Perl was created by Larry Wall.

Perl 1.0 was released to Usenet's alt.comp.sources in 1987.

PC Magazine named Perl a finalist for its 1998 Technical Excellence Award in the Development Tool category.

1.2 Why Perl?

Perl takes the best features from other languages, such as C, awk, sed, sh, and BASIC, among others.

Perl's database integration interface (DBI) supports third-party databases including Oracle, Sybase, Postgres, MySQL and others.

Perl works with HTML, XML, and other mark-up languages.

Perl supports Unicode.

Perl is Y2K compliant.

Perl supports both procedural and object-oriented programming.

Perl interfaces with external C/C++ libraries through XS or SWIG.

Perl is extensible. There are over 500 third-party modules available from the Comprehensive Perl Archive Network (CPAN).

The Perl interpreter can be embedded into other systems.

2.0 HISTORY OF PERL

-- Larry Wall when asked if he learned Perl from the perl source

PERL 1.000

Perl 1.000 is unleashed upon the world. Some people take Perl's birthday seriously: behold as Randal sings Happy Birthday to Larry's answering machine. The description from the original man page sums up this new language well. (18 December 1987)

PERL 2.000

Perl 2.000 released. (5 June 1988) Some of the enhancements over Perl 1 included:

New regexp routines derived from Henry Spencer's.

Support for /(foo|bar)/.

Support for /(foo)*/ and /(foo)+/.

\s for whitespace, \S for non-whitespace, \d for digit, \D for non-digit.

PERL 3.000

Perl 3.000 is released and is distributed by Larry for the first time under the terms of the GNU General Public License. (18 October 1989) A few of the new features:

Perl can now handle binary data correctly and has functions to pack and unpack binary structures into arrays or lists. You can now do arbitrary ioctl functions.

You can now pass things to subroutines by reference.

Debugger enhancements.

PERL 4.000

Perl 4.000 is released and includes an Artistic License as well as the GPL. (21 March 1991)

Linus Torvalds releases the first version of Linux. Linus had wanted to name it Freax (free + freak + unix) but the site administrator liked Linux better. It was distributed under the GNU General Public License. (July)

PERL 5.000

The much anticipated Perl 5.000 is unveiled. It was a complete rewrite of Perl. (18 October 1994) A few of the features and pitfalls are:

Objects.

The documentation is much more extensive, and perldoc along with pod is introduced.

Lexical scoping is available via my. eval can see the current lexical variables.

The preferred package delimiter is now :: rather than '.

New functions include: abs(), chr(), uc(), ucfirst(), lc(), lcfirst(), chomp(), glob().

There is now an English module that provides human-readable translations for cryptic variable names.

Several previously added features have been subsumed under the new keywords use and no.

Pattern matches may now be followed by an m or s modifier to explicitly request multiline or singleline semantics. An s modifier makes . match newline.

@ now always interpolates an array in double-quotish strings. Some programs may now need to use backslash to protect any @ that shouldn't interpolate.

It is no longer syntactically legal to use whitespace as the name of a variable, or as a delimiter for any kind of quote construct.

The -w switch is much more informative.

=> is now a synonym for comma. This is useful as documentation for arguments that come in pairs, such as initializers for associative arrays, or named arguments to a subroutine.

Perl 5.001 is released. (13 March 1995)

Perl 5.002 is announced, which introduced, among other things, subroutine prototypes and sysopen(). (29 February 1996)

3.0 Bioinformatics Definition - General View

Bioinformatics derives knowledge from computer analysis of biological data. These can consist of the information stored in the genetic code, but also experimental results from various sources, patient statistics, and scientific literature. Research in bioinformatics includes method development for storage, retrieval, and analysis of the data. Bioinformatics is a rapidly developing branch of biology and is highly interdisciplinary, using techniques and concepts from informatics, statistics, mathematics, chemistry, biochemistry, physics, and linguistics. It has many practical applications in different areas of biology and medicine.

4.0 Bioinformatics using Perl

Bioinformatics, the use of computers in biology research, has been increasing in importance during the past decade as the Human Genome Project went from its beginning to the announcement last year of a "draft" of the complete sequence of human DNA.

The importance of programming in biology stretches back before the previous decade, and it certainly has a significant future now that it is a recognized part of research into many areas of medicine and basic biology. This may not be news to biologists, but Perl programmers may be surprised to find that their handsome language has become one of the most popular, if not the most popular, of the computer languages used in bioinformatics.

4.1 Programming Concepts

Program = a text file that contains instructions for the computer to follow

Programming Language = a set of commands that the computer understands (via a "command interpreter")

Input = data that is given to the program

Output = something that is produced by the program

Programming

Write the program (with a text editor)

Run the program

Look at the output

Correct the errors (debugging)

Repeat

(Computers are VERY dumb: they do exactly what you tell them to do, so be careful what you ask for…)

String

Text is handled in Perl as a string.

This basically means that you have to put quotes around any piece of text that is not an actual Perl instruction.

Perl has two kinds of quotes: single ' ' and double " " (they are different; more about this later).

Print

Perl uses the term "print" to create output.

Without a print statement, you won't know what your program has done.

You need to tell Perl to put a line break at the end of a printed line:

o Use the "\n" (newline) escape sequence

o Include the quotes

o The "\" character is called an escape; Perl uses it a lot
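As a small illustration of the points above, the following sketch prints the same text from a double-quoted and a single-quoted string. The variable names are made up for this example; the sequence text is arbitrary.

```perl
use strict;
use warnings;

# Double quotes: "\n" is turned into a real newline character.
my $with_newline = "ATGC\n";

# Single quotes: the backslash and the letter n stay as two literal characters.
my $literal = 'ATGC\n';

print $with_newline;                  # prints ATGC and moves to the next line
print $literal, "\n";                 # prints ATGC\n, then a real newline
print length($with_newline), "\n";    # 5 characters: A, T, G, C, newline
print length($literal), "\n";         # 6 characters: A, T, G, C, \, n
```

Counting the characters with length() is just a quick way to see that the two kinds of quotes really store different strings.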

Numbers and Functions

Perl handles numbers in most common formats:

456

5.6743

6.3E-26

Mathematical operations work pretty much as you would expect:

4+7, 6*4, 43-27, 256/12, 2/(3-5)
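The expressions listed above can be evaluated directly. This is a minimal sketch; the array name @results is chosen only for the example, and printf is used so that the division is shown with a fixed number of decimals.

```perl
use strict;
use warnings;

# The five expressions from the text, evaluated in order.
my @results = (4 + 7, 6 * 4, 43 - 27, 256 / 12, 2 / (3 - 5));

# Show each value with two decimal places.
printf "%.2f\n", $_ for @results;

# Scientific notation is an ordinary number as well.
my $tiny = 6.3E-26;
print "tiny is positive\n" if $tiny > 0;
```

Parentheses work as in ordinary arithmetic, so 2/(3-5) divides 2 by -2.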

4.2 Variable

To be useful at all, a program needs to be able to store information from one line to the next.

Perl stores information in variables.

A scalar variable name starts with the "$" symbol, and it can store strings or numbers.

o Variables are case sensitive

o Give them sensible names

Use the "=" sign to assign values to variables:

$a = 100;

$s = "ttattagcc";

4.3 String operation

Strings (text) in variables can be used for some math-like operations.

Concatenate (join): use the dot . operator

$seq1 = "ACTG";

$seq2 = "GGCTA";

$seq3 = $seq1 . $seq2;

print $seq3;

ACTGGGCTA

String comparison (are they the same, > or <):

eq (equal)

ne (not equal)

ge (greater or equal)

gt (greater than)

lt (less than)

le (less or equal)
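A short sketch of these operators in use, with the same two sequences as in the concatenation example. String comparison is alphabetical (character by character), which is why "ACTG" counts as less than "GGCTA".

```perl
use strict;
use warnings;

my $seq1 = "ACTG";
my $seq2 = "GGCTA";
my $seq3 = $seq1 . $seq2;      # join the two strings

print "$seq3\n";               # ACTGGGCTA

# eq / ne test whether two strings are identical or not.
print "seq1 is ACTG\n"         if $seq1 eq "ACTG";
print "seq1 and seq2 differ\n" if $seq1 ne $seq2;

# lt / gt compare alphabetically: 'A' comes before 'G'.
print "$seq1 sorts before $seq2\n" if $seq1 lt $seq2;
```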

5.0 PERL PROGRAMS

5.1 PROGRAM NO.1

print "ENTER THE m-RNA SEQUENCE\n";

$a=<stdin>;

chomp($a);

$len=length($a);

print" THE LENGTH OF DNA SEQUENCE IS $len\n";

$c=0;

$g='';

while ($c<$len){

$b=substr($a,$c,3);

if ($b=~ /AUG/)

{

$g=$g.'M';

}

if ($b=~ /(UUA)|(UUG)|(CUU)|(CUC)|(CUA)|(CUG)/)

{

$g=$g.'L';

}

if ($b=~ /(UCU)|(UCC)|(UCA)|(UCG)|(AGU)|(AGC)/)

{

$g=$g.'S';

}

if ($b=~ /(AUU)|(AUC)|(AUA)/)

{

$g=$g.'I';

}

if ($b=~ /(UUU)|(UUC)/)

{

$g=$g.'F';

}

if ($b=~ /(GUU)|(GUC)|(GUA)|(GUG)/)

{

$g=$g.'V';

}

if ($b=~ /(CCU)|(CCC)|(CCA)|(CCG)/)

{

$g=$g.'P';

}

if ($b=~ /(ACU)|(ACC)|(ACA)|(ACG)/)

{

$g=$g.'T';

}

if ($b=~ /(GCU)|(GCC)|(GCA)|(GCG)/)

{

$g=$g.'A';

}

if ($b=~ /(UAU)|(UAC)/)

{

$g=$g.'Y';

}

if ($b=~ /(UGU)|(UGC)/)

{

$g=$g.'C';

}

if ($b=~ /UGG/)

{

$g=$g.'W';

}

if ($b=~ /(CAU)|(CAC)/)

{

$g=$g.'H';

}

if ($b=~ /(CAA)|(CAG)/)

{

$g=$g.'Q';

}

if ($b=~ /(CGU)|(CGC)|(CGA)|(CGG)|(AGA)|(AGG)/)

{

$g=$g.'R';

}

if ($b=~ /(AAU)|(AAC)/)

{

$g=$g.'N';

}

if ($b=~ /(AAA)|(AAG)/)

{

$g=$g.'K';

}

if ($b=~ /(GGU)|(GGC)|(GGA)|(GGG)/)

{

$g=$g.'G';

}

if ($b=~ /(GAA)|(GAG)/)

{

$g=$g.'E';

}

if ($b=~ /(GAU)|(GAC)/)

{

$g=$g.'D';

}

if ($b=~ /(UAA)|(UAG)|(UGA)/)

{

$g=$g.'#';

}

$c=$c+3;

}

print"THE AMINO ACID IN THE SEQUENCE IN 1ST ORF IS $g";

RESULT

ENTER THE m-RNA SEQUENCE

AUCGAUCGAUGC

THE LENGTH OF DNA SEQUENCE IS 12

THE AMINO ACID IN THE SEQUENCE IN THE 1ST ORF IS IDRC

COMMENT

AIM: TO FIND OUT THE FIRST ORF IN THE GIVEN mRNA SEQUENCE.
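A long chain of if tests like the one in this program can also be written as a lookup table. The following is a minimal sketch of that alternative, not the program used for the result above: the names %codons_for, %amino_acid and translate_frame1 are invented for the example, the stop codons are shown as '#' as in the program, and any incomplete codon at the end is ignored.

```perl
use strict;
use warnings;

# One-letter amino acid code => list of mRNA codons ('#' marks a stop codon).
my %codons_for = (
    M => [qw(AUG)],                 W => [qw(UGG)],
    L => [qw(UUA UUG CUU CUC CUA CUG)],
    S => [qw(UCU UCC UCA UCG AGU AGC)],
    R => [qw(CGU CGC CGA CGG AGA AGG)],
    I => [qw(AUU AUC AUA)],         F => [qw(UUU UUC)],
    V => [qw(GUU GUC GUA GUG)],     P => [qw(CCU CCC CCA CCG)],
    T => [qw(ACU ACC ACA ACG)],     A => [qw(GCU GCC GCA GCG)],
    G => [qw(GGU GGC GGA GGG)],     Y => [qw(UAU UAC)],
    C => [qw(UGU UGC)],             H => [qw(CAU CAC)],
    Q => [qw(CAA CAG)],             N => [qw(AAU AAC)],
    K => [qw(AAA AAG)],             E => [qw(GAA GAG)],
    D => [qw(GAU GAC)],             '#' => [qw(UAA UAG UGA)],
);

# Invert the table so that each codon points at its amino acid.
my %amino_acid;
for my $aa (keys %codons_for) {
    $amino_acid{$_} = $aa for @{ $codons_for{$aa} };
}

# Translate the first reading frame, three bases at a time.
sub translate_frame1 {
    my ($mrna) = @_;
    my $protein = '';
    for (my $i = 0; $i + 3 <= length $mrna; $i += 3) {
        my $codon = uc substr($mrna, $i, 3);
        # Unknown triplets (for example ones containing N) are shown as '?'.
        $protein .= exists $amino_acid{$codon} ? $amino_acid{$codon} : '?';
    }
    return $protein;
}

print translate_frame1('AUCGAUCGAUGC'), "\n";   # IDRC
```

With a table, each codon appears exactly once, so a mistyped or missing codon is easier to spot than in a list of regular expressions.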

5.2 PROGRAM NO.2.

print "ENTER THE DNA SEQUENCE\n";

$dna=<stdin>;

chomp($dna);

$dna1=$dna;

$len=length($dna);

$dna=~tr/ATGC/UACG/;

print"\nmRNA: $dna\n";

print "\nLENGTH: $len\n";

sub dna

{

$i=0;

$b=3;

$p='';

while($i<$len)

{

$seq=substr($dna,$i,$b);

if ($seq=~/GC./i) {$p.='A';}

if ($seq=~/UG[UC]/i) {$p.='C';}

if ($seq=~/GA[UC]/i) {$p.='D';}

if ($seq=~/GA[AG]/i) {$p.='E';}

if ($seq=~/UU[UC]/i) {$p.='F';}

if ($seq=~/GG./i) {$p.='G';}

if ($seq=~/CA[UC]/i) {$p.='H';}

if ($seq=~/AU[UCA]/i) {$p.='I';}

if ($seq=~/AA[AG]/i) {$p.='K';}

if ($seq=~/UU[AG]/i) {$p.='L';}

if ($seq=~/AUG/i) {$p.='M';}

if ($seq=~/AA[UC]/i) {$p.='N';}

if ($seq=~/CC./i) {$p.='P';}

if ($seq=~/CA[AG]/i) {$p.='Q';}

if ($seq=~/CG.|AG[AG]/i){$p.='R';}

if ($seq=~/UC.|AG[UC]/i){$p.='S';}

if ($seq=~/AC./i) {$p.='T';}

if ($seq=~/GU./i) {$p.='V';}

if ($seq=~/UGG/i) {$p.='W';}

if ($seq=~/UA[UC]/i) {$p.='Y';}

if ($seq=~/CU./i) {$p.='L';}

if ($seq=~/UA[AG]|UGA/i){$p.='*';}

$i=$i+3;

}

return $p;

}

print"\nFIRST READING FRAME ";

$q=dna();

print": $q\n";

print"\nSECOND READING FRAME ";

$dna=substr($dna,1,$len);

$p=dna();

print": $p\n";

print"\nTHIRD READING FRAME ";

$dna=substr($dna,1,$len);

$x=dna();

print": $x\n";

$rev=reverse($dna1);

$rev=~ tr/ACTG/UGAC/;

print "\nREVERSE mRNA : $rev\n ";

# the sub dna() reads $dna, so the reverse strand has to be copied into it
$dna=$rev;

print"\nFOURTH READING FRAME ";

$q1=dna();

print": $q1\n";

print"\nFIFTH READING FRAME ";

$dna=substr($dna,1,$len);

$p1=dna();

print": $p1\n";

print"\nSIXTH READING FRAME ";

$dna=substr($dna,1,$len);

$x1=dna();

print": $x1\n";

RESULT

ENTER THE DNA SEQUENCE

ATGCGTGACATG

mRNA: UACGCACUGUAC

LENGTH: 12

FIRST READING FRAME : YALY

SECOND READING FRAME : THC

THIRD READING FRAME : RTV

REVERSE mRNA : CAUGUCACGCAU

FOURTH READING FRAME : HVTH

FIFTH READING FRAME : MSR

SIXTH READING FRAME : CHA

COMMENT

AIM: TO FIND OUT THE 6 ORFs IN THE GIVEN DNA SEQUENCE.
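The three reading frames of one strand differ only in where the first codon starts. As a hedged sketch of that idea, the hypothetical helper codons_in_frame below splits a sequence into codons from a given offset without modifying the sequence itself; translating the codons is left out to keep the example short.

```perl
use strict;
use warnings;

# Return the complete codons of $seq, starting at $offset (0, 1 or 2).
sub codons_in_frame {
    my ($seq, $offset) = @_;
    # /(...)/g picks up three characters at a time; 1-2 leftover bases are dropped.
    my @codons = substr($seq, $offset) =~ /(...)/g;
    return @codons;
}

my $mrna = 'UACGCACUGUAC';
for my $offset (0 .. 2) {
    my @codons = codons_in_frame($mrna, $offset);
    print "Frame ", $offset + 1, ": @codons\n";
}
```

Because the offset is passed in as an argument, the same call can be repeated on the reverse strand to cover the remaining three frames.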

5.3 PROGRAM NO.3

do{

print"*" x 50;

print "\nEnter E for ESSENTIAL AMINO ACIDS\n";

print "Enter N for NONESSENTIALS\n";

print"*" x 50;

$a=<stdin>;

chomp($a);

if ($a eq 'E')

{

print "Isoleucine(I)\n

Leucine(L)\n

Lysine(K)\n

Methionine(M)\n

Phenylalanine(F)\n

Threonine(T)\n

Tryptophan(W)\n

Valine(V)\n

Arginine(R)\n

Histidine(H)\n";

}

if($a eq 'N')

{

print "Alanine(A)\n

Asparagine(N)\n

Aspartate(D)\n

Cysteine(C)\n

Glutamate(E)\n

Glutamine(Q)\n

Glycine(G)\n

Proline(P)\n

Serine(S)\n

Tyrosine(Y)\n";

}

$b= <stdin>;

chomp($b);

if ($b eq 'I')

{

print "Isoleucine\n

Chemical formula: C6H13NO2\n

Molecular mass: 131.18 g•mol-1\n

Systematic name:\n

(2S,3S)-2-amino-3-methylpentanoic acid\n

Abbreviations: I, Ile\n

Synonyms:\n

{2/α}-amino-{3/β}-methylvaleric acid\n

3-methyl-{/erythro-}norvaline\n

Amino-sec-butyl-acetic acid\n

Amino(1-methylpropyl)-acetic acid\n";

}

if ($b eq 'L')

{

print"Leucine\n

Chemical formula: C6H13NO2\n

Molecular mass: 131.18 g•mol-1\n

Systematic name:\n

(S)-2-amino-4-methyl-pentanoic acid\n

Abbreviations: L, Leu\n

Synonyms:\n

{(S)-/L-}2-amino-4-methylvaleric acid\n

4-methyl-norvaline\n

α-aminoisocaproic acid\n";

}

if ($b eq 'K')

{

print"Lysine\n

Systematic name (S)-2,6-Diaminohexanoic acid\n

Abbreviations Lys,k\n

Chemical formula C6H14N2O2\n

Molecular mass 146.19 g/mol\n

PubChem 876\n

Melting point 224 °C\n";

}

if ($b eq 'M')

{

print"Methionine\n

Systematic name (S)-2-amino-4-(methylsulfanyl)-\n

butanoic acid\n

Abbreviations Met,m\n

Chemical formula C5H11NO2S\n

Molecular mass 149.21 g mol-1\n

Melting point 281 °C\n";

}

if ($b eq 'F')

{

print "Phenylalanine\n

Systematic name 2-Amino-3-phenyl-propanoic acid\n

Abbreviations Phe,F\n

Chemical formula C9H11NO2\n

Molecular mass 165.19 g mol-1\n

Melting point 283 °C\n";

}

if ($b eq 'T')

{

print" Threonine\n

Systematic name (2S,3R)-2-Amino-3-hydroxybutanoic acid\n

Abbreviations Thr,T\n

Chemical formula C4H9NO3\n

Molecular mass 119.12 g mol-1\n

Melting point 256 °C\n";

}

if ($b eq 'W')

{

print" Tryptophan\n

Systematic name (S)-2-Amino-3-(1H-indol-3-yl)-propionic acid\n

Abbreviations Trp,W\n

Chemical formula C11H12N2O2\n

Molecular mass 204.23 g mol−1\n

Melting point 289 °C\n";

}

if ($b eq 'V')

{

print" Valine\n

Systematic name (S)-2-amino-3-methyl-butanoic acid\n

Abbreviations Val,V\n

Chemical formula C5H11NO2\n

Molecular mass 117.15 g mol-1\n

Melting point 315 °C\n";

}

if ($b eq 'R')

{

print"Arginine\n

Systematic (IUPAC) name\n

2-amino-5-(diaminomethylideneamino)pentanoic acid\n

Chemical data\n

Formula C6H14N4O2\n

Mol. weight 174.2\n";

}

if ($b eq 'H')

{

print" Histidine\n

Systematic (IUPAC) name\n

2-amino-3-(3H-imidazol-4-yl)propanoic acid\n

Chemical data\n

Formula C6H9N3O2\n

Mol. weight 155.16\n";

}

if ($b eq 'A')

{

print" Alanine\n

Systematic (IUPAC) name\n

(S)-2-aminopropanoic acid\n

Chemical data\n

Formula C3H7NO2\n

Mol. weight 89.1\n";

}

if ($b eq 'N')

{

print"Asparagine\n

Systematic (IUPAC) name\n

(2S)-2-amino-3-carbamoyl-propanoic acid\n

Chemical data\n

Formula C4H8N2O3\n

Mol. weight 132.118\n";

}

if ($b eq 'C')

{

print "Cysteine\n

Systematic (IUPAC) name\n

(2R)-2-amino-3-sulfanyl-propanoic acid\n

Chemical data\n

Formula C3H7NO2S\n

Mol. weight 121.16\n";

}

if ($b eq 'D')

{

print"Aspartic acid\n

Systematic (IUPAC) name\n

(2S)-2-aminobutanedioic acid\n

Chemical data\n

Formula C4H7NO4\n

Mol. weight 133.10\n";

}

if ($b eq 'E')

{

print"Glutamic acid\n

Systematic (IUPAC) name\n

(2S)-2-aminopentanedioic acid\n

Chemical data\n

Formula C5H9NO4\n

Mol. weight 147.13\n";

}

if ($b eq 'Q')

{

print" Glutamine\n

Systematic (IUPAC) name\n

(2S)-2-amino-4-carbamoyl-butanoic acid\n

Chemical data\n

Formula C5H10N2O3\n

Mol. weight 146.15\n";

}

if ($b eq 'G')

{

print" Glycine\n

Systematic (IUPAC) name\n

aminoethanoic acid\n

Chemical data\n

Formula C2H5NO2\n

Mol. weight 75.07\n";

}

if ($b eq 'P')

{

print" Proline\n

Systematic name (S)-Pyrrolidine-2-carboxylic acid\n

Abbreviations Pro,P\n

Chemical formula C5H9NO2\n

Molecular mass 115.13 g mol-1\n

Melting point 221 °C\n";

}

if ($b eq 'S')

{

print" Serine\n

Systematic name (S)-2-amino-3-hydroxypropanoic acid\n

Abbreviations Ser,S\n

Chemical formula C3H7NO3\n

Molecular mass 105.09 g mol-1\n

Melting point 228 °C \n";

}

if ($b eq 'Y')

{

print"Tyrosine\n

Systematic name (S)-2-Amino-3-(4-hydroxy-phenyl)-propanoic acid\n

Abbreviations Tyr,Y\n

Chemical formula C9H11NO3\n

Molecular mass 181.19 g mol-1\n

Melting point 343 °C\n";

}

print "\nEnter Again press Y";

$y=<stdin>;

chomp($y);

}

while($y eq 'Y');

RESULT

ENTER E FOR ESSENTIAL AMINO ACIDS

ENTER N FOR NON ESSENTIALS

E

LIST OF ESSENTIAL AMINO ACIDS

I

Isoleucine

Chemical formula: C6H13NO2

Molecular mass: 131.18 g·mol-1

Systematic name:

(2S,3S)-2-amino-3-methylpentanoic acid

Abbreviations: I, Ile

Synonyms:

{2/α}-amino-{3/β}-methylvaleric acid

3-methyl-{/erythro-}norvaline

Amino-sec-butyl-acetic acid

Amino(1-methylpropyl)-acetic acid

To Start Again Press Y

COMMENT

AIM:TO DETERMINE THE INFORMATION ABOUT 20 AMINO ACIDS.

5.4 PROGRAM NO.4

do{

print"*" x 30;

print "\nEnter 1 for ADENINE\n";

print "Enter 2 for GUANINE\n";

print "Enter 3 for THYMINE\n";

print "Enter 4 for CYTOSINE\n";

print "ENTER 5 for URACIL\n";

print "ENTER YOUR CHOICE\n";

$a =<stdin>;

if($a==1)

{

print "ADENINE\n

Systematic (IUPAC) name 7H-purin-6-amine\n

Synonyms 6-aminopurine\n

Identifiers CAS number 73-24-5 PubChem 190\n

Chemical data\n

Formula C5H5N5\n

Mol. weight 135.127\n

SMILES NC1=NC=NC2=C1N=CN2\n

Physical data\n

Melt. point\n

360 - 365 °C (680 - 689 °F)\n";

}

if ($a==2)

{

print "GUANINE\n

Systematic name 2-amino-1H-purin-6(9H)-one\n

Other names 2-amino-6-oxo-purine,2-aminohypoxanthine\n

Molecular formula C5H5N5O\n

SMILES NC(NC1=O)=NC2=C1N=CN2\n

Molar mass 151.1261 g/mol\n

Appearance White amorphous solid\n

CAS number [73-40-5]\n

Melting point 360°C (633.15 K) deco.\n

Boiling point Sublimes\n";

}

if ($a==3)

{

print "THYMINE\n

Chemical name 5-Methylpyrimidine-2,4(1H,3H)-dione\n

Chemical formula C5H6N2O2\n

Molecular mass 126.11334 g/mol\n

Melting point 316 - 317 °C\n

CAS number 65-71-4\n

SMILES CC1=CNC(NC1=O)=O\n";

}

if ($a==4)

{

print "CYTOSINE\n

Chemical name 4-Aminopyrimidin-2(1H)-one\n

Chemical formula C4H5N3O\n

Molecular mass 111.102 g/mol\n

Melting point 320 - 325°C (decomp)\n

CAS number 71-30-7\n

SMILES NC1=NC(NC=C1)=O\n";

}

if ($a==5)

{

print "URACIL\n

Systematic name Pyrimidine-2,4(1H,3H)-dione\n

Other names Uracil, 2-oxy-4-oxy pyrimidine\n

Molecular formula C4H4N2O2\n

Molar mass 112.08676 g/mol\n

Appearance Solid\n

CAS number [66-22-8]\n

Melting point 335 °C (608 K)\n

Boiling point N/A\n

Acidity (pKa) basic pKa = -3.4\n

acidic pKa = 9.389\n";

}

print "\nTo Start Again press Y";

$y=<stdin>;

chomp($y);

}

while($y eq 'Y');

RESULT

Enter 1 for ADENINE

Enter 2 for GUANINE

Enter 3 for THYMINE

Enter 4 for CYTOSINE

ENTER 5 for URACIL

ENTER YOUR CHOICE

1

ADENINE

Systematic (IUPAC) name 7H-purin-6-amine

Synonyms 6-aminopurine

Identifiers CAS number 73-24-5 PubChem 190

Chemical data

Formula C5H5N5

Mol. weight 135.127

SMILES NC1=NC=NC2=C1N=CN2

Physical data

Melt. point

360 - 365 °C (680 - 689 °F)

To Start Again Press Y

COMMENT

AIM: TO DETERMINE THE INFORMATION ABOUT NUCLEOTIDES.

5.5 PROGRAM NO.5

print "ENTER THE AMINO ACID SEQUENCE\n";

$a=<stdin>;

chomp($a);

$x=length($a);

print "LENGTH:$x ";

@a=split('',$a);

$b= 0;

foreach $i(@a){

if($i eq 'G'){

$b = $b+75.07;

}

if($i eq 'A'){

$b = $b+89.09;

}

if($i eq 'V'){

$b = $b+117.15;

}

if($i eq 'L'){

$b = $b+131.18;

}

if($i eq 'I'){

$b = $b+131.18;

}

if($i eq 'S'){

$b = $b+105.09;

}

if($i eq 'T'){

$b = $b+119.12;

}

if($i eq 'C'){

$b = $b+121.15;

}

if($i eq 'M'){

$b = $b+149.21;

}

if($i eq 'F'){

$b = $b+165.19;

}

if($i eq 'Y'){

$b = $b+181.19;

}

if($i eq 'W'){

$b = $b+204.23;

}

if($i eq 'P'){

$b = $b+115.13;

}

if($i eq 'N'){

$b = $b+132.12;

}

if($i eq 'Q'){

$b = $b+146.15;

}

if($i eq 'D'){

$b = $b+133.10;

}

if($i eq 'E'){

$b = $b+147.13;

}

if($i eq 'K'){

$b = $b+146.19;

}

if($i eq 'H'){

$b = $b+155.16;

}

if($i eq 'R'){

$b = $b+174.20;

}

}

$c=$b-(18*($x-1));

print "The MOLECULAR WEIGHT of the sequence is $c";

RESULT

ENTER THE AMINO ACID SEQUENCE

AVLI

LENGTH:4 The MOLECULAR WEIGHT of the sequence is 414.6

COMMENT

AIM: TO DETERMINE THE MOLECULAR WEIGHT OF THE AMINO ACID SEQUENCE.
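The same calculation can be sketched with a hash that maps each one-letter code to its mass, using the masses from the program above and the same rule of subtracting 18 for every peptide bond. The names %mass and peptide_weight are made up for this sketch; letters that are not in the table are skipped, which is an assumption of the example rather than behaviour of the original program.

```perl
use strict;
use warnings;

# Free amino acid masses as used in the program above.
my %mass = (
    G => 75.07,  A => 89.09,  V => 117.15, L => 131.18, I => 131.18,
    S => 105.09, T => 119.12, C => 121.15, M => 149.21, F => 165.19,
    Y => 181.19, W => 204.23, P => 115.13, N => 132.12, Q => 146.15,
    D => 133.10, E => 147.13, K => 146.19, H => 155.16, R => 174.20,
);

# Sum the residue masses and subtract one water (taken as 18) per peptide bond.
sub peptide_weight {
    my ($seq) = @_;
    my ($total, $residues) = (0, 0);
    for my $aa (split //, uc $seq) {
        next unless exists $mass{$aa};   # skip anything that is not a known code
        $total += $mass{$aa};
        $residues++;
    }
    return 0 if $residues == 0;
    return $total - 18 * ($residues - 1);
}

printf "%.2f\n", peptide_weight('AVLI');   # 414.60
```

A peptide of n residues has n - 1 peptide bonds, which is where the ($residues - 1) factor comes from.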

5.6 PROGRAM NO.6

$b= <stdin>;

chomp($b);

if ($b eq 'G')

{

print " GLYCINE=C2H5NO2";

}

if ($b eq 'A')

{

print " ALANINE=C3H7NO2";

}

if ($b eq 'V')

{

print " VALINE=C5H11NO2";

}

if ($b eq 'L')

{

print " LEUCINE = C6H13NO2";

}

if ($b eq 'I')

{

print " ISOLEUCINE=C6H13NO2";

}

if ($b eq 'S')

{

print " SERINE = C3H7NO3";

}

if ($b eq 'T')

{

print " THREONINE = C4H9NO3";

}

if ($b eq 'C')

{

print " CYSTEINE = C3H7NO2S";

}

if ($b eq 'M')

{

print " METHIONINE = C5H11NO2S";

}

if ($b eq 'F')

{

print " PHENYLALANINE = C9H11NO2";

}

if ($b eq 'Y')

{

print " TYROSINE = C9H11NO3";

}

if ($b eq 'W')

{

print " TRYPTOPHAN = C11H12N2O2";

}

if ($b eq 'P')

{

print " PROLINE = C5H9NO2";

}

if ($b eq 'N')

{

print " ASPARAGINE = C4H8N2O3";

}

if ($b eq 'Q')

{

print " GLUTAMINE = C5H10N2O3";

}

if ($b eq 'D')

{

print " ASPARTIC ACID = C4H7NO4";

}

if ($b eq 'E')

{

print " GLUTAMIC ACID = C5H9NO4";

}

if ($b eq 'K')

{

print " LYSINE = C6H14N2O2";

}

if ($b eq 'H')

{

print " HISTIDINE = C6H9N3O2";

}

if ($b eq 'R')

{

print " ARGININE = C6H14N4O2";

}

RESULT

A

ALANINE= C3H7NO2

COMMENT

AIM: TO DETERMINE THE MOLECULAR FORMULA OF THE GIVEN AMINO ACID.

5.7 Program no. 7.

$a=<stdin>;

chomp($a);

print "original seq $a\n";

$rev= reverse $a;

print "reverse seq $rev\n";

# complement the original sequence, not the reversed one
($comp=$a)=~ tr/ATGC/TACG/;

print "complementary seq $comp\n";

$revcomp= reverse $comp;

print "reverse complementary seq $revcomp\n";

RESULT

ATGC

original seq ATGC

reverse seq CGTA

complementary seq TACG

reverse complementary seq GCAT

COMMENT

AIM: TO FIND THE REVERSE SEQUENCE, COMPLEMENTARY SEQUENCE, AND REVERSE COMPLEMENTARY SEQUENCE.
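When these operations are needed in more than one place, they can be wrapped in small subroutines that return a new string and leave the input untouched. This is a sketch under that assumption; the subroutine names are invented here, and lower-case bases are handled as well.

```perl
use strict;
use warnings;

# Return the complementary strand (A<->T, G<->C), keeping the letter case.
sub complement {
    my ($seq) = @_;
    (my $comp = $seq) =~ tr/ATGCatgc/TACGtacg/;   # work on a copy
    return $comp;
}

# Return the reverse complement: complement first, then reverse the string.
sub reverse_complement {
    my ($seq) = @_;
    return scalar reverse complement($seq);
}

my $dna = 'ATGC';
print "original            $dna\n";
print "reverse             ", scalar reverse($dna), "\n";
print "complement          ", complement($dna), "\n";
print "reverse complement  ", reverse_complement($dna), "\n";
```

The word scalar matters: in list context reverse would reverse a list of arguments instead of the characters of the string.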

5.8 PROGRAM NO.8

$a=<stdin>;

chomp($a);

$l= length($a);

@a= split('',$a);

$A =0;

$T =0;

$C =0;

$G =0;

foreach $i(@a){

if ($i eq 'A')

{

$A=$A+1;

}

if ($i eq 'T')

{

$T=$T+1;

}

if ($i eq 'C')

{

$C=$C+1;

}

if ($i eq 'G')

{

$G=$G+1;

}

}

print "Adenine = $A";

print "Cytosine= $C";

print "Guanine = $G";

print "Thymine = $T";

print "length= $l";

RESULT

ATCG

Adenine = 1Cytosine= 1Guanine = 1Thymine = 1length= 4

COMMENT

AIM: TO IDENTIFY THE NUMBER OF NUCLEOTIDES IN THE SEQUENCE.

5.9 PROGRAM NO 9

$a=<stdin>;

chomp($a);

$l= length($a);

@a= split('',$a);

$A =0;

$T =0;

$C =0;

$G =0;

foreach $i(@a){

if ($i eq 'A')

{

$A=$A+1;

}

if ($i eq 'T')

{

$T=$T+1;

}

if ($i eq 'C')

{

$C=$C+1;

}

if ($i eq 'G')

{

$G=$G+1;

}

}

print "Adenine = $A\n";

print "Cytosine= $C\n";

print "Guanine = $G\n";

print "Thymine = $T\n";

print "length= $l\n";

RESULT

ATCG

Adenine = 1

Cytosine= 1

Guanine = 1

Thymine = 1

length= 4

COMMENT

AIM: TO IDENTIFY THE NUMBER OF NUCLEOTIDES AND THE LENGTH OF THE SEQUENCE.
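Counting can also be done without splitting the sequence into an array, because the tr operator returns the number of characters it matched. The sketch below relies on that; base_counts is a name invented for the example, and upper- and lower-case letters are counted together.

```perl
use strict;
use warnings;

# Count each base with tr///, which returns how many characters matched.
# With an empty replacement list and no /d flag the string is not changed.
sub base_counts {
    my ($seq) = @_;
    my %count = (
        A => ($seq =~ tr/Aa//),
        T => ($seq =~ tr/Tt//),
        C => ($seq =~ tr/Cc//),
        G => ($seq =~ tr/Gg//),
    );
    return %count;
}

my %count = base_counts('ATCG');
print "Adenine = $count{A}\n";
print "Cytosine= $count{C}\n";
print "Guanine = $count{G}\n";
print "Thymine = $count{T}\n";
print "length= ", length('ATCG'), "\n";
```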

5.10 PROGRAM NO.10

$dna="D:/orf/as.txt";

# stop with a message if the file cannot be opened
open(DNA,$dna) or die "Cannot open $dna: $!";

@dna=<DNA>;

$dna=join('',@dna);

$dna=~s/\s//g;

print "$dna";

do{

sub A

{

$l= length($dna);

print"$l\n";

}

sub B

{

# work on copies so that $dna keeps the sequence read from the file

$rev= reverse ($dna);

print" reverse seq $rev\n";

}

sub C

{

($comp=$dna)=~ tr/ATGC/TACG/;

print"COMPLEMENTARY seq $comp\n";

}

sub D

{

($dn=reverse($dna))=~ tr/ATGC/TACG/;

print "reverse complementary $dn\n";

}

sub E

{

@a= split('',$dna);

$A =0;

$T =0;

$C =0;

$G =0;

foreach $i(@a){

if ($i eq 'A')

{

$A=$A+1;

}

if ($i eq 'T')

{

$T=$T+1;

}

if ($i eq 'C')

{

$C=$C+1;

}

if ($i eq 'G')

{

$G=$G+1;

}

}

print "Adenine = $A\n";

print "Cytosine= $C\n";

print "Guanine = $G\n";

print "Thymine = $T\n";

$e= ($A* 313.21) + ($T* 288.20) + ($G * 329.21) + ($C* 289.19) - (18.02);

print "$e";

}

print"\nenter 1 for length";

print"\nenter 2 for reverse\n";

print"enter 3 for complimentary\n";

print"enter 4 for reverse complimentary\n";

print"enter 5 for molecular weight of the sequence\n";

$a =<stdin>;

if($a==1)

{

A;

}

if($a==2)

{

B;

}

if($a==3)

{

C;

}

if($a==4)

{

D;

}

if($a==5)

{

E;

}

print "\nTo Start Again Press Y";

$y=<stdin>;

chomp($y);

}

while($y eq 'Y');

RESULT

enter 1 for length

enter 2 for reverse

enter 3 for complimentary

enter 4 for reverse complimentary

enter 5 for molecular weight of the sequence

5

ADENINE=2

THYMINE=2

GUANINE=2

CYTOSINE=2

MOL. WT.= 2421.6

To Start Again Press Y

COMMENT

AIM: TO DETERMINE THE LENGTH, REVERSE, COMPLEMENTARY, REVERSE COMPLEMENTARY AND MOLECULAR WEIGHT OF THE GIVEN DNA SEQUENCE USING FILE HANDLING.
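The file-reading step on its own can be sketched as follows. To keep the example self-contained it opens an in-memory string instead of a file on disk: $file_text is a hypothetical stand-in for the contents of a sequence file, and the three-argument form of open with a lexical filehandle is used as a common alternative to the bareword handle in the program.

```perl
use strict;
use warnings;

# Hypothetical file contents: a sequence split over two lines.
my $file_text = "ATGC\nGGTA\n";

# Opening a reference to a string gives an in-memory "file";
# with a real file the third argument would be the file name.
open(my $fh, '<', \$file_text) or die "Cannot open sequence: $!";
my @lines = <$fh>;        # read all lines at once
close($fh);

my $dna = join('', @lines);
$dna =~ s/\s//g;          # remove newlines and any other whitespace

print "$dna\n";                 # ATGCGGTA
print length($dna), "\n";       # 8
```

Checking the return value of open with "or die" means a wrong path is reported at once instead of silently producing an empty sequence.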

6.0 APPENDIX

6.1 What is Perl?

Perl is a high-level programming language with an eclectic heritage written by Larry Wall and a cast of thousands. It derives from the ubiquitous C programming language and to a lesser extent from sed, awk, the Unix shell, and at least a dozen other tools and languages. Perl's process, file, and text manipulation facilities make it particularly well-suited for tasks involving quick prototyping, system utilities, software tools, system management tasks, database access, graphical programming, networking, and world wide web programming. These strengths make it especially popular with system administrators and CGI script authors, but mathematicians, geneticists, journalists, and even managers also use Perl.

6.2 Variables & Data Types

A variable is a named location in memory that is used to hold data that may be modified by the program. Perl has three schemes for keeping data during program execution: scalars, arrays of scalars (also known as lists), and hashes. Arrays are groups of scalars indexed by number, while hashes are indexed by strings.

Scalars

The most basic kind of data structure in Perl is the scalar variable. Scalar variables can hold both strings and numbers.

$bodytemp = 98.6;

sets the scalar variable $bodytemp to 98.6, but you can also assign a string to exactly the same variable:

$bodytemp = 'normal';

Perl will also accept numbers as strings,

$bodytemp = '098.6';

and still performs arithmetic and other operations on them.

Arrays

An array variable is a list of scalars, hence in Perl they are often referred to as lists. They have the same format as scalar variables except that they are prefixed by an @ symbol. The following statements:

@valine = ("gtg", "gtt", "gta", "gtc");

@hydrophobics = ("valine", "leucine", "isoleucine");

@weights = (117.15, 131.17, 131.17);

assign a four-element list to the array variable @valine and a three-element list to each of the array variables @hydrophobics and @weights.
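A brief sketch of how such arrays are typically read back, using the three arrays from the statements above. Single elements are written with a $ prefix and an index in square brackets, and indexes start at 0.

```perl
use strict;
use warnings;

my @valine       = ("gtg", "gtt", "gta", "gtc");
my @hydrophobics = ("valine", "leucine", "isoleucine");
my @weights      = (117.15, 131.17, 131.17);

print "First valine codon: $valine[0]\n";            # index 0 is the first element
print "Number of codons: ", scalar(@valine), "\n";   # array in scalar context = its size
print "Last index: ", $#valine, "\n";                # always size - 1

# Walk two parallel arrays by index.
for my $i (0 .. $#hydrophobics) {
    print "$hydrophobics[$i] weighs $weights[$i]\n";
}
```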

Hashes

Basically, hashes are arrays that are accessed by a string. They are also referred to as associative arrays. To define a hash we can use the usual parenthesis notation, but the hash itself is prefixed by a % sign. Suppose we want to store all the hydrophobic amino acids with their molecular weights in a single data structure. It would look like this:

%molyweights = ("valine", 117.15,

"leucine", 131.17,

"isoleucine", 131.17);

Assigning the hash to an array, as in @data = %molyweights; gives a list array @data that has an element for every string and scalar in the hash %molyweights.
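The hash above can be looked up by name, as this minimal sketch shows. It uses the => form of the comma to make the pairs easier to read; the loop variable $name is chosen only for the example.

```perl
use strict;
use warnings;

my %molyweights = (
    valine     => 117.15,
    leucine    => 131.17,
    isoleucine => 131.17,
);

# A single value is fetched with $hash{key}.
print "leucine: $molyweights{leucine}\n";

# keys returns the names in no particular order, so sort them for display.
for my $name (sort keys %molyweights) {
    printf "%-10s %.2f\n", $name, $molyweights{$name};
}

# Flattening the hash gives alternating keys and values: 3 pairs, 6 elements.
my @data = %molyweights;
print scalar(@data), "\n";   # 6
```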

6.3 Quotes & Strings

Special characters that can be used inside double-quoted strings:

\t tab

\n newline

\b backspace

\a alarm (bell)

\$ literal $

\@ literal @

\\ literal \

6.4 Operators

Precedence

Use parentheses when in doubt.

Arithmetic Operators

Math in Perl

x**y exponentiation

-x negation

x/y division

x*y multiplication

x+y addition

x-y subtraction

Auto-increment and Auto-decrement: ++ and -- work as increment and decrement.

Assignment Operators: = is the ordinary assignment operator.

String Operators: . concatenates two strings. For example,

$a = 'winter'.'green'; # $a is wintergreen

6.5 Testing

 Primarily for Numeric Comparison

== TRUE if the left argument is numerically equal to the

right argument; otherwise FALSE.

!= TRUE if the left argument is numerically not equal to

the right argument; otherwise FALSE

< TRUE if the left argument is numerically less than the

right argument; otherwise FALSE

> TRUE if the left argument is numerically greater than

the right argument; otherwise FALSE

<= TRUE if the left argument is numerically less than or

equal to the right argument; otherwise FALSE

>= TRUE if the left argument is numerically greater than

or equal to the right argument; otherwise FALSE

<=> returns -1, 0, or 1 depending on whether the left

argument is numerically less than, equal to, or greater

than the right argument
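A brief sketch of the numeric operators in use follows. Perl has no TRUE and FALSE constants, so the results are turned into text here; the array of weights and the variable names are illustrative.

```perl
use strict;
use warnings;

my @weights = (131.17, 117.15, 131.17);

my $same    = ($weights[0] == $weights[2]) ? "equal" : "different";
my $lighter = ($weights[1] <  $weights[0]) ? "yes"   : "no";
my $order   = 117.15 <=> 131.17;   # -1, because the left side is smaller

# <=> is the usual comparison for sorting numbers in ascending order.
my @sorted = sort { $a <=> $b } @weights;

print "$same $lighter $order\n";   # equal yes -1
print "@sorted\n";                 # 117.15 131.17 131.17
```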

 

Primarily for String and Character Comparison

eq returns TRUE if the left argument is stringwise equal to

the right argument; otherwise FALSE

ne returns TRUE if the left argument is stringwise not equal

to the right argument; otherwise FALSE

lt returns TRUE if the left argument is stringwise less than

the right argument; otherwise FALSE

gt returns TRUE if the left argument is stringwise greater

than the right argument; otherwise FALSE

le returns TRUE if the left argument is stringwise less than

or equal to the right argument; otherwise FALSE

ge returns TRUE if the left argument is stringwise greater

than or equal to the right argument; otherwise FALSE

cmp returns -1, 0, or 1 depending on whether the left

argument is stringwise less than, equal to, or greater than

the right argument
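The sketch below applies the string operators to codon and amino acid names (illustrative data). It also shows a common pitfall: numbers compared as strings are ordered character by character, so the choice between the numeric and the string operators matters.

```perl
use strict;
use warnings;

my $codon = "gtg";

my $is_gtg = ($codon eq "gtg") ? 1 : 0;    # 1
my $before = ("gta" lt "gtc")  ? 1 : 0;    # 1: "a" sorts before "c"
my $order  = "leucine" cmp "valine";       # -1

# Compared as strings, "10" comes before "9" because "1" sorts before "9".
my $string_order  = "10" cmp "9";          # -1
my $numeric_order = 10 <=> 9;              # 1

# cmp is the usual comparison for sorting strings.
my @sorted = sort { $a cmp $b } ("valine", "leucine", "isoleucine");

print "@sorted\n";   # isoleucine leucine valine
```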

6.6 Boolean Expressions

You can also use logical AND, OR and NOT to create more complex expressions:

($a && $b) Are $a AND $b TRUE ?

($a || $b) Is either $a OR $b TRUE ?

!($a) Is $a FALSE ?
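As a hedged illustration, the logical operators can combine simple tests on a sequence. The sequence and the variable names below are assumptions made for this example.

```perl
use strict;
use warnings;

my $seq = "atggtgaaataa";

# Two simple tests: whole number of codons, and a start codon at the front.
my $whole_codons = (length($seq) % 3 == 0);
my $has_start    = (substr($seq, 0, 3) eq "atg");

my $both   = ($whole_codons && $has_start)      ? "yes" : "no";   # yes
my $either = (length($seq) > 100 || $has_start) ? "yes" : "no";   # yes
my $not    = (!$has_start)                      ? "yes" : "no";   # no

print "$both $either $not\n";   # yes yes no
```

In Perl the values undef, 0, "0" and the empty string count as false; every other scalar value counts as true.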

6.7 Important Perl Functions

Any function in the list below may be used either with or without parentheses around its

arguments.

Input and Output

print - output a list to the screen or a file

SYNOPSIS

print FILEHANDLE LIST

print LIST

open - open a file

SYNOPSIS

open FILEHANDLE, FILENAME

close - close a file

SYNOPSIS

close FILEHANDLE

close
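A minimal sketch of the three functions working together is given below. It uses the three-argument form of open with lexical filehandles ($out, $in), which differs from the two-argument synopsis above but is also accepted by Perl; the file name is hypothetical.

```perl
use strict;
use warnings;

my $file = "example_sequence.txt";   # hypothetical file name

# Open for writing; '>' creates the file or overwrites an existing one.
open(my $out, '>', $file) or die "Cannot write $file: $!";
print $out "ATGGCGTAA\n";            # print FILEHANDLE LIST (no comma after the handle)
close($out);

# Open the same file for reading and fetch its first line.
open(my $in, '<', $file) or die "Cannot read $file: $!";
my $line = <$in>;
close($in);

chomp $line;                         # strip the trailing newline
print "Read back: $line\n";          # print LIST writes to the screen

unlink $file;                        # remove the example file again
```

Checking the return value of open with "or die" reports a missing file or a permission problem at once, instead of failing silently later.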

String Functions

length - return the number of characters in a string

SYNOPSIS

length EXPR

reverse - reverse a string or a list

SYNOPSIS

reverse STRING

reverse LIST

substr - get or alter a portion of a string

SYNOPSIS

substr EXPR,OFFSET,LEN,REPLACEMENT

substr EXPR,OFFSET,LEN

substr EXPR,OFFSET

index - left-to-right substring search

SYNOPSIS

index STR, SUBSTR, POSITION

index STR, SUBSTR

rindex - right-to-left substring search

SYNOPSIS

rindex STR,SUBSTR,POSITION

rindex STR,SUBSTR
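The string functions above can be sketched on a short DNA sequence as follows. The sequence and variable names are illustrative. Positions are counted from 0, and index and rindex return -1 when the substring is not found.

```perl
use strict;
use warnings;

my $dna = "ATGGCGTAA";

my $len      = length($dna);            # 9
my $reversed = scalar reverse($dna);    # "AATGCGGTA" (scalar context reverses the string)
my $start    = substr($dna, 0, 3);      # "ATG"
my $rest     = substr($dna, 3);         # "GCGTAA"
my $first_g  = index($dna, "G");        # 2
my $next_g   = index($dna, "G", 3);     # 3: search starts at position 3
my $last_g   = rindex($dna, "G");       # 5
my $missing  = index($dna, "X");        # -1

# The four-argument form replaces the selected portion in place.
my $copy = $dna;
substr($copy, 0, 3, "CTG");             # $copy is now "CTGGCGTAA"

print "$len $reversed $start $rest $first_g $next_g $last_g $missing $copy\n";
```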

Numeric Functions

abs - absolute value function

cos - cosine function

exp - raise e to a power

int - get the integer portion of a number

log - retrieve the natural logarithm for a number

sin - return the sine of a number

sqrt - square root function
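A short sketch of these functions follows, with illustrative variable names. Two points worth noting: int truncates toward zero rather than rounding, and log is the natural logarithm, so a base-10 logarithm has to be computed by a change of base.

```perl
use strict;
use warnings;

my $abs   = abs(-3.7);               # 3.7
my $trunc = int(-3.7);               # -3: truncated toward zero, not rounded
my $root  = sqrt(16);                # 4
my $e     = exp(1);                  # approximately 2.71828
my $ln    = log($e);                 # approximately 1 (natural logarithm)
my $log10 = log(1000) / log(10);     # approximately 3 (change of base)
my $cos0  = cos(0);                  # 1
my $sin0  = sin(0);                  # 0

printf "%.1f %d %d %.4f %.1f %.1f\n", $abs, $trunc, $root, $e, $ln, $log10;
```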

6.7.1 Metacharacters

Metacharacters broaden the capabilities of a pattern so that it can match many different

strings or match only at specific locations. The following are recognized:

. Match any character (except newline)

^ Match the beginning of the line

$ Match the end of the line (or before newline at the end)

| Alternation

( ) Grouping

[ ] Character class

metacharacters
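The metacharacters above might be used on a DNA sequence as in the sketch below. The sequence, the patterns and the variable names are illustrative assumptions; the =~ operator applies a pattern to a string.

```perl
use strict;
use warnings;

my $seq = "ATGGCGTAA";

# ^ anchors the match at the beginning of the string.
my $starts_atg = ($seq =~ /^ATG/) ? 1 : 0;             # 1

# ( ) groups, | separates alternatives, $ anchors at the end.
my $ends_stop  = ($seq =~ /(TAA|TAG|TGA)$/) ? 1 : 0;   # 1
my $stop_codon = $1;                                   # "TAA", captured by the group

# [ ] matches any one of the listed characters.
my $purine_c   = ($seq =~ /[AG]C/) ? 1 : 0;            # 1: "GC" is present

# . matches any single character.
my $g_any_g    = ($seq =~ /G.G/) ? 1 : 0;              # 1: "GCG" is present

my $starts_t   = ($seq =~ /^T/) ? 1 : 0;               # 0: the sequence starts with A

print "$starts_atg $ends_stop $stop_codon $purine_c $g_any_g $starts_t\n";
```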

6.7.2 Character Classes

Perl also provides some predefined character classes. The following can be used in place of their bracketed alternatives:

\w Match a "word" character [a-zA-Z_0-9]

\W Match a non-word character [^a-zA-Z_0-9]

\s Match a whitespace character [ \t\n\r\f]

\S Match a non-whitespace character [^ \t\n\r\f]

\d Match a digit character [0-9]

\D Match a non-digit character [^0-9]

predefined character classes
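As a sketch of the predefined classes in practice, the example below picks fields out of a made-up sequence header line and strips whitespace from a sequence. The header text and all variable names are hypothetical.

```perl
use strict;
use warnings;

my $header = ">seq1 Homo sapiens 1024 bp";   # hypothetical header line

# \w+ captures the run of word characters after the ">".
my ($id) = $header =~ /^>(\w+)/;             # "seq1"

# \d+ captures the digits that are followed by whitespace and "bp".
my ($length) = $header =~ /(\d+)\s+bp$/;     # "1024"

# \s matches spaces, tabs and newlines; here all of them are removed.
my $raw = "ATG GCG\tTAA\n";
(my $clean = $raw) =~ s/\s//g;               # "ATGGCGTAA"

# \D finds a non-digit; \S finds a non-whitespace character.
my $has_nondigit = ("12a4" =~ /\D/) ? 1 : 0; # 1
my $has_visible  = ("  \t " =~ /\S/) ? 1 : 0; # 0

print "$id $length $clean $has_nondigit $has_visible\n";
```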

6.7.3 Quantifiers

* Match 0 or more times, same as {0,}

+ Match 1 or more times, same as {1,}

? Match 1 or 0 times, same as {0,1}

{n} Match exactly n times

{n,} Match at least n times

{n,m} Match at least n but not more than m times

quantifiers
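The quantifiers can be sketched on a short sequence as follows; the sequence and the variable names are assumptions for illustration. Quantifiers match as many repetitions as possible by default.

```perl
use strict;
use warnings;

my $seq = "ATGAAAAAAGGGTAA";

my ($poly_a) = $seq =~ /(A{3,})/;     # "AAAAAA": at least three A's, as many as possible
my ($g_run)  = $seq =~ /(G{2,3})/;    # "GGG": two or three G's
my ($opt)    = $seq =~ /(TA?G)/;      # "TG": the A is optional and absent here
my ($plus)   = $seq =~ /(GT+A)/;      # "GTA": one or more T's between G and A
my ($star)   = $seq =~ /(ATGC*)/;     # "ATG": C is matched zero times

# {3} repeated one or more times: the whole sequence is made of triplets.
my $in_frame = ($seq =~ /^([ACGT]{3})+$/) ? 1 : 0;   # 1: 15 bases, 5 codons

print "$poly_a $g_run $opt $plus $star $in_frame\n";
```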

if - conditional branching

SYNTAX

if (EXPR) {BLOCK}

if (EXPR) {BLOCK} else {BLOCK}

if (EXPR) {BLOCK} elsif (EXPR) {BLOCK} ... else {BLOCK}
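A minimal sketch of the three forms in one chain is shown below. The thresholds and labels are invented for illustration and carry no biochemical meaning.

```perl
use strict;
use warnings;

my $weight = 131.17;   # molecular weight of leucine, as used earlier
my $class;

# The tests are tried in order; the first true one decides the branch.
if ($weight < 100) {           # hypothetical threshold
    $class = "light";
}
elsif ($weight < 150) {        # hypothetical threshold
    $class = "medium";
}
else {
    $class = "heavy";
}

print "$class\n";   # medium
```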

for - C-style looping structure

SYNTAX

for (INITIALIZE; TEST; INCREMENT) {BLOCK}

foreach - iterates over a list

SYNTAX

foreach VAR (LIST) {BLOCK}
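The two looping forms can be compared on the same list, as in this sketch (array contents and variable names are illustrative). The C-style for is convenient when the index itself is needed; foreach is usually simpler when only the elements matter.

```perl
use strict;
use warnings;

my @codons = ("gtg", "gtt", "gta", "gtc");

# C-style for: the loop variable is the index.
my @indexed;
for (my $i = 0; $i < @codons; $i++) {
    push @indexed, "${i}:$codons[$i]";
}

# foreach: the loop variable is each element in turn.
my @upper;
foreach my $codon (@codons) {
    push @upper, uc($codon);
}

print "@indexed\n";   # 0:gtg 1:gtt 2:gta 3:gtc
print "@upper\n";     # GTG GTT GTA GTC
```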

while - loop structure

SYNTAX

while (EXPR) {BLOCK}

do {BLOCK} while (EXPR)

until - loop structure

SYNTAX

until (EXPR) {BLOCK}

do {BLOCK} until (EXPR)
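The sketch below contrasts the loop forms, using illustrative names and data. while repeats as long as the test is true, until repeats as long as it is false, and the do {BLOCK} forms run the block once before the test is first checked.

```perl
use strict;
use warnings;

my $seq = "ATGGCGTAA";

# while: take codons as long as three more bases are available.
my @codons;
my $pos = 0;
while ($pos + 3 <= length($seq)) {
    push @codons, substr($seq, $pos, 3);
    $pos += 3;
}

# until: loop as long as the condition is false.
my $countdown = 3;
my @ticks;
until ($countdown == 0) {
    push @ticks, $countdown;
    $countdown--;
}

# do {BLOCK} while: the block runs once even though the test is false.
my $runs = 0;
do {
    $runs++;
} while ($runs < 0);

print "@codons\n";   # ATG GCG TAA
print "@ticks\n";    # 3 2 1
print "$runs\n";     # 1
```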

7.0 CONCLUSION

We have only touched the tip of the iceberg here. Beyond pure Perl projects, we could also manage joint C and Perl projects under this infrastructure. The infrastructure is built in Perl, which means that it is extremely portable, running on platforms ranging from Linux to Windows to S/390. Once we get used to this infrastructure, we will find it invaluable for all the projects we work on. We will never have to write an install script again, and through the use of well-formed test cases, we can have a far higher level of confidence that our program is performing the way it was intended.

Perl scripts which build dynamic data for a web site, and are already coded to return HTML data, can benefit from offering PDF output options to users. Such scripts can rely on the external program HTMLDOC, which already does all the hard work of transforming HTML into PDF.

We're the first to admit that calling HTMLDOC externally is not the most elegant solution in the world. Sometimes, though, sheer functionality and the smiles on your users' faces are worth more than any elegance!
