a comprehensive package for dna-sequence analysis in fortran-iv for the pdp-11

Volume 14 Number 1 1986 Nucleic Acids Research

A comprehensive package for DNA sequence an%iysis in FORTRAN IV for the PDP-I1

Jonathan Arnold, Virginia K.Eckenrodel, Klaus Lemke, Gregory J.Phillips and Stephen W.Schaeffer

Department of Genetics, University of Georgia, Athens, GA 30602, USA

Received 26 November 1984

ABSTRACTA computer package written in Fortran-IV for the PDP-11 minicomputer is

described. The package's novel features are: (i) software for voice-entryof sequence data; (ii) a less memory intensive algorithm for optimal se-quence alignment; (iii) and programs that fit statistical models to nucleicacid and protein sequences.

INTRODUCTION

The widespread use of rapid methods for DNA sequence analysis (32, 41)

and the availability of over five thousand published DNA sequences in

computerized data bases, such as GENBANK (TM), has mandated the development

of computer packages for the collection (36), organization, and statistical

analysis of these nucleic acid sequences (44,45,46,58). We present one such

package named DNASEQ developed at the University of Georgia. The package

contains several novel features which include: (i) software for the voice-

entry of nucleic acid or protein sequence data; (ii) a less memory-intensive

version of the Needleman and Wunsch (34) algorithm for sequence alignment;

and (iii) programs that fit statistical models to nucleic acid and protein

sequences.

SOFTWARE DESIGN AND HARDWARE REQUIREMENTS

The programs listed below in the package DNASEQ are written in Digital

Equipment Corporation's (DEC) FORTRAN-IV, Version 2.5, to run under DEC's

RSX-11M (TM) operating system, Version 4.0, on a PDP-11/34A (TM) minicomput-

er. Each program executes a single basic operation and can be used as a

building block in a sequence of operations. Thus, the output of one program

Source programs will be provided upon receipt of a self-addressed mailinglabel, a blank 9-track tape, and $60 check to cover processing.

© I R L Press Limited, Oxford, England. 239

Nucleic Acids Research

serves as the input to another program. Each program is written to run

autonomously. The command file DNASEQ.CMD invokes the indirect command

processor in RSX-11M and ties the programs together into a menu-driven

package. The experienced user can bypass the command file, and hence the

menu, to run the programs directly. Through the use of direct access files

and the Fortran-77 compiler, sequence size limitations of nearly all the

programs have been removed.

The driver for the menu utilizes RSX-IIM utilities, such as the EDT

editor. The result is that the package DNASEQ is tied into and makes

extensive use of the RSX-11M operating system. The FORTRAN-IV programs are

written to be nearly machine-independent. Parts of the programs employing

special features of the RSX-11M operating system and its overlay product

FORTRAN-IV, V2.5, are clearly indicated in the programs. For example, the

occasional use of virtual arrays is made. The programs are similar to those

of Staden (47,48,49), Gingeras, Milazzo, Sciaky, and Roberts (18),

Stockwell, (54), Devereux et al. (8), Queen and Korn (40), and Keller et al.

(24).All the programs in the package DNASEQ share the same common block

structure to allow the user to modify easily or to add to the package. The

programs check all user responses and instruct the user when a mistake is

made.

The hardware requirements are a DEC PDP-11 minicomputer operating on

the UNIBUS (TM) with at least a floppy disk drive and one terminal (e.g.,DEC VT100, LA34, or LA120). The PDP-11 should be operating under RSX-11M,V4.0, with the overlay product, FORTRAN-IV, V2.5. To operate the voice

entry software, an Interstate Electronics Voice Recognition System (VRS)SYS300 is needed. The VRS uses software written in FORTRAN-77 and MACRO.

The FORTRAN software calls a MACRO, object file library called DIRECTIVE.OLB

created from the RSX1lM operating system's (V4.0) executive library. To

operate some of the graphics software, such as the program DOTMATRIX de-

scribed below, an Advanced Electronics Design (AED) 512 color graphics

terminal is needed together with its collection of FORTRAN callable subrou-

tines called TAP.

PROGRAMS

The main features of the package are programs for the voice entry of

sequence data, several programs for the exploratory analysis of sequence

data such as HOMOLOGY and ENVELOPE, and a program called MARKOV for modeling

240


sequences. The package is designed to work in conjunction with GENBANK

through the programs EDIT and GBKFORMAT. The starred (*) programs can be

run on nucleic acid or protein sequences. All other programs can only be

run on nucleic acid sequences.

MENU

This command calls up a brief description of each of the programs

available in DNASEQ.

DIRECTORY

This command calls the DEC system utility PIP (10) to obtain a list of

files on one or more user specified accounts.'C

DATAENTRY

Voice entry of sequence data, using the SYS300 voice recognition system

(VRS), is more accurate than keyboard entry and is faster than entry through

a digitizing tablet (2).

The SYS300 voice recognition system is compatible with almost any

terminal and the PDP-11. The programs in the submenu DATAENTRY send com-

mands to the VRS to permit voice entry of data. This submenu calls five

programs, UPLOAD, DOWNLOAD, TRAIN, TEST and ENTRY. The program TRAIN,

trains the VRS to recognize a user's pronunciation of a user-selected

vocabulary, such as the amino acids or nitrogenous bases. The VRS's imprint

of the user's voice during training is stored as a reference pattern

against which utterances are compared during data entry via the program

ENTRY. Program UPLOAD transfers a vocabulary and corresponding reference

pattern to the PDP-11/34A for permanent storage, so when the user turns off

the VRS, he can return at a later time and recall his vocabulary and refer-

ence pattern with the program DOWNLOAD. He can continue to enter more data

with the program ENTRY. The program TEST allows the user to examine the

quality of voice recognition and thus to modify the vocabulary to achieve

more faithful recognition.

The user selects the stringency with which an utterance is recognized,

and the data is stored in the same format as a sequence file created by the

program FORMAT described below. An example of a conversation with the

program DATAENTRY is given in Figure 1.

FORMAT

This program takes a user-entered sequence in varying format with

arbitrarily placed spaces and varying line length, and reformats it to be

compatible with all programs in the package. Program FORMAT is run on a

sequence file prior to running any of the other programs.

241


COMMOS FOR VOICE ENTRY OF SEGUENCE DATAINTERSTATE ELECTRONICS VOICE RECOGNITION SYSTEM (VRS)

SYS 300-COMMAND- -FUNCTION-UPLOAD SAVE VOCABULARYDOWNLOAD RETRIEVE VOCABULARYTRAIN IRAIN VRSTEST TEST VOCABULARY RECOGNITIONENTRY CREATE DATA FILESTOP NO MORE VOICE ENTRY

*ENTER A COMMAND: TRAINDirectory of available vocabularies:

AMINOACID.VOCBASE.VOC

What is Your choice: BASE.VOCNo. of vocabulary words stored in memory: 11

TRAINING SECTIONsay BaseSay Adeninesay GuanineSay Thyminesay UracilSay Cvtosine

TRAIN IS COMPLETED.

ENTER A COMMAND: ENTRYEnter a "Diskfilename.DATO UVRD.DATTURN ON THE MICROPHONE.

ATTGATATAATCAGCAAATCTGTATATATACCCAG

TRANSFER COMPLETE

*ENTER A COMMAND: UPLOADEnter a "Diskfile.VRT": UVRD.VRTThe information is stored in UVRD.JRTUpload is oomrleted

*ENTER A COMND: STOPTT22---STOP

Figure 1. Typical conversation with the program DATAENTRY. (See text fordetails).

*PRINT

This program takes a formatted sequence file and prints the sequence in

several formats. The number of letters on a line, the frequency of line

numbers, and the segment printed can be selected. The program will allow

the user to print letters in blocks sizes of 1 to 10. The output can be

routed to a file and saved for later use.

GBKFORMAT

This program reformats a sequence extracted from GENBANK and creates a

242


new file for use by the package DNASEQ. Reformatting involves removing some

header and tail information surrounding sequences in GENBANK, removing line

numbers and blanks interspersed in the GENBANK sequence file, and adjusting

the number of letters per line.

EDIT

The command EDIT in the menu accesses the RSX-11M operating system's

full screen editor EDT (9) from the package. The package then capitalizes

on existing software on the PDP-11. The INCLUDE command described in the

EDT editor manual (9) allows access to all of GENBANK. Other EDT editor

commands such as CUT, PASTE and FIND are useful in manipulating the sequence

data.

COMPLMENT

The program COMPLMENT takes a formatted nucleic acid sequence file and

converts all letters to their complement (A to T(U) or G to C). The result-

ing sequence is routed to a new file.

REVERSE

This program takes a formatted sequence X(l),X(2),...,X(n) and reverses

its order to X(n),X(n-l),...,X(1). If a user first runs the program COMPL-

MENT on a sequence A to generate sequence B and then runs the program

REVERSE on sequence B to generate sequence C, then sequence C will be

complementary to sequence A, written in the same order (i.e., 5' to 3').TRANSLATE

The program TRANSLATE performs several operations on formatted nucleic

acid sequences: (i) finds the longest amino acid coding region within a

sequence (open reading frame); (ii) displays all open reading frames; (iii)translates a specified region of a nucleic acid sequence into amino acids;

(iv) prints a translated sequence; (v) saves the specified region in a

specified reading frame as input to the program USAGE for the determination

of codon usage.

CONVERT

This program converts a protein sequence into either the standard three

letter or one letter amino acid code. The program CONVERT takes the output

from the program TRANSLATE and converts it into a form usable by the pro-

grams HOMOLOGY and ENVELOPE.

USAGE

The program USAGE takes a nucleic acid sequence file created by the

program TRANSLATE and determines codon usage. The program also tests

whether or not the observed amino acid frequencies fit those expected based

243


on the hypothesis of equal frequencies for synonomous codons versus the

hypothesis of different expected frequencies of synonomous codons. The

degeneracy of the genetic code predicts equal frequencies for synonomous

codons. Lewin ((29), p. 70, Figure 4.6) presents evidence that it may be

reasonable to hypothesize that codons are equally frequent. A test of this

second hypothesis is also made. A simple likelihood ratio test is performed

in each case, and residuals are presented as in Anscombe (1) to indicate the

direction and magnitude of departures from the null hypothesis.

RESTRICT

This program searches for restriction endonuclease recognition sites in

a formatted DNA sequence file. Our package includes over 95% of the re-

striction endonucleases listed in the New England Biolabs Catalog. Restric-

tion enzymes not currently listed in our package may be added by any user.

In this way, the program's restriction enzyme list grows symbiotically from

its users. The user can select an individual enzyme or a class of enzymes.

The two classes of enzymes are: blunt-ended versus 5' or 3' sticky-ended

products, and recognition sequence length (6 base, 5 base, etc.). As an

example, a map of restriction endonuclease recognition sites is given for

the uvrD gene of E. coli is in Figure 2.

The program RESTRICT lists the number of recognition sites and their

positions for each enzyme selected. A map of the restriction enzyme recog-

nition sites is displayed in a line plot for each endonuclease (Figure 2).Those endonucleases which have no recognition site within the DNA sequence

are also listed. Lastly, the program predicts the lengths of the DNA

fragments generated by a digest with each restriction enzyme. These pre-

dicted fragments are listed in descending order of length.

FREQUENCY

The program FREQUENCY determines the nucleotide composition of a

formatted sequence. Frequencies of letters (e.g. nucleotides) and frequen-

cies of pairs of letters (e.g. dinucleotide frequencies) are reported. A

likelihood ratio test is performed to test the hypothesis of independence in

the random sequence model described below under MARKOV. A line plot for the

distribution of pairs of letters along the formatted sequence is also given.

DIFFERENCES

This command calls the DEC system utility CMP (10) to compare two

files. The call is useful in checking for errors during the entry of

sequence data.

244


IHIS PROGRAM LOCATES RESTRICTION SITES WITHIN A NUCLEIC ACID SEQUENCE.WHAT IS THE NAME OF THE SEWENCE FILE? UVRD.SEQYOU HAVE AN ENZYME LIST ON YOUR ACCOUNT.CREATE A NEW LIST OF ENZYMES?(Y/N] NGIVE A TITLE FOR YOUR SEQUENCE? UVRDWHAT RANGE OF THE SEDUNCE DO YOU WISH TO MAP?BEGINNING AT? I ENDING AT? 1190

RESTRICTION SITE SEARCHING DATA FOR UVRD.SEUTABLE 1. GIVES THE NUMBER OF SITES AND THE LOCATION OF SIIES FOR EACH ENZY"E.

RECOGNITIONENZY"E SITE SITES SITE LOCATIONSAFLII C^TTAAG 1 783AHAII GR^CGYC 1 727ALUI AG^CT 5 42 358 773 98 989

STUI AGG.CCT 1 874TAGI T"CGA 4 436 582 757 811^ DENOTES THE CLEAVAGE SITE

THE FOLLOWING ENZYMES FOUND NO RESTRICTION SITES:ACCIAATI IAHAIII

XMI IIXMNI

FIGURE 1. GIVES A MAP OF RESTRICTION SITES FOR RANGE BEGINNING AT I AND ENDING AT 119

1 119 237 355 473 591 709 827 945 1063 181I- -- - I-----__ ---I_ --_ -----_ I- --------__ I- --------__I-_ -------___ - --------__ I- -------_-_I- -----_ --_ I-_________I

AFLII ------------------------I- ----------------------------------

AHAII----------------------------------i--------------------

ALUI --I ----- I--- ---- --- --- I------- -----------------

STUI ..

TAGI ------------------------------------,--- --, .I-----_,----- - -----

1 119 237 35b 473 581 709 827 945 1063 1181

DENOTES A SINGLE CUT* DENOTES MULTIPLE CUTS

TABLE 2. GIVES THE NLMER OF BASE PAIRS IN RESTRICTION FRAGMENTS FOR ENZYMES WITH MORE THAN ONE RESTRICTION SITE.

ALUI FRAGENTS: 415 316 125 91BANI FRAGMNTS: 3168SSHI5I FRAGENTS: 920

TAGI FRAGMENTS: 175 146 96

Figure 2. Sample Output of the program RESTRICT. Example shown is therestriction endonuclease site analysis for the E. coli uvrD gene (11).

245


HOMOLOGY

This program finds the optimal alignment of two formatted sequence

files, using a dynamic programming algorithm proposed by Needleman and

Wunsch (34). Kruskal (28) gives a clear exposition of this algorithm, its

variations, and applications. Briefly, we have two sequences X = (X(1),

...,X(m)) and Y = (Y(l),...,Y(n)). We introduce a measure of evolutionary

distance d(X,Y) between two sequences. The distance d(X,Y) reflects the

number of 'mutational' steps needed to generate Y from X. Mutational steps

can be insertions, deletions, or substitutions.

If we form the path matrix of X and Y as in Figure 3, an alignment of X

with Y represents a transverse path from the lower right corner of the path

matrix to its upper left corner. Different sequences of mutational steps

converting X to Y correspond to different transverse paths from the lower

right to upper left corner of the matrix. The 3 mutational steps: inser-

tions, deletions, and substitutions, correspond to horizontal, vertical, and

diagonal steps, respectively, along a transverse path.

The dynamic programming algorithm finds the transverse path through the

path matrix of minimal mutational cost (e.g. minimal evolutionary distance).

In the program the user assigns weights to the different mutational steps

along a path from X to Y (see for example, Erickson and Sellers (14, Table

2)). A number of default weight matrices are built into the program HOMO-

LOGY. When a user increases the weight attached to a given mutation (e.g. A

to T), then the user lowers his assessment of the frequency of that muta-

tional event. The symbol w(X,Y) denotes the weight attached to the conver-

sion of letter X to letter Y. Letter 0 will be taken as the null character.

The mutation X to 0 will represent a deletion, and the mutation 0 to Y, an

insertion (X and Y not equal to 0).This program calculates two entities: (i) a minimal evolutionary

distance d(X,Y) and (ii) all optimal alginments of the 2 sequences X and Y.

First, the program processes the path matrix row by row until it reaches the

mth and last row. When the program reaches the cell (i,j) in the ith row

and jth column of the path matrix during this processing, it has computed

the mutational cost involved in converting the subsequence X(i)=(X(1),...,-X(i)) to Y(j)=(Y(1),...,Y(j)). The mutational cost accumulated in the (i,j)cell is calculated recursively in terms of the net costs d(X(i-1),Y(j)),d(X(i-l),Y(j-1)), d(X(i),Y(j-1)) to reach each predecessor cell, (i-l,j),(i-1,j-1), and (i,j-1), and the cost of the last operation to reach the

246


I A..... ..... ..... .....:...........:........................:........... I............:........... ........... !........ ... !........... I.....----- ----- ----- ........ ........ ........ ........ ........

III1- 21 1- J, 11 .. 1- 31 1- .. 1- ?. 1- S. I. 9, I.... 1-ti- .... I. Iji I'M 151 1-161 PM 1-10 191 I'M 1-211 I'ZWI I'iJI I-,41 1-n 112b, 1-27, 1-261 IIZVI 1-301 1-311 11321 1-331 11341 1-351-------------------------------------------------------- --- -------

to 0- 11 2:261: 01:'I 2. 1: I 1: 1:311: ..I: ..I. I..",0: !: f: 11: 1: .1; I: I.:I111 '71I. 2 1- I ':2; 0 1-24, 1-251---- ----- ----- ----- ---

':.Il 01:.'I: fl'...- -

k. 1- 11 1. ':2'41 t. 5. 21, 6, 7. I. I. I: 1: 1:".. ':.I PQ ':.I 2

A. 1. k- 11 1:.I: 11:..I: 2:2,1: 0 1: !:,,I 1: 1:,l 1:,I: 1:3.1: 0 1:------ ---------- -

21 1. 1: 21: 7: 11: I.: .1: 1. 11 1. I.: 11: 11: I: 11 .1 I., 111:1.1: I:.!: la 11: O.:..,: 11:111: 11:1.1: 11:1.1: 11 ------I :,-: _.. -

- --- __ __ --- __ __ __ __ __- __-__ __ __ __ __ __ - __ __ __ __- __- - - __ __ --- __--- --- __ __-- __ - __ __ __ __ __---_;____--------____ __---------- ---------__- __ ____ ____----__ __ __ __ __ - __ __ '_ -----------11 2, 1, 2 :,I: 2 1:.I: 1:.I: 1:.I: 2:

to :.O 1:1:. 111.1 1: 11:.. :211. o.:1.1, 00': 0 1: ':211: I 1. 1:1.110 1: 1:.4111 1: '1:. :2- %31 I I I' bl 111- 0: It I:.,

-------------- ------ -------- -------- ------ -------------------------------II,--- 1. 0- 1. 11 0- 11 I..I: 1:,.I: 2 2, 1: %1:1 '1:2

061 1- 5.1 .1 3: " 2- :2: '1:

-

Al- 11 '. 11 2: 1: .11: .1 V 1. 0 1: I:nl: 2:.I: 1:,,I: 0:.I:1:I1: 1): 1: 2: 1: 7:0 I':.6 0:2271 11 6. 51 I O.:I1:I 1: 1: 0 0:..---

2- 1. 1: 0: .1: 2: 13 N( 1.2 V31 O- 1. 1. 1,.1 2- II 2 2 0III I1- S. 1- 7. 1- S. 0 ': ':2. 2I :I3 I-224'":.O,,, 1:.11: 7:.",: 1:.11: :.I 1:.6 1:.7

--- -------- -------- --- ----------To--- If 11 2: 1: 0: 61: 2.. 1: 0: II 2. II I2; .,: ,Iz- A t, 2. 1- , :..I: %1: I.,l: %1: 2:.I: 1.,I: 2..1, 1:.I:

4, N 1. 4. 6- 1- 0

- --------------------------1: 2. 1. 1. 11 21 11V 0- I

211 2 1: 1:131: 2:,.I: 1:.,I: :..I: 2:.1 1: 2...1, 0 1. 1:.I: 0 1: 1: 2: 1:

1. 1:a 7: 2: !. ':11 51 1- 4- 1- 3- 1- 1- 3: 1- 7 21

-----------------

1: 2. .1. 0: JI: 11 .1: 1: 1: 1: 1: .1: 1: 1: 0 1: 1: .1: 0:.!: 1:.21: 2I: to 1:.11: I., '1:171 O.:.Ill I!2.7: ----- - - _.:.1-------------------------------------------------------------------2 21 2 1; 2 1; Z 1: 2: 1: 1. 61"------ :__ __:__ __:----- --------- 11 2- 11 2. 1- 22 ':23-10- 1- 9. 1. 6- 1. 1 1-11, '1:1;: 1:2---121 1-111 7, ,6, 1- 4- 1- 31 1- 61 INO 1- 61 1- 1711 I: .1: 1: I., 1:1.1: 1:111: 1:.': 7:131: P. 'r.": :.II

--------------------- -------- -2: 0 2-I 2- 1- 21 -. 21 11 21 11 0- 11 2:1:: e:,;: 2 0:.. I ': 0:21': ' ': 0 ': ':2.2...1): 1: 91 1: .1 I: 17: 11 .1 .l 11: 0. I: 1. 11 !II)$ 161- 1- bil 111 4111 1- S: 1: 1- 61 1. 71 1- 1111I 11 1-101 ':II 1:. I:U :23

1: I 1: 0 1; 2: .1: .1; 1. 1; 1: I 1: 2: 0 1: 1:0I: Z: 11 2- 1. .. 1. 0- 1: 1: 91,1 2:1.1: 11 1. 1. 11 O- 1. 2. 11 0. 1. %1: 1:,.I: 2:.I: ):.I::.I :.O 7: 4 11 II "7. .1.. 1.12..12': .11111 11.61- --------------------

- ----- ; IIN 1. 1. 2 1, 2: 91,A. P O- 11 31 1.0 ': ':I' ': ':I.'': ' ': 7:11111 7: ":.11 -201 '-2.-:10 1:1: '11 41- .1 1- .1 6. 7. 1,41 7, I- '1:.. P.11 ':0 1:.!: ':.5

I 1: I.. it 1: 'I, 0. .1: 2. 1, 1: e: 1, 0- 1, 1. 1- 2. 1, 2. 1, 2- 11 0- 11 I 11 0- 1.2 11 0. 11 !.16. 0g. 1. 91 illoo 1-11, 1-12- 11131 1-140 1-151 1:.7 1:1., 1:

P 7.------------------------2. 1, 1- 1- 11 II 2: II., 1: 2: $I , 1: 1: .1: .1: :.O,: 2:,21: 0:.31: 1:,41: 1:.,I: 1:161: 1:.71: 2:..1o 1:..Il 2: .1:2I: I:to to41 I I----------I--- ---- -----

11 0 1: .1: 1: .1: 0 2: I, 2: 1: 1: 1: .1: 11 2. 11 2- 11 11 I- 1-111: 1:1.1: 1:.,I: I:oo: 1:.,I: ll:..I: 2:.,I: 0 1: 0:.1 1: 2: .1:II 1. 9. I.I.. :.. 7:-----------

:..---- -------I-------- --

0:.,,: 1:1,1: 1:..I I'l 1:..I: 1:,I.111,1: 0 1. 1, II 2 If2 , 1: 1:.o, I'll, 1: 2. .1. 2, 1. 1...1.1 1: 9: 1:,.I: 2: .1, 2...I: 1.1: .1: I 1: 1:I 2:12 -1 I: to to1,

2:.:: I..,;1:1:.,l ---- - - __ VG-. 1- 0- 11 1. .1, 2: .1: 1:..I: 1:.;: 2:.31: 2:..I: 2:,51.1 2:..I: 1:117: 1"..I: 1:..I:1119, 1-18, I'll,I 1. :,0

------------ ----

11 2- 11 I: 1:".: 1:.,.: 2:,..: 1:".: 1:...: 2:.,.: 2:1"4'1:10 Pool '1421 -.11 ':.. -------------------------- - - ----- -------1. 11 0- 11 11 11 1. 11 11 11 1. 11 2 11 T 11 .1: 1:,01: I,:.I 1:.,I: 2:..Il 0 1:02I 1: f4i, 11 IOPM '42 Pool O.:I;l 21-11- Ilioi I. 9:,--'121 1-211 '1201 1-19, P ool. 1-17, 11161 1:15, .I:I-I '1:13 I'll, 11101 'P oll 7.121 'I 11 'I 2

%i., ,:.I: 2:.31: 1;..I: 0 1: 1: .1: 1:,21: 0 2:1.11 0 2. .1 2: 1. 1: I: 2:,01: 1:.1 1: 2:,:: 1;.,I: 1:1.1:1: 046 2:.I: 1:.,': 141I I':W 2120 -410------------ ---

ll:..I: 1:..I: 2:.;: 1:.o: 2: 1: 01 II 21 11 11 Zo 11 0- 10 2- 11 0 1: 2 1: 21 1: 2:.41: 11 1:02 ",O1: 11: I., :1411: 1411: 1:111: 11:1:: 1:.1: 042 1:1;: '14311 .43 P 11. I. So IP 11I I. of Iltoo 1:11: ':.:D :.Z :2. .4.

: 11 21 0- 11 21 I: 10 Z' 11 2- 11 21 2- 11 2- It 2- 1: Z:141:2;..I: 2:.,I: I...1: 01.31--: 1:.31, 1:..I: I1: 2:.21 0

- ----- ---- ---- ----------- ----- ----- ---- -------:25 2-11 'I . 1.

2:1-17-0 T: 1:,.I: :,31: 1.1:.21 1:.311 1:..I. 2:.I. 2.0 2 'I 0- I- 2 01Z: Ill 112 if 0- 11 21 11 2- II 2:1.191 .18. 1.141 1.15 .1 :I,21.zll '1174, 'I'M '1122, '1121- ... 1.11, 111-14, 1-15, 'I go it 91 1-101 1-111-1 1 -13

1.--. - ------ 2 201; 01: '1': 24 1: ': ':.I 1:1. It .1-271 -281I':25 123 1:2;; '1:211 0.:.l '1:1111I '116 -.1 I7- .1. '441 Poll :.- .43 V.- 141 -1 -1 ---; ; 1:..,: 2:..Il 0 11:.,I: 0 11 2:,01: 1:I If 20 If 21 ti of 1:A, II 2:101:.I. IIIto I' 9:1: 2 12 1: 0 II 0:2.I 1): 16 1: 1:1.1 I':.': 10:111411i 11251 10:241' 1-23 II22. ':2.1 '1:..: ZI:.. -.1120, P271I

I._ 1, 2:211 11 2:n1.1 1:,1.1 2:.I:2 2:ZI, 0 2 1:0 01'.1; 1:.I: 2:..I:0 11;..I: 21.11. 1:..I: 1:..I. 21 I: 2:..,., 1:.,I: 11,41: 1:.,I: 2:.,I: 2:"I: 1:,.I: 0 12 1.11 2:,,I::. I:.. to :.I. I.,111I__ .29, 19. '1:21 V :22 7 71 1- 7 1,

if 2- it 2. 11 2- 2. .1. 0:..I: 1:.41: I 1: 1:,91: 2:..I: 01.81: 2:.911 21 0-..,: Z:',':.30: 1.291 1,2.1 1,27, .261 '1:24 1 .11L - ----

:"I: .1.1 1:2.1: 1:.I: 1:.,: .:.I: 2:.I: Z:2.1: 2:2,1.1 0:2,1: 2:..I: 1:.91: 0:..I: 1:.I: 2,1.1: 01.01: 2:.I: 1:1.1.1 1:,.,: 11.1,: 011.1.1 1:.,I: Z:..Il 2:.Io 1:..I: 2:"I: 2:1.1: 1:.,I.I 1%,I: 1.1; ..I:Zo 1. 7 :2:,. III :,I.1,101

of 1. 1. 2:.I: 1:,Il2 11 1: 11:'.1: 2:'I to 11 11. 11 11 to0 0 to '1:.7. '1:111111 01:0. "I'llI IO'..,." 2: 1::30 ':. ':21 11 10.:2-1 to-32- 1-31, -2. 7 ':11111.

Figure 3. Path matrix which represents calculation of the optimal align-ment, and relatedness of two sequences. Upper X-axis represents the source

sequence (E. coli uvrD gene, 5-prime region), and the Y-axis represents thetarget sequence (E. coli lexA gene, 5-prime region). The transverse pathtraced between tEe corners represent the mutational steps requires toconvert the source sequence to the target sequence. Separations of thetransverse lines represent different "paths" used to achieve optimal align-ment. See text for full explanation.

current cell (ij):d(X(i-1),Y(j + w(X(i),O), (delete XM )

d(X(i),Y(j) = min d(X(i-1),Y(j-1)) + w(X(i),Y(j)), (substitution) (1)d(X(i),Y(j-1 )) + w(O,Y(j)), (insert Y(j))

247


The program HOMOLOGY calculates all the accumulated evolutionary distances,

d(X(i),Y(j)), until the net evolutionary distance d(X,Y) is reached. This

distance is calculated without reference to the specific mutational path

taken to convert X into Y. These calculations are shown graphically in

Figure 3.

The net evolutionary distance d(X,Y) is then used in a second pass

through the path matrix, to find the optimal alignment(s). The evolutionary

distance d(X,Y) is used to excise unneccessary cells from the path matrix as

in Fickett (15). For example, if on the second pass through the path matrix

a cell (i,j) is reached whose net evolutionary distance d(X(i),Y(j)) is

greater than the minimal evolutionary distance d(X,Y), the program ignores

all remaining entries in row i. The remaining cells in row i are dead and

trimmed out. As the program HOMOLOGY traverses the path matrix a second

time, the program sets up pointers on each cell visited:

(i-l,j ), (delete X(i)),

pointer(i,j) = (i-1,j-1), (substitute X(j) for X(i)) (2)(i,j-1 ), (insert Y(j)).

When we reach the last row of the path matrix, we can then 'backtrack' along

the pointers in (2) to find the optimal alignment(s) (See Figure 3).In backtracking, we can take a "high road" (an optimal alignment with a

maximum number of deletions among optimal alignments) called the upper

envelope, or we can take a "low road" (an optimal alignment with a maximum

number of insertions among optimal alignments) called the lower envelope.

As shown in Figure 3, these two envelopes represent two extreme forms of an

optimal alignment. When substantial similarity exists between two sequenc-

es, the upper and lower envelopes agree substantially along the two enve-

lopes. The process of trimming dead cells can be thought of as

approximating the upper and lower envelopes. The approximations to the

envelopes from knowing d(X,Y) can be improved.

The homology program first computes the evolutionary distance between

two protein or nucleic acid sequences and then computes the homology, if

requested. This distance value is useful for localizing regions of homology

between two larger sequences, and in testing hypotheses of evolutionary

relationships (see 12).*

ENVELOPE

After running the program HOMOLOGY, the program ENVELOPE automatically

248


FROM FILE UJVO.SEBCTGATATAA[CAGrCAAATCTGTATATATACCCA

FROM FILE LEXA.SEBCTGTATATACTCACAGCATAACTGTATATATATCCA

A tSSiTtUTION lb AT THE BEGINNINGTHE LENGTH OF THE LGNNT IS 36A SUMTITUTION 18 AT THE BEGINNINGTHE LEN3TH OF THE ALIONNET IS 36TIE ELAPE TIME IN SECONDS IS 4

ENVELOPESLOWER UPPER

SOURCE TARGET SOURCE TARGETC x C x C x cT X T x T X T6 X a X G X G

T x AA X A X A X AT x T x T X TA X A X A X AT X T x T X TA X A X A X AA C X A LT X T x T X Tc x C x cA X A X A

C X C X CA X A X A

O X G X O X 0C X C x C X CA X A X A X AA T X A TA X A X A X AT A X T AC X C X C X CT X T x T X TG X G X 6 k GT x T x T X TA X A x A X AT X T X T X TA X A X A X AT X T x T X TA X A X A X AT X T x T X TA X A X A X AC T x C TC x C x c x cC x c x C x CA X A x AX A

Figure 4. Sample output of the program ENVELOPE. The sequences are thesame two used in Figure 3. Displayed in this output are: (1) matchesbetween two sequences, indicated by an "X" between the sequences, and (2)two possible optimal alignments, indicated by an "X" between the pairs ofsequences. This output demonstrates the presence of a lexA repressorbinding site in the uvrD gene (11).

takes the stored output of HOMOLOGY and prints the upper and lower envelopes

through the path matrix. The program ENVELOPE backtracks along the pointerswhich HOMOLOGY sets up in the path matrix. The pointers guide ENVELOPE

along two transverse paths (the envelopes) from the lower right corner to

the upper left corner. The two envelopes represent two extreme forms of the

possible alignments. An example fo the optimal alignments is shown in

Figure 4, using the 5' flanking regions of the uvrD and lexA genes of E.

coli (11). Examples of homology between longer DNA sequences are shown in

Eckenrode et al. (12). As shown in Figure 4, the envelopes are listed with

249


I

U.EIUUmE asEU mmmumm on mUm.nwXu * umE

U*:.U Ua m m .

;-U

Figure 5. Output of the program DOTMATRIX. Upper X-axis represents thesource sequence (E. coli uvrD gene, 5-prime region), and the Y-axis repre-sents the target sequence (E. coli lexA gene, 5-prime region). The align-ment is presented with a fil-ter of 2 (see text).

X's on either side to indicate matches, and X's down the middle to indicatethe coincidence of the upper and lower envelopes. Whenever there are X's in

the middle column, the optimal alignment of the sequence X and Y is uniquelyspecified.

The printout of the upper and lower envelopes allows the user to definethe stringency of a match (i.e. the presence or absence of an 'X') in terms

of the weight matrix w. The user may define a match if and only if w(X(i),-Y(j)) = O, or less stringently, if and only if w(X(i),Y(j)) < C, a positiveconstant.

Using the Fortran-77 compiler, sequences of up to 40 kb in length maybe compared in a reasonable period of time.

DOTMATRIX

Program DOTMATRIX utilizes the AED 512 color graphics terminal to

generate a graphic display of the homology between two formatted sequencefiles. The program can be used to find homology between two sequences,

repeats within a single sequence, inverted repeats within a sequence, and

other structures within a sequence, or shared between sequences. One

sequence serves as row labels (X-axis) and the other sequence as columnlabels (Y-axisj. As in the HOMOLOGY program, weights are assigned to pairsof letters. A match between a pair of letters is defined by their weightbeing less than some user-selected threshold, usually 0. If a letter in theith row (ith position of first sequence) and jth row (jth position of second

sequence), then a dot appears on the screen. The program generates the dot

250


matrix by comparing every letter in the first sequence with every letter in

the second sequence.

The program permits the user to "filter" the dot matrix by allowing

them to select the length of matches required for a dot to be placed on the

screen. The alignment of the 5' flanking sequences of the uvrD and lexA

genes, with a filter of 2 is shown in Figure 5.

MARKOV

The current data-base of DNA sequence information, representing diverse

phylogenetic origins and a variety of gene functions, is being explored by

statistical analyses (7,30,38,52,53,58). The program MARKOV fits a Markov

chain model (5) to a formatted sequence file. The symbol X(i) will denote a

letter in the ith position of the sequence X = (X(1),X(2),...).

A special case of the Markov Chain Model is the random sequence model,

a frequently used starting point for hypothesis tests of homology (13,14,

15,31,42,43). The random sequence hypothesis assumes: Ci) the bases

X(1),X(2),... at different positions 1,2,... are mutually independent and

(ii) the frequency of letters (e.g. base composition) along the sequence is

homogenous. This model has been demonstrated to frequently fail (unpub-

lished results using genes randomly chosen from GENBANK). An alternative

model, the Markov Chain Model, allows local dependencies along the sequence

(7,37).A first order Markov chain model makes two assumptions. First, the

conditional probability of a letter, X(n), at position n, given all letters

at previous positions n-1,n-2,..., written pr(X(n)iX(n-l),X(n-2),...), only

depends on the letter at position n-i:

pr(X(n)iX(n-1),X(n-2),...) = pr(X(n)iX(n-1)).

The parameters in the first order Markov Chain are the table of conditional

probabilities pr(X(n)IX(n-l)), which are referred to as the transition

frequencies or transition probabilities. Secondly, the Markov Chain model

assumes that the transition frequencies are homogeneous along the sequence.

Counting the frequency of overlapping pairs of letters (i.e. dinucleotides,within a nucleic acid sequence) and dividing by their total number yieldsestimates of the transition frequencies. The transition frequencies intro-

duce a local one step dependency between letters in sequence.

The program MARKOV first tests the random sequence model's homogeneityand independence assumptions. If the random sequence model fails, then the

program fits a first order Markov chain by the method of maximum likelihood.

251


The program tests the homogeneity of the chain and the one step dependency

assumption. Other tests of the chaini, such as those based on runs of

letters, are made as well.

Sometimes there are more long range contextual constraints on a se-

quence due to codon preference and/or the amino acid composition of the

polypeptide being coded for (30,52,53). The program allows the possibility

of fitting a 2nd, 3rd, or 4th order Markov chain in which the current letter

X(n) depends on the previous 2, 3, or 4 letters, respectively. The range of

applicability and the uses of these models will be discussed elsewhere (38).

DISCUSSION

Gingeras (17) has suggested that there are two goals to most sequence

packages, relating different genetic data (restriction maps, other sequence

data, cytological maps of chromosomes) to the particular sequence data of

the user and the detection of biological signals in sequences (e.g. promoter

sequences, ribosome binding sites, etc.) Computers in molecular biology are

now being adapted to a broader class of applications (17,20): (i) experi-

mental design (3, 4, 16, 57); (ii) data collection (2, 19, 22, 25, 27, 50);

(iii) data cleaning (33); (iv) data base management (54, see also 44, 45,

46); (v) exploratory analysis (44, 46, 57); (vi) modeling (30, 31, 52, 53,

57). Three areas deserve particular attention. One major advance in the

organization of a sequence data base will be the creation of a system with

dynamic memory organization (26) to allow more rapid access to GENBANK. Two

such systems have already implemented for a library and in modeling human

memory (26). Two, graphics (55) should be an integral part of any analysis,

and the user should be able to invoke graphics primitives (35, 51, 23, 21)

without fear of disrupting an ongoing analysis. The graphics should be

directly modifiable. Three, much effort has been placed in the development

of good statistical packages (6, 39, 56), and this effort should be capital-

ized on in the development of statistical packages for modeling sequence

data.

ACKNOWLEDGEMENTS

We wish to thank R. Ivarie, R. Geever, B. Tyler, A. MacRae for comments

on the package. This work is supported by the U.S. Army Research Office

under Training Grant DAAG29-83-G-0111.

'Current address: Department of Genetics, North Carolina State University, Raleigh, North Carolina27695. USA

252


REFERENCES1. Anscombe, F.J. (1982) Computing in Statistical Science through APL,

Springer-Verlag, New York.2. Arnold, J. and Kankel, D.R. (1983) in Computer Science and Statistics:

Proceedings of the 14th Symposium on the Interface, Heiner, K.W.,Sacher, R.S., Wilkinson, J.W. (eds.), Springer-Verlag, New York.

3. Bach, R., Friedland, P., Brutlag, D.L., and Kedes, L. (1982) Nucl.Acids Res. 10:295-304.

4. Bach, R, Iwasaki, Y., and Friedland, P. (1984) Nucl. Acids Res.12:11-29.

5. Basawa, I.V., and B.L.S. Prakasa Rao (1980) in Statistical Inferencefor Stochastic Processes. Ch. 4. Discrete Markov Chains. AcademicPress, New York.

6. Billard, L. (ed.) Computer Science and Statistics: Proceedings of the16th Symposium on the Interface, Elsevier, Amsterdam, Holland, inpress.

7. Blaisdell, B.E. (1983) J. Mol. Evol. 19:122-133.8. Devereux, J., Haeberli, P., and Smithies, 0. (1984) Nucl. Acids Res.

12:387-395.9. Digital Equipment Corporation EDT Editor Manual (1984), Document

AA-J726A-TC, Maynard, Mass.10. Digital Equipment Corporation Utilities Manual (1984), Document

AA-1681A-TC.11. Easton, A.M., S.R. Kushner (1983) Nucl. Acids Res. 11:8625-8640.12. Eckenrode, V.K., Meagher, R.B., and J. Arnold (1984) J. Mol. Evol.

21:259-269.13. Elleman, T.C. (1978) J. Mol. Evol. 11:143-161.14. Erickson, B.W. and P.H. Sellers in Ch. 2 of Time Warps, String Edits,

and Macromolecules: The Theory and Practice of Sequence Comparison,Addison-Wesley, Reading, Mass.

15. Fickett, J.W. (1984) Nucl. Acids Res. 12:175-192.16. Friedland, P., Kedes, L., Brutlag, D.L., Iwasaki, Y., Bach, R. (1982)

Nucl. Acids Res. 10:323-340.17. Gingeras, T.R. (1983) in Statistical Analysis of DNA Sequence Data,

B.S. Weir (ed.), Dekker, New York.18. Gingeras, T.R., Milazzo, J.P., Sciaky, D., and Roberts, R.J. (1979)

Nucl. Acids Res. 7:529-545.19. Gingeras, T.R., Rice, P., and Roberts, R.J. (1982) Nucl. Acids Res.

10:103-114.20. Gingeras, T.R. and R.J. Roberts (1980) Science 209, 1322-1328.21. Goad, W.B. and Kanehisa, M.I. (1982) Nucl Acids Res. 10:247-263.22. Gray, A.J., Beecher, D.E., and Olson, M.V. (1984) Nucl. Acids Res.

12:473-492.23. Hamon, E. and Ruskin, J. (1983) J. Biol. Chem. 258:1318-1327.24. Keller, C., Corcoran, M., and Roberts, R.J. (1984) Nucl. Acids Res.

12:379-286.25. Kieser, T. (1984) Nucl. Acids Res. 12:679-688.26. Kolodner, J. (1984) Retrieval and Organizational Strategies in Concep-

tual Memory, Earl Baum, New York.27. Komaromy, M. and Govan, H. (1984) Nucl. Acids Res. 12:675-678.28. Kruskal, J.B. (1983) in Ch. 1 of Time Warps, String Edits, and Macro-

molecules: The Theory and Practice of Sequence Comparison, Addison-Wesley, Reading, Mass.

29. Lewin, B. (1983) Genes, Wiley, New York.30. Lipman, D.J. and Wilbur, W.J. (1983) J. Mol. Biol. 163:363-376.31. Lipman, D.J., Wilbur, W.J., Smith, T.F., and Waterman, M.S. (1984)

Nucl. Acids Res. 12:215-226.

253


32. Maxam, A.M. and W. Gilbert (1977) PNAS (USA) 74:560-564.33. Naus, J.I. (1975) Data Quality Control and Editing, Dekker, New York.34. Needleman, S.B. and C.D. Wunsch (1970) J. Mol. Biol. 48:443-453.35. Novotny, J. (1982) Nucl. Acids Res. 10:127-132.36. Nucleotide Sequences 1984 (1984) Parts 1 and 2,. IRL Press, Washington

D.C.37. Nussinov, R. (1981) J. Mol. Evol. 17:237-244.38. Phillips, G.J. and J. Arnold (1985), in preparation.39. Polhemus, N.W. (1983) in Computer Science and Statistics: Proceedings

of the 14th Symposium on the Interface, Heiner, K.W., Sacher, R.S.,Wilkinson, J.W. (eds.), Springer-Verlag, New York.

40. Queen, C.L., and Korn, L.J. (1980) Methods in Enzymol. 595-609.41. Sanger, F., Nicklen, S., and A.R. Coulson (1977) PNAS (USA)

74:5463-5467.42. Sankoff, D. and Cedergren, R.J. (1973) J. Mol. Biol. 77:159-164.43. Sege, R.D. and Saxberg, B.E.H. (1982) Nucl. Acids Res. 10:375-389.44. Soll, D. and Roberts, R.J., eds., (1982) The Applications of Computers

to Research on Nucleic Acids, IRL Press, Washington, D.C.46. Soll, D. and Roberts, R.J., eds., (1984) The Applications of Computers

tQ Research on Nucleic Acids Part 2, IRL Press, Washington, D.C.47. Staden, R. (1977) Nucl. Acids Res. 4:4037-4051.48. Staden, R. (1978) Nucl. Acids Res. 5:1013-1015.49. Staden, R. (1980) Nucl. Acids Res. 8:817-825.50. Staden, R. (1984) Nucl. Acids Res. 12:499-503.51. Staden, R. (1984) Nucl. Acids Res. 12:521-538.52. Staden, R. (1984) Nucl. Acids Res. 12:551-567.53. Staden, R., and Mclachlan, A.D. (1982) Nucl. Acids. Res. 10:141-156.54. Stockwell, P.A., Nucl. Acids Res. 10:115-125.55. Tufte, E.R. (1983) The Visual Display of Quantitative Information,

Graphics Press, Cheshire, CT.56. Tukey, J.W. (1983) in Computer Science and Statistics: Proceedings of

the 14th Symposium on the Interface, Heiner, K.W., Sacher, R.S.,Wilkinson, J.W. (eds.), Springer-Verlag, New York.

57. Weiner, P.K., Langridge, R., Blaney, J.M., Schaefer, R., and Kollman,P.A. (1982) Proc. Natl. Acad. Sci. (USA) 79:3754-3758.

58. Weir, B.S. (ed.) (1983) Statistical Analysis of DNA Sequence Data,Dekker, New York.

254

a comprehensive package for dna-sequence analysis in fortran-iv for the pdp-11

Documents