alscript: a tool to format multiple sequence alignments...for example, the unix 'troff' macros 'tbl'...
TRANSCRIPT
-
Protein Engineering vol.6 no.l pp.37-40, 1993
ALSCRIPT: a tool to format multiple sequence alignments
Geoffrey J.Barton
University of Oxford, Laboratory of Molecular Biophysics, The RexRichards Building, South Parks Road, Oxford OXl 3QU, UK
Introduction
Alignments of protein and nucleic acid sequences are a centralaid to understanding biological systems. Although severaleffective tools now exist for the rapid automatic alignment oflarge numbers of sequences (see, for example, Barton andSternberg, 1987; Feng and Doolittle, 1987; Higgins and Sharp,1989; Vingron and Argos, 1989), the preparation for publicationof quality figures of such alignments with annotations is oftenextremely difficult.
Artwork using pen/ink/dry lettering can allow labelling andboxing. However, this involves a great deal of labour and a fairdegree of skill to produce visually pleasing results [e.g. the large128 sequence globin alignment in Barton and Sternberg (1987)1.With the advent of powerful word-processing and graphicspackages for personal computers, many laboratories 'tidy up'their alignments by judicious mouse-work. Although good resultsmay be obtained by this route, the process, in common withconventional artwork, is rather inflexible. A particular problemis that having chosen a number ofcharacters per line and characterpoint size, it is difficult subsequently to modifz the figure. It isalso laborious to add new sequences to an existing figure. Inaddition, one is normally limited to the use of one of a few fixed-width fonts (e.g. Courier) rather than the full range of fontsavailable to the word-processor. The table-drawing facilities ofvarious packages can overcome the problems of proportionalfonts. For example, the Unix 'troff' macros 'tbl' were used toprepare the alignments in Barton and Sternberg (1990), thoughsubsequently retyped by the J. Mol. Biol.l L{leX (Lamport,1986) tables were used for the 88 sequence annexin alignmentin Barton et al. (1991) and the 67 sequence SH2 domainalignment in Russell et al. (1992): both figures were extremelytime consuming to prepare, but rather easier to update than aconventionally word-processed alignment.
The ALSCRIPT program described in this article wasdeveloped specifically to allow the easy formatting and graphicaldisplay of large multiple sequence alignments. Although writtenoriginally for the author's use, the interface is relatively friendly,and should be easy to learn by anyone familiar with plottinggraphs.
Description of ALSCRIPT
ALSCRIPT takes a multiple sequence alignment in the simpleAMPS @arton and Sternberg,1987; Barton, 1990) block-frleformat and a set of formatting commands, and produces aPostScript file that may be printed on a PostScript laser printeror viewed using a PostScript previewer (e.g. Sun Microsystem'sPageView program). GCG 'MSF' format files and CLUSTALformat files (PIR) are also supported. ALSCRIPT is strictly aformatting, display and annotation tool. It is not a program for
multiple sequence alignment, or editing. The previous programsthat offer the closest functionality to ALSCRIPT arePRETTYPLOT and PRETTYBOX as supplied with the GCGpackage (Devereux et al., 1984), and the latest colour versionof SOMAP (Parry-Smith and Attwood, 1991). Whilst theseprograms do not provide the same degree of flexibility in displayas offered by ALSCRIPT, they do allow the calculation ofconsensus sequences and automatic shading/boxing according todehned rules. The aim of ALSCRIPT is to allow the user totalcontrol over the display ofthe sequence alignment; such controlis essential if non-sequence information is to be used to highlightfeatures of the sequence: for example, the location of active-siteresidues, the positions of secondary structures (cy-helices and p-strands), and domain or intron/exon boundaries. The flexibilityof ALSCRIPT also permits it to be used as a 'front-end' displaytool for programs that offer sophisticated alignment analysisfeatures. For example, the AMAS (Analysis of Multiply AlignedSequences; C.D. Livingstone and G.J.Barton, in preparation)system, which highlights structurally important regions of amultiple alignment, generates ALSCRIPT commands as oneoutput option. Similarly, the consensus algorithms encoded inthe GCG program PRETTY could be used to generateALSCRIPT shading, boxing or font-changing commands forgraphical output.
Given a block-file and the text point size, AIJCRIPT calculateshow many residues can be fitted across the page and how manysequences will fit down it. It then prints the alignment at thechosen point size on as many pages as are needed. RunningALSCRIPI with a smaller or larger point size will automaticallyrescale the alignment to f,it on fewer or more pages, asappropriate. The actual page dimensions may be reset to anyvalue, so if an ,A.3 PostScript printer or fypesetting machine isavailable, alignments can readily be scaled to make best use ofthe extra space.
Each output page has three regions. The left hand edge containsidentifuing text for each sequence, the main part of the page holdsthe alignment, and the top part, the position numbers and optionaltick marks. ALSCRIPT commands make use of a charactercoordinate system for font changes and other formattingcommands. Thus, any residue in the alignment may be referredto by its sequence position number (x-axis) and sequence number(y-axis). Ranges of residue positions or sequences may also bedef,rned in the character coordinate system.
The basic ALSCRIPI commands allow the followinefunctionalities.
FontsAny PostScript font in any size may be def,rned and used onindividual residues, regions or identifrer codes.
Boxing
Simple rectangular boxes may be drawn around any part of thealignment. Particular residue types may be selected andautomatically 'surrounded' by lines. For example, if thecharacters 'G' and 'P' are selected, lines will not be drawn
5 t
-
G.J.Barton
a FFFLLFFFTFFFFFFFFFIJIJDDDDDDDDDDDDDDDDDD7
EEEEEEDDDEEEEEDDEEEGRRRRRRAOOOOOAAOAAAKKKT T TRRRRRKRRKKRRRREDDAAAVAADDDDDD I TRLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL+
QEEYYYDDDKKKKKKKYYRQDDDEEERQOEEEEEAXDQAAAAASSSSSSSAASSW?
GGGGGGGGGGGGGGGG@MMMMMMMMMUMMMMMU IJLWVEEEEEEEEEEEEEEEKKKKKKKKKKKKKKKKKKKKKKKRRRLLLLLKKKKNKGffiRRRRRRKKKKKKKKIRKRVAAAAAKKKIJI'I'L I'AAAA I AKKKKKKWIIT{WVIWVTVIRI{YIL
+TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT+RRRDDDRRRDDDRRDDRDDDDDDDDDDDDDDDDDDDDDIIHHDDDDDDDDDDDEEDDEEWvvvvEEEEEEEEEEEEKKKHHDKKKNDNNNKKGS FFP P PNNNEEEAAWVI'IDD SAATTTTWVTTTTTTTTTTTKKKVVUIKKQQKKKKKKCLLLLLLLLLLLLLLLLLLLLWWWFFFFFFFFFFFFFFF 6I T I I I I I I I I I I I ITT IVIiINI I INTNI I I I ILLLITTNsWSSSTTTTTTYYTTTTEEMII IVWI I I\AA/II I\NI TII I I IIII I II IVVVI I I I5
VMMMMMLLLLFI,I,LLLLLLLL 6WTTTTTGGGGGCCCCCCA6sssssssssssssssssrssEEETsrrrrNNsssrLLT
+
HHHITHHHHHHHIIHHHHHHTIHHHH
HHI{HITHHHHHHHHHHHHITHHHITHHHHHHHHHHHHHH
HHHHHHIIHH
HHHHHHBBHHHHHHH
IIITHHHH
HHHHHH
TTITHHHITHHHHHHH
HHITHHHHHHHHHHHHHTIITHHHHII
HTITTHHHIIHITHHHHH
HHHHHHHHHHHH
HTTHHHHITITHHIIHH
E TTTTHTTTH
TTHE TTTHE TTTHE TTHE TTHE TTH
TE TTTHE TTTTH
TTTTTTTTTT
TTTTTTTTTTTTTTTTTTTTTTTTTTT
E TTTTTTTTTTE TTTTTTTY
EEE TTTTTHEEEE TTHEEEE TH
EEEEE HEEEEE HEEEEE TH
EEEEEEEE
TTTTHTTTTTT
TTTTTTTTTT
b# (A) Input ud output flcs:
BLOCK_FILE examp1el.blocOUTPUT_FILE examplel.ps
# (B) Pegc Lryout rnd ovcrrll epeclng:
ADD_SEQ 38 1ADD_SEQ 39 3ADD-SEQ 69 1PORTF"AITPOINTSIZE 10
# (C) Font dc0nitiong:
DEFINE-FONT ODEFINE_FONT 1DEFINE-FONT 3DEFINE-FONT 4DEFINE_FONT 5DEFIM-FONT 6SETI'P
Helvetica DEFAULTHelvetica REL 0. ?5Helvetica-Bold DEFAIILTTimes-Bold DEFAITLTHelvetica-BoldObIigue DEFAULTTimes-Ronan DEFAITLT
* @) Brsic formrtting.nd rnnotrtion commmdc:
IDENT_FONT AJ,L 6SURROI'ND-CITARS LIV ALLFONT-CITARS IV AI,L 5FONT-CIIARS ASN ATL 1FONT-CIIARS KR ALL 4BOX_REGTON 14 1 16 38SIIADE-CHARS + AIJL O. OINVERSE-CHARS DE AI,I,SIIADE-CITARS D AIL 0.5SHADE-CHARS E AJ,I, O.OSURROI'ND_CHARS 56+?89 t 40 2'l 40LINE TOP 2 40 8LINE BOTTOM 2 40 8TEXT 2 42 "Helix PatterntrTEXT 12 42 nGlycine LoopnTEXT 2L 42 "Helix PatternnSI]B_ID 49 "HeLix Predictions"SUB_ID 58 nSheet Predict ionsnSIIB_ID 68 nTurn Predictions"
# @) Now improvc thc eppcrencc of thc prcdlction histogrems.
SURROUND-CIIARS IISIIADE-CITARS ITSI'B_CHARS 1
L 4 4 2 7L 4 4 2 7
44 2't 53
535 3 0 . 5
TT SPACE
Fig. 1. Illustration of some of the formatting capabilities of ALSCRIPT. (a) A small AMPS block file. Each aligned sequence is represented by a column.This block file includes character-based histograms that show the result of a combined secondary structure prediction method (Russell et al., 1992). (b) AnALSCRIPI command file. Comment lines start with a # character, sections A-C set up general layout definitions: the ADD-SEQ command allows extraspace to be inserted in the alignment at any location; 39 3 means add space for three sequences after sequence 39. PORTRAIT specifies that the output willbe arranged with the longest side of the paper vertical. POINT SIZE sets the default point size for plotting characters. The DEFINE-FONT commands setup the fonts that are to be used and their point size; font 0 is used as the default. The point size may be set to the default value, an explicit value or relative(REL) to the default size. Sections D and E show the boxing, shading and font changing commands in action. The IDENT-FONT ALL 6 specifies font 6(Times-Roman) for all identifiers. SURROUND-CHARS LIV ALL specifres that all L, I and V characters will be separated from other characters by a line(e.g. at sequence positions 19-21). FONT-CHARS KR ALL 4 switches the font of K and R characters to font 4 (Times-Bold l0 point). When regions areselected the coordinates are given in character units with the origin at the top left hand corner of the plotted alignment. For example, BOX___REGION14 I 16 38 means box from residue 14 to residue 16 of sequences I -38. These coordinates apply irrespective of how many pages are needed to plot thealignment at the chosen point size. The SHADE-CHARS command permits specific characters to be shaded with a grey value; The INVERSE-CHARScommand allows white lettering to be superimposed on the shaded background, as shown for D and E. LINE TOP 2 40 8 draws a horizontal line at the topof the characters from residue 2 to residue 8 of sequence 40, and similar commands allow lines at the left, right and below the defined residues; SUBJDchanges the identifier code for a sequence and SUB-CHARS allows a character to be substituted with another. In the example shown here (c),SUB-CHARS is used to remove the H, E and T characters that made up the prediction histograms and replace them with SPACE ' ' characters. The similarcommands used to format the Sheet, Turn and Summary predictions are omitted for brevity.
38
-
ALSCRIPT
H
FF
HH
rFF
Y
Y
Q e
ElY AY AY sIffDl*Isl&EsK sK sK sK AK ^K SK sY nY AR AQ e
HrR ^Q rQ rEet:lA
EIAEIA AK ^EIeQ e
Annexin I H4Annexin I M4Annexin I R4Annexin V R4Annexin V H4Annexin V C4Annexin II H4Annexin II 84Annexin II M4Annexin IV P4Annexin IV 84Annexin IV H4Annexin VI H4Annexin VI M4Annexin VI H8Annexin VI M8AnnexinV BH4Annexin VII4Annexin III H4Annexin III R4Annexin II lvlilAnnexin II H3Annexin II 83Annexin I H3Annexin I M3Annexin I R3Annexin V C3Annexin V H3Annexin V R3Annexin VI H3Annexin VI M3Annexin IV H3Annexin IV 83Annexin IV P3Annexin V BH3Annexin III H3Annexin III R3Annexin VII 3
Conservation
G RrER ^R ^R AO EQ r
EHQ TO R
I
FFl ' AF rF eIL IAIL IAF rF rF AF AF eF eF rF eF eF AF rF rf a
AAA
AAAA
AAAA
AA
AvlAAAAAA
KKKTTTRRRRRKRRKKRRR
GGGGGGG\:GlJ
GGUI
GGGGG
FT-.r-TNHelix Pattern
Helix hedictions
Sheet Predictions
@ $. t-.. ..ffiI ffi*ql
Tum Predictions
Summaryhediction ffi ffi
between G and P characters. but onlv where G and P borderwith other characters.
Shading
Grey shading of any level from black to white may be appliedto any region of the alignment, either as a rectangular region
IGlycine Loop
16Tl ls 6--6-l fHelix Pattern
.ffi* #:i:::l:;i:li::,:::i:ii::l:::,i:ii::li:it:i::: | |iiiiiiiiililltiililliltli,il #:lll:ilitiillllitffi.,#iiilfi i:
or as residue-specific shading, e.g. to 'shade all Cys residuesbetween positions 6 and 30'.
Text
Specific text strings may be added to the alignment at any positionand in any font or font size.
l,iox clM K GM K GM K GM K GM K GM K GM K GM K GM K c lM K G IM K G IM K G IM K G IM K GM K GM K GM K Glllx clI L I K Gl v l K R
ffi
RRRRRRRRRRRRRRRRRRRR
RRRRRRRRRRRRRRR
TTTTTTTTTKK
sTTTTNNsssIu I
t L lT
FFFFFFFFFFFF
39
-
G.J.Barton
LinesHorizontal or vertical lines may be drawn to the left, right, topor bottom of any residue position or group of positions.
DefaultsAll defaults, e.g maximum numberAength of sequences, characterspacing, may be modified using ALSCRIPT commands withoutthe need to recompile the program.
Figure l(a) illusrates a small section of a multiple sequencealignment in AMPS block-file format. The sequence identifrercodes stored above the alignment have been deleted for brevity.In this example, the block-file also contains character-basedhistograms representing the prediction of helix, strand and turn.Figure 1(c) illustrates the result of running ALSCRIPT on thisfile using the commands shown in Figure 1O). It is not suggestedthat this combination of options gives the clearest representationof the data; rather, the options have been chosen to illustrate manyof the capabilities of the program.
Although written with the aim of producing figures forpublication, ALSCRIPT is a useful research tool for interpretingmultiple sequence alignments. For example, the boxing, shadingand font changing facilities can be applied to highlight aminoacids of a particular fype and thus draw attention to clusters ofpositive or negative charge, hydrophobicity and so on.Furthermore, computer programs for the automatic analysis ofalignments can be made to produce ALSCRIPT formattingcomnxrnds and a block file, thus simpliffing the task of generatinggraphical representations of such analyses.
System requirements and availability
ALSCRIPT is written in C and should compile and run on mostcomputers with a C compiler. An IBM-compatible disk (1.214MB) including the source code and a compiled version of theprogram for 386 DOS computers is available from the author.
Acknowledgements
The author thanks the Royal Society for support, and Professor L.N.Johnsonfor providing a stimulating working environment.
References
Barton,G.J. (199O) Methods Enzymol., 183, 403-428.Barton,G.J. and Sternberg,M.J.E. (1987) l. Mol. Biol.. lW, 327 -337.
Barton,G.J. and Sternberg,M.J.E. (1990) l. Mol. Biol.,212,389-4U.Barton,G.J., Newman,R.H., Freemont,P.F. and Crumpton,M.J. (1991\ Eur. J.
Biochem.. l9rl. 749-7ffiDevereux,J., Haeberli,P. and Smithies,O. (1984) Nucleic Acids Res., 12,
387 -395.
Feng,D.F. and Doolittle,R.F. (1987) J. Mol. Evol.,25,351-360.Higgins,D.G. and Sharp,P.M. (1989) Comput. Appl. Biosci.,5, 151-153.Lamport,L. (1986) IaTeX: A Document Preparation Sysrem. Addison-Wesley,
New York.Parry-Smith,D. J. and Attwood,T. K. ( 199 I ) Comput. App l. Bios ci., 7, 233 - 235.Russell,R.B., Breed,J. and Barton,G.J. (1992) FEBS Lett.,3{M., l5-2O.Vingron,M. and Argos,P. (1989) Comput. Appl. Biosci.,5, 115-121.
Received on June 30, 1992: revised on October 14, 1992; accepted on October20. 1992
40