
University of Colorado at Colorado Springs i

DATA MINING OF ELECTROSTATIC INTERACTIONS

BETWEEN

AMINO ACIDS IN COILED-COIL PROTEINS

USING

THE STABLE COIL ALGORITHM

BY

ANKUR S. DESHMUKH

A project submitted to the Faculty of Graduate School of the

University of Colorado at Colorado Springs

in partial fulfillment of the

requirements for the degree of

Master of Science

Department of Computer Science

2008

This project for the Master of Science degree

by

Ankur Deshmukh

has been approved for the

Department of Computer Science by

Approved by Date

__________________________________Advisor: Dr. Jugal Kalita

__________________________________Committee member: Dr. Edward Chow

__________________________________Committee member: Dr. Robert Hodges

DATE__________


TABLE OF CONTENTS

CHAPTER 1
INTRODUCTION
CHAPTER 2
BACKGROUND RESEARCH
  2.1 BACKGROUND RESEARCH IN UNDERSTANDING COILED-COILS
    2.1.1 UNDERSTANDING PROTEIN STRUCTURE
      2.1.1.1. PRIMARY STRUCTURE
      2.1.1.2. SECONDARY STRUCTURE
      2.1.1.3. TERTIARY STRUCTURE
      2.1.1.4. QUATERNARY STRUCTURE
    2.1.2 COILED-COILS
  2.2 BACKGROUND RESEARCH IN UNDERSTANDING COILED-COIL PREDICTION ALGORITHMS
    2.2.1. COILS ALGORITHM
    2.2.2. PAIRCOILS ALGORITHM
    2.2.3. SOCKET ALGORITHM
    2.2.4. 2ZIP ALGORITHM - IDENTIFYING LEUCINE ZIPPERS
    2.2.5. STABLE INPUT ALGORITHM
CHAPTER 3
STABLE COIL ALGORITHM
  3.1 STABLE COIL ALGORITHM: PART I
  3.2 STABLE COIL ALGORITHM: PART II
  3.3 CLUSTER PATTERNS IN COILED-COILS
CHAPTER 4
PROJECT ARCHITECTURE
  4.1 DATABASE ARCHITECTURE
    4.1.1 STRUCTURE AND CONTENT OF THE TABLES INVOLVED IN PHASE 1
      4.1.1.1 PROTEIN TABLE
      4.1.1.2 COILED-COIL TABLE
      4.1.1.3 PROTEIN COIL TABLE
    4.1.2 STRUCTURE AND CONTENT OF THE TABLES INVOLVED IN PHASE 2
      4.1.2.1 SALT RESIDUES LOOKUP TABLE
      4.1.2.2 SALT BRIDGE TABLE
      4.1.2.3 COILED-COIL HEPTAD TABLE
      4.1.2.4 HEPTAD SALT TABLE
      4.1.2.5 SCRAPE FILE TABLE
    4.1.3 MATERIALIZED VIEWS
      4.1.3.1 AMINO ACID OCCURRENCES
      4.1.3.2 COIL LENGTH VS. CLUSTER PER COIL
  4.2 PERL CODE DESIGN
  4.3 WEBSITE ARCHITECTURE
    4.3.1 INDEX PAGE
    4.3.2 PROTEIN SEARCH PAGE
    4.3.3 COILED-COIL SEARCH PAGE
    4.3.4 COILED-COIL MOTIF SEARCHING
    4.3.5 COIL HEPTAD AND SALT BRIDGE SEARCH
    4.3.6 GENERATED REPORTS
CHAPTER 5
RESULTS
CHAPTER 6
CONCLUSION
CHAPTER 7
REFERENCES
APPENDIX A: CREATING MATERIALIZED VIEWS IN MYSQL
APPENDIX B: SQL QUERIES FOR CREATING MATERIALIZED VIEW
APPENDIX C: INSTALLATION AND PERFORMANCE OF THE PROJECT

LIST OF FIGURES

FIGURE 2-1: THE STRUCTURE OF PART OF A DNA DOUBLE HELIX. FIGURE OBTAINED FROM [23].
FIGURE 2-2: A GENERAL STRUCTURE OF α-AMINO ACID, WITH THE AMINO GROUP ON THE LEFT AND THE CARBOXYL GROUP ON THE RIGHT. FIGURE OBTAINED FROM [1].
FIGURE 2-3: A CONDENSATION REACTION BETWEEN TWO α-AMINO ACIDS RESULTING IN A PEPTIDE BOND. FIGURE OBTAINED FROM [1].
FIGURE 2-4: PHI AND PSI ANGLES
FIGURE 2-5: HYDROGEN BONDING BETWEEN AMINO ACIDS IN THE PROTEINS. FIGURE OBTAINED FROM [25].
FIGURE 2-6: A DEPICTION OF α-HELIX, MOST COMMONLY OCCURRING PROTEIN STRUCTURE IN COILED-COILS. FIGURE OBTAINED FROM [1].
FIGURE 2-7: A DEPICTION OF β-SHEET, IN ANTI-PARALLEL AND PARALLEL FORMATION. FIGURE OBTAINED FROM [2].
FIGURE 2-8: PROTEIN STRUCTURE, FROM PRIMARY TO QUATERNARY. FIGURE OBTAINED FROM [26].
FIGURE 2-9: CLASSIC EXAMPLE OF COILED-COIL GCN4 LEUCINE ZIPPER. FIGURE OBTAINED FROM [1].
FIGURE 2-10: POSITIONS OF AMINO ACIDS IN THE COILED-COIL. THE FIGURE HAS BEEN OBTAINED FROM [3].
FIGURE 2-11: CROSS-SECTIONAL VIEW OF A TWO-STRANDED COILED-COIL. HYDROPHOBIC AND ELECTROSTATIC INTERACTIONS BETWEEN TWO STRANDED α-HELICAL COILED-COILS FORMED BY THE HOMODIMERIZATION OF 35-RESIDUE POLYPEPTIDE CHAINS. ADAPTED FROM [7].
FIGURE 4-1: E-R DIAGRAM DETAILING THE RELATIONSHIP BETWEEN TBLPROTEIN AND TBLCOILEDCOIL
FIGURE 4-2: STRUCTURE OF PROTEIN TABLE (TBLPROTEIN)
FIGURE 4-3: STRUCTURE OF COILED-COIL TABLE (TBLCOILEDCOIL)
FIGURE 4-4: STRUCTURE OF PROTEIN COIL TABLE (TBLPROTEINCOIL)
FIGURE 4-5: E-R DIAGRAM DETAILING THE RELATIONSHIP BETWEEN TBLSALTBRIDGE AND TBLSPLITHEPTADCOILS
FIGURE 4-6: STRUCTURE OF SALT RESIDUES LOOKUP TABLE (TBLSALTRESIDUESLOOKUP)
FIGURE 4-7: STRUCTURE OF SALT BRIDGE TABLE (TBLSALTBRIDGE)
FIGURE 4-8: STRUCTURE OF COILED-COIL HEPTAD TABLE (TBLSPLITHEPTADCOIL)
FIGURE 4-9: STRUCTURE OF HEPTAD SALT BRIDGE TABLE (TBLHEPTADSALT)
FIGURE 4-10: STRUCTURE OF SCRAPE FILE TABLE (TBLSCRAPEFILE)
FIGURE 4-11: MATERIALIZED VIEW OF AMINO ACID OCCURRENCES (MATVIEW_AMINOACIDOCURRENCES)
FIGURE 4-12: MATERIALIZED VIEW OF COILED-COIL LENGTH VS. THE CLUSTER COUNT (MATVIEW_COILCLUSTERCOUNT)
FIGURE 4-13: PROCESS FLOW DIAGRAM FOR THE STABLE COIL ALGORITHM
FIGURE 4-14: INDEX PAGE OF THE STABLE COIL WEBSITE
FIGURE 4-15: PROTEIN RELATED SEARCH PAGE
FIGURE 4-16: COILED-COIL RELATED SEARCH PAGE
FIGURE 4-17: COILED-COIL MOTIF SEARCH WEB PAGE
FIGURE 4-18: COILED HEPTAD AND SALT BRIDGE SEARCH PAGE
FIGURE 5-1: COILED-COILS COUNT VS. COILED-COIL LENGTH
FIGURE 5-2: LOCATION OF OCCURRENCE OF AMINO ACID WITHIN THE COILED-COIL WHEN THE AMINO ACID IS AT HEPTAD OFFSET A
FIGURE 5-3: LOCATION OF OCCURRENCE OF AMINO ACID WITHIN THE COILED-COIL WHEN THE AMINO ACID IS AT HEPTAD OFFSET D
FIGURE 5-4: NORMALIZED VALUE OF DESTABILIZING CLUSTERS IN COILED-COILS OF PARTICULAR LENGTH. RESULTS OBTAINED BY DIVIDING THE TOTAL NUMBER OF COILED-COILS WITH DE-CLUSTERS BY THE TOTAL NUMBER OF DE-CLUSTERS
FIGURE 5-5: NORMALIZED VALUE OF STABILIZING CLUSTERS IN COILED-COILS OF PARTICULAR LENGTH. RESULTS OBTAINED BY DIVIDING THE TOTAL NUMBER OF COILED-COILS WITH CLUSTERS BY THE TOTAL NUMBER OF CLUSTERS
FIGURE 5-6: DISTRIBUTION OF DESTABILIZING CLUSTER OF LENGTH 3 WITH RESPECT TO THE COILED-COIL LENGTH


FIGURE 5-7: DISTRIBUTION OF DESTABILIZING CLUSTER OF LENGTH 4 WITH RESPECT TO THE COILED-COIL LENGTH
FIGURE 5-8: DISTRIBUTION OF DESTABILIZING CLUSTER OF LENGTH 5 WITH RESPECT TO THE COILED-COIL LENGTH
FIGURE 5-9: DISTRIBUTION OF DESTABILIZING CLUSTER OF LENGTH 6 WITH RESPECT TO THE COILED-COIL LENGTH

FIGURE 5-10: DISTRIBUTION OF DESTABILIZING CLUSTER OF LENGTH 7+ WITH RESPECT TO THE COILED-COIL LENGTH
FIGURE 5-11: DISTRIBUTION OF STABILIZING CLUSTER OF LENGTH 3 WITH RESPECT TO THE COILED-COIL LENGTH


FIGURE 5-12: DISTRIBUTION OF STABILIZING CLUSTER OF LENGTH 4 WITH RESPECT TO THE COILED-COIL LENGTH
FIGURE 5-13: DISTRIBUTION OF STABILIZING CLUSTER OF LENGTH 5 WITH RESPECT TO THE COILED-COIL LENGTH
FIGURE 5-14: DISTRIBUTION OF STABILIZING CLUSTER OF LENGTH 6 WITH RESPECT TO THE COILED-COIL LENGTH

FIGURE 5-15: DISTRIBUTION OF STABILIZING CLUSTER OF LENGTH 7+ WITH RESPECT TO THE COILED-COIL LENGTH
FIGURE 5-16: RELATIONSHIP OF AMINO ACIDS IN OFFSET A TO AN I TO I + 3 SALT BRIDGE
FIGURE 5-17: RELATIONSHIP OF AMINO ACIDS IN OFFSET A TO AN I TO I + 4 SALT BRIDGE
FIGURE 5-18: RELATIONSHIP OF AMINO ACIDS IN OFFSET A TO AN I TO I’ + 5 SALT BRIDGE
FIGURE 5-19: RELATIONSHIP OF AMINO ACIDS IN OFFSET D TO AN I TO I + 3 SALT BRIDGE
FIGURE 5-20: RELATIONSHIP OF AMINO ACIDS IN OFFSET D TO AN I TO I + 4 SALT BRIDGE
FIGURE 5-21: RELATIONSHIP OF AMINO ACIDS IN OFFSET D TO AN I TO I’ + 5 SALT BRIDGE

LIST OF TABLES

TABLE 2-1: TABLE OF STANDARD AMINO ACID ABBREVIATIONS AND SIDE CHAIN PROPERTIES
TABLE 3-1: HELICAL PROPENSITY AND STABILITY VALUES OF THE 20 STANDARD AMINO ACIDS AT VARIOUS POSITIONS IN THE HEPTAD
TABLE 3-2: COILED-COIL SEQUENCE STARTING AT OFFSET A
TABLE 3-3: COILED-COIL SEQUENCE STARTING AT OFFSET B
TABLE 3-4: AN AGGREGATION OF STABILITY VALUES 42 AMINO ACIDS AT A TIME
TABLE 3-5: DETERMINING THE PRESENCE OF A COILED-COIL IN THE PROTEIN SEQUENCE
TABLE 3-6: DETERMINING THE PRESENCE OF A CLUSTER (STABILIZING OR DE-STABILIZING) IN THE COILED-COIL SEQUENCE
TABLE 4-1: LIST OF SALT BRIDGES WHICH PROVIDE I TO I + 3, I TO I + 4 AND I TO I’ + 5 ELECTROSTATIC INTERACTIONS
TABLE 4-2: SEARCH PARAMETERS USED ON THE PROTEIN RELATED SEARCH PAGE
TABLE 4-3: SEARCH PARAMETERS USED ON THE COILED-COIL RELATED SEARCH PAGE
TABLE 4-4: SEARCH PARAMETERS USED ON THE COILED-COIL MOTIF SEARCH PAGE
TABLE 4-5: SEARCH PARAMETERS USED ON THE COIL HEPTAD AND SALT BRIDGE SEARCH PAGE
TABLE 5-1: TOP 10 AMINO ACID PAIRS OCCURRING IN HEPTAD OFFSETS A AND D WHICH FORM THE HYDROPHOBIC CORE
TABLE 5-2: TOP 10 AMINO ACID PAIRS OCCURRING IN HEPTAD OFFSETS D AND E
TABLE 5-3: TOP 10 AMINO ACID PAIRS OCCURRING IN HEPTAD OFFSETS G AND E USUALLY ASSOCIATED WITH ELECTROSTATIC ATTRACTION I TO I’ + 5
TABLE 5-4: TOP 10 AMINO ACID PAIRS OCCURRING IN HEPTAD OFFSETS ‘E’ AND G USUALLY ASSOCIATED WITH ELECTROSTATIC ATTRACTION I TO I’ + 2
TABLE 5-5: TOP 10 AMINO ACID PAIRS OCCURRING IN HEPTAD OFFSETS G AND A
TABLE 5-6: TOP 30 AMINO ACID PAIR OCCURRENCES IN COILED-COILS
TABLE 5-7: TOP 30 FREQUENTLY OCCURRING AMINO ACIDS IN THE STABLE COIL DATABASE

Chapter 1

INTRODUCTION

The sequencing of the human genome, as well as the genomes of many other species, has introduced a wide array of research fields within the discipline of molecular biology. One of the fastest growing fields is Proteomics, the study of proteins, protein structures, and the functions these proteins perform. Various research facilities are dedicated to this field of study, including the Peptide Chemistry Lab of Dr. Robert Hodges at the University of Colorado Health Sciences Center (UCHSC). The primary focus of this group of researchers is to understand the factors that affect the stability of proteins in general and the coiled-coil oligomerization domain in particular. These factors include, but are not limited to, the hydrophobic and hydrophilic interactions and the intrachain and interchain electrostatic interactions between the amino acids present in these coiled-coils. The ability to determine coiled-coil stability will greatly facilitate the prediction of coiled-coils in protein structures and will advance protein design. Because coiled-coils are the most commonly occurring oligomerization domain in nature, understanding the interactions within them can advance the study of proteomics as a whole.

The protein data available today is not only voluminous but also complex. To interpret results in a timely and inexpensive manner, it is necessary to create prediction algorithms that act as precursors to lab experiments. This project uses such an algorithm to explore two of the primary areas of interest being studied at UCHSC. First, this project uses an established prediction algorithm, the Stable Coil Algorithm, to determine the existence of coiled-coils programmatically, eliminating the need for human intervention. The second part of the project focuses on identifying the kinds of interactions that occur within coiled-coils. Researchers have proposed that hydrophobic and electrostatic interactions are the primary forces that contribute to the stability of the coiled-coil; hence, it is necessary to gather as much information as possible about these forces. Important areas of study include efforts to determine how the location of an amino acid in the heptad sequence affects coiled-coil stability and which amino acids hinder or aid that stability. To accomplish these goals, this project presents researchers with tools to study coiled-coil stability efficiently. These tools are built around a revamped Stable Coil database.

The first version of the Stable Coil database used the Stable Coil Algorithm to predict the presence of coiled-coils [4]. However, the database had become corrupted. Furthermore, it did not recognize updates to the raw data available on the ExPASy¹ server. These issues, combined with poor query performance and the absence of error logging, made it necessary for this project to recreate the Stable Coil database as well as the Perl programs involved in data collection. The resulting database allows researchers to designate new sources of raw data for collection; the associated Perl programs then process the new data automatically. In addition to redesigning the original database, this project provides additional tables to facilitate research into the various factors affecting coiled-coil stability. The database is freely available via the Stable Coil web interface at http://simbio.uchsc.edu/StableCoil.

This website provides users with three basic search functionalities that can help them understand the various aspects of coiled-coil formations within proteins and the electrostatic interactions within those coiled-coils. A fourth functionality allows for complex motif searching within coiled-coils, thus providing users with information on the types and frequency of amino acid residues occurring within coiled-coils. In addition, the website presents users with fourteen unique reports, each of which provides a different insight into the realm of coiled-coils. These reports return results ranging from cluster distribution across coiled-coil lengths to lists of amino acids frequently found in and around coiled-coils. We hope this project will provide researchers with an efficient way to determine the presence of coiled-coils in proteins and act as a learning tool to help users understand the varying complexities of coiled-coil structures in proteins.

¹ The ExPASy (Expert Protein Analysis System) proteomics server of the Swiss Institute of Bioinformatics (SIB), located at http://expasy.org/, is dedicated to the analysis of protein sequences and structures.



Chapter 2

BACKGROUND RESEARCH

2.1 BACKGROUND RESEARCH IN UNDERSTANDING COILED-COILS

Deoxyribonucleic acid (DNA) contains the genetic instructions used in the development and functioning of all known living organisms. Because the main role of DNA is long-term storage of information, it is often likened to a repository of blueprints used to construct cell components such as proteins and RNA molecules. DNA is a double helix consisting of two long polymers of simple units called nucleotides, with backbones made of sugars and phosphate groups joined by ester bonds. The two strands run in opposite directions to each other and are therefore called anti-parallel. Attached to each sugar is one of four molecules known as bases, which encode the genetic information. This information is interpreted using the genetic code, which specifies the sequence of amino acids within a protein.

Figure 2-1: The structure of part of a DNA double helix. Figure obtained from [23].



The process by which genetic information is decoded from the DNA and converted to a protein is known as Protein Biosynthesis. Protein Biosynthesis is a multi-step process consisting of two major steps: Transcription and Translation.

Transcription is the process of synthesizing RNA under the direction of DNA. Both nucleic acid sequences use the same language, and the information is simply transcribed, or copied, from one molecule to the other. The DNA sequence is enzymatically copied by RNA polymerase to produce a complementary RNA strand, called messenger RNA (mRNA), which then carries the genetic message from the DNA to the protein-synthesizing machinery of the cell.

During translation, the mRNA sequence serves as a template for assembling a chain of amino acids into a protein. The mRNA is decoded according to the genetic code, in which each group of three nucleotides (a codon) specifies one amino acid. A transfer RNA (tRNA), which is a small RNA molecule, delivers the specified amino acid to the growing polypeptide chain at the ribosome, the site of protein synthesis.
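The two steps described above can be illustrated with a minimal Python sketch. It is not part of the project's Perl code; the function names are hypothetical, and the codon table is deliberately partial (the standard genetic code has 64 codons), so any codon outside this subset is reported as `X`, the code for an unknown residue.

```python
# Partial standard genetic code, keyed by mRNA codon (illustrative subset only).
CODON_TABLE = {
    "AUG": "M",                                       # methionine, usual start codon
    "UUU": "F", "UUC": "F",                           # phenylalanine
    "UUA": "L", "UUG": "L",
    "CUU": "L", "CUC": "L", "CUA": "L", "CUG": "L",   # leucine
    "GCU": "A", "GCC": "A", "GCA": "A", "GCG": "A",   # alanine
    "GGU": "G", "GGC": "G", "GGA": "G", "GGG": "G",   # glycine
    "AAA": "K", "AAG": "K",                           # lysine
    "GAA": "E", "GAG": "E",                           # glutamic acid
    "UAA": "*", "UAG": "*", "UGA": "*",               # stop codons
}


def transcribe_coding_strand(dna):
    """Return the mRNA matching a DNA coding strand (T is replaced by U).

    The mRNA is complementary to the template strand, which makes it
    identical to the coding strand apart from the T -> U substitution.
    """
    return dna.upper().replace("T", "U")


def translate(mrna):
    """Translate mRNA from the first AUG up to the first stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, nothing to translate
    residues = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "X")  # X = not in subset
        if amino_acid == "*":
            break
        residues.append(amino_acid)
    return "".join(residues)


if __name__ == "__main__":
    mrna = transcribe_coding_strand("ATGCTGGAAAAGTAA")
    print(mrna, "->", translate(mrna))
```

The output is the kind of one-letter primary sequence that protein databases store and that coiled-coil prediction algorithms take as input.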

During and after protein synthesis, amino acid chains often fold to assume the tertiary and quaternary structures commonly associated with proteins. This process is known as protein folding. Many proteins also undergo post-translational modifications, which extend the range of a protein’s functions by attaching other biochemical functional groups to it or by forming disulfide bridges.

The various types of protein structures and sub-structures formed play important roles in how a protein will function. In order to understand the functions a protein performs at a molecular level, it is necessary to understand its three-dimensional structure. This constitutes the field of Proteomics. Researchers employ techniques such as X-ray crystallography or NMR spectroscopy to determine the structure of proteins.

2.1.1 UNDERSTANDING PROTEIN STRUCTURE

Proteins are an important class of macromolecules present in all biological organisms. All proteins are polymers of the 20 standard α-amino acids listed in Table 2-1. To perform their biological functions, proteins fold into one or more specific spatial conformations, driven by a number of non-covalent interactions such as hydrogen bonding, ionic interactions, van der Waals forces and hydrophobic packing. In order to decipher protein folding, researchers require a fundamental understanding of the stability contributions of non-covalent stabilizing and destabilizing interactions. These interactions not only guide the initial hydrophobic collapse of a protein in an aqueous environment, but also provide the basis on which a protein assumes its overall structure. In biochemistry, protein structure is classified into four levels of hierarchy. These levels range from a single linear arrangement of amino acids to complex aggregate structures, and are described in detail below.

2.1.1.1. PRIMARY STRUCTURE

A protein’s primary structure is its linear sequence of amino acids. Most protein databases represent a protein by this linear sequence, listing the amino acids that constitute each protein. The sequence of amino acids is unique to the protein and defines its structure and function. In its elemental form, each amino acid in a protein is a four-part molecule starting with an amine group (-NH2), also known as the N-terminus, and ending with a carboxylate group (-COOH), known as the C-terminus. Between these termini lies the α-carbon atom (Cα), which is bonded to an R-group and a hydrogen atom. Counting of residues always starts at the N-terminal end, which is the end where the amino group is not involved in a peptide bond.



Figure 2-2: A general structure of α-amino acid, with the amino group on the left and the carboxyl group on the right. Figure obtained from [1].

Two amino acids are joined by a peptide bond in a condensation reaction. Repeating this process over multiple amino acids generates long chains. During translation, this reaction is catalyzed by ribosomes. During the formation of the peptide bond, the OH of the carboxylate group of the first amino acid combines with an H of the amine group of the second amino acid to form water. Once the bond is formed, the two joined amino acids have only one free amine group, or N-terminus, and one free carboxylate group, or C-terminus, as shown in Figure 2-3.

Figure 2-3: A condensation reaction between two α-amino acids resulting in a peptide bond. Figure obtained from [1].
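As a worked illustration of the condensation reaction, the following Python sketch counts the water molecules released and estimates the mass of the resulting peptide. It is an illustration only: the function names are hypothetical, the masses are approximate average molar masses computed from standard atomic weights, and only glycine and alanine are included.

```python
# Approximate average molar masses in g/mol (illustrative subset only).
WATER_MASS = 18.015
FREE_AMINO_ACID_MASS = {
    "G": 75.067,  # glycine, C2H5NO2
    "A": 89.094,  # alanine, C3H7NO2
}


def peptide_bond_count(residue_count):
    """A linear chain of n amino acids contains n - 1 peptide bonds."""
    return max(residue_count - 1, 0)


def peptide_mass(sequence):
    """Sum the free amino acid masses, then subtract one water per peptide bond."""
    total = sum(FREE_AMINO_ACID_MASS[residue] for residue in sequence)
    return total - peptide_bond_count(len(sequence)) * WATER_MASS


if __name__ == "__main__":
    print(peptide_bond_count(2), "peptide bond in the dipeptide GA")
    print(round(peptide_mass("GA"), 3), "g/mol")
```

For the dipeptide glycine-alanine, one water molecule is released, so the estimate is 75.067 + 89.094 - 18.015, or about 146.15 g/mol.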



The three-dimensional structure of the protein backbone is largely controlled by the dihedral angles around each Cα atom. The phi angle (φ) is the dihedral angle of rotation about the bond between the amide nitrogen and the Cα atom. The psi angle (ψ) is the dihedral angle of rotation about the bond between the Cα atom and the carbonyl carbon of the same residue. Figure 2-4 illustrates the phi and psi angles.

Figure 2-4: Phi and Psi angles
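A dihedral angle is defined by four consecutive atoms. For residue i, φ is conventionally measured over the atoms C(i-1), N(i), Cα(i), C(i), and ψ over N(i), Cα(i), C(i), N(i+1). The Python sketch below computes a dihedral angle from four Cartesian coordinates using the standard atan2 formulation. The function name and the test coordinates are hypothetical and do not come from a real structure.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral_degrees(p0, p1, p2, p3):
    """Signed dihedral angle, in degrees, defined by four points.

    The angle is measured between the plane through p0, p1, p2 and the
    plane through p1, p2, p3, around the central bond p1 -> p2.
    """
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal to the first plane
    n2 = _cross(b2, b3)  # normal to the second plane
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


if __name__ == "__main__":
    # Four points in one plane with the end atoms on opposite sides (trans).
    print(dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 1, 0), (-1, 1, 0)))
```

With the end points on the same side of the central bond (cis) the function returns 0 degrees, and with the end points on opposite sides (trans) the magnitude is 180 degrees.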

The R group in the amino acid is called the side chain. A side chain can vary from a single hydrogen atom in glycine through a methyl group in alanine to a large heterocyclic group in tryptophan [1]. The type and number of side chains in a protein influence its structure. The side chain determines whether the amino acid will be hydrophobic or hydrophilic, polar or non-polar. Table 2-1 lists the 20 standard amino acids and their relative polarity.

Amino Acid Name   One Letter Code   Three Letter Code   Category
Alanine           A                 Ala                 Non-polar Amino Acids (hydrophobic)
Cysteine          C                 Cys                 Polar Amino Acids (hydrophilic)
Aspartic Acid     D                 Asp                 Electrically Charged (negative and hydrophilic)
Glutamic Acid     E                 Glu                 Electrically Charged (negative and hydrophilic)
Phenylalanine     F                 Phe                 Non-polar Amino Acids (hydrophobic)
Glycine           G                 Gly                 Non-polar Amino Acids (hydrophobic)
Histidine         H                 His                 Electrically Charged (positive and hydrophilic)
Isoleucine        I                 Ile                 Non-polar Amino Acids (hydrophobic)
Lysine            K                 Lys                 Electrically Charged (positive and hydrophilic)
Leucine           L                 Leu                 Non-polar Amino Acids (hydrophobic)
Methionine        M                 Met                 Non-polar Amino Acids (hydrophobic)
Asparagine        N                 Asn                 Polar Amino Acids (hydrophilic)
Proline           P                 Pro                 Non-polar Amino Acids (hydrophobic)
Glutamine         Q                 Gln                 Polar Amino Acids (hydrophilic)
Arginine          R                 Arg                 Electrically Charged (positive and hydrophilic)
Serine            S                 Ser                 Polar Amino Acids (hydrophilic)
Threonine         T                 Thr                 Polar Amino Acids (hydrophilic)
Valine            V                 Val                 Non-polar Amino Acids (hydrophobic)
Tryptophan        W                 Trp                 Non-polar Amino Acids (hydrophobic)
Tyrosine          Y                 Tyr                 Polar Amino Acids (hydrophilic)
Unknown           X                 UNK                 Unknown Protein

Table 2-1: Table of standard amino acid abbreviations and side chain properties
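As an illustration only, the classification in Table 2-1 can be held in a small lookup structure. The sketch below is not part of the project code; the names `AMINO_ACIDS`, `side_chain_category`, and `category_profile` are hypothetical, and the category labels are shortened forms of the table entries.

```python
# One-letter code -> (three-letter code, category), transcribed from Table 2-1.
# The short category labels are an assumption made for this sketch.
AMINO_ACIDS = {
    "A": ("Ala", "non-polar"), "C": ("Cys", "polar"),
    "D": ("Asp", "negative"),  "E": ("Glu", "negative"),
    "F": ("Phe", "non-polar"), "G": ("Gly", "non-polar"),
    "H": ("His", "positive"),  "I": ("Ile", "non-polar"),
    "K": ("Lys", "positive"),  "L": ("Leu", "non-polar"),
    "M": ("Met", "non-polar"), "N": ("Asn", "polar"),
    "P": ("Pro", "non-polar"), "Q": ("Gln", "polar"),
    "R": ("Arg", "positive"),  "S": ("Ser", "polar"),
    "T": ("Thr", "polar"),     "V": ("Val", "non-polar"),
    "W": ("Trp", "non-polar"), "Y": ("Tyr", "polar"),
}


def side_chain_category(code):
    """Return the Table 2-1 category for a one-letter code ('unknown' if absent)."""
    return AMINO_ACIDS.get(code.upper(), ("UNK", "unknown"))[1]


def category_profile(sequence):
    """Map every residue of a sequence to its side chain category."""
    return [side_chain_category(residue) for residue in sequence]
```

For example, `category_profile("MDYL")` yields one category per residue, which makes it easy to see where charged or hydrophobic residues fall in a sequence.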

Hydrophobic amino acids are repelled by water and tend to be non-polar. They do not form hydrogen bonds with ionic groups. Water is electrically polarized and can therefore form hydrogen bonds with itself. Because hydrophobic amino acids are not electrically polarized, water excludes them in favor of bonding with itself. This is true for all polar solvents, and it is this effect that causes the hydrophobic interaction. To prevent destabilization, hydrophobic amino acids tend to be buried in the center of the protein, away from the surrounding aqueous solution. For similar reasons, hydrophilic amino acids occur on the protein surface. The hydrophilic residues can be polar or electrically charged. The charge is positive if the side chain is basic and negative if the side chain is acidic. The bonds formed between oppositely charged side chains are also known as ionic bonds. The distribution of hydrophobic and hydrophilic amino acids in the protein determines the tertiary structure of the protein, and their location on the outside of the protein influences the quaternary structure by reducing the collective surface area and therefore the amount of water that can influence the protein structure.

Besides these amino acid characteristics, there are electrostatic interactions, arising from van der Waals forces and hydrogen bonding, that help determine the protein structure. Van der Waals forces are the attractive and repulsive forces between atoms, molecules and surfaces. They differ from covalent or ionic bonds in that they are caused by the fluctuating polarizations of nearby particles. Hydrogen bonding is another intermolecular force that affects protein structure, characterized by the presence of a hydrogen atom in the intermolecular bond. This hydrogen is covalently bound to one molecule, the proton donor, and attracted to the other, the proton acceptor. Figure 2-5 depicts hydrogen bond formation in a water dimer.

Figure 2-5: Hydrogen bonding between amino acids in the proteins. Figure obtained from [25]

In this figure, the water molecule on the right is the proton donor while the water molecule on the left is the proton acceptor. The donated hydrogen is covalently bonded to an electronegative atom, oxygen in this case. The result of this bonding is a dimer with relatively large dipole-dipole forces. In a protein, hydrogen bonding interactions contribute to the secondary structure.

2.1.1.2. SECONDARY STRUCTURE

Due to the interactions between the chemical groups in amino acids, mediated by hydrogen bonds, a few characteristic patterns occur within folded proteins. These recurring shapes describe the secondary structure of a protein. Their repeated occurrence renders a protein stable. In 1983, Kabsch and Sander [8] compiled a listing of the secondary structures found in proteins with a known 3D structure. The DSSP (Dictionary of Protein Secondary Structure) code they proposed is frequently used to describe secondary protein structures with single letter codes. The most commonly occurring secondary structures in a protein include, but are not limited to, the α-helix, the β-sheet and the β-turn.

The α-helix is the most commonly occurring secondary structure in a protein. It is a right-handed coil resembling a spring, in which every backbone N-H group donates a hydrogen bond to the backbone C=O group of the amino acid four residues earlier (i + 4 to i hydrogen bonding). Each amino acid corresponds to a 100° turn in the helix, which means that the α-helix has 3.6 residues per turn. For example, a helix 36 amino acids long forms 10 turns. The α-helix is tightly packed, leaving almost no free space inside the helix. The amino acid side chains are on the outside of the helix, pointing roughly downwards.

Figure 2-6: A depiction of α-helix, the most commonly occurring protein structure in coiled-coils. Figure obtained from [1].
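The helix arithmetic above (100° per residue, hence 3.6 residues per turn) can be checked with a few lines of Python. This is only a worked illustration; the function names are hypothetical.

```python
DEGREES_PER_RESIDUE = 100  # rotation contributed by each residue of an α-helix


def residues_per_turn():
    """Residues needed to complete one full 360° turn."""
    return 360 / DEGREES_PER_RESIDUE


def helix_turns(n_residues):
    """Number of turns made by a helix that is n_residues long."""
    return n_residues * DEGREES_PER_RESIDUE / 360
```

With these definitions, `residues_per_turn()` gives 3.6 and `helix_turns(36)` gives 10.0, matching the example in the text.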

The β-sheet is another form of protein secondary structure, formed by β-strands connected laterally by 3 or more hydrogen bonds into a generally twisted, pleated sheet [2]. In other words, a β-sheet is an extended conformation of amino acids in a zig-zag manner. In a β-sheet, hydrogen bonding occurs between the C=O and N-H groups of two or more β-strands. This is in contrast to the α-helix, where all hydrogen bonds involve the same element of the secondary structure. These hydrogen bonds can occur among adjacent β-strands in anti-parallel, parallel, or mixed arrangements. In an anti-parallel arrangement, successive β-strands run in opposite directions; thus, the C-terminus of one β-strand is adjacent to the N-terminus of the next β-strand. In a parallel arrangement, all the N-termini of the strands are oriented in the same direction. An individual strand may also exhibit a mixed hydrogen bonding pattern, with a parallel strand on one side and an anti-parallel strand on the other side. These structures are depicted in Figure 2-7.

Figure 2-7: A depiction of β-sheet, in anti-parallel and parallel formation. Figure obtained from [2].

The third type of secondary structure, the β-turn, is characterized by hydrogen bonds in which the acceptor, meaning the main chain carbonyl oxygen (C=O), and the donor, meaning the main chain amine group (N-H), are separated by three residues (i to i + 3 hydrogen bonding). Turns are important secondary structures in proteins and occur abundantly on the surface of the protein molecule. They are distinguished by the hydrogen bonding in the i, i + 1, i + 2, and i + 3 residues. Helical regions are excluded from this definition, while turns between β-strands form a special class of turns known as the hairpin [10]. A β-hairpin connects two hydrogen-bonded anti-parallel β-strands. Turns can also connect two regular secondary structure elements that do not interact, forming what are known as diverging turns.

Amino acids vary in their ability to form secondary structures. Proline and Glycine, which are known as helix breakers, have unusual conformational properties and are commonly found in turns. The amino acids that most commonly adopt helical conformations include Methionine, Alanine, Leucine, Glutamate, and Lysine. The bulkier amino acids, in contrast, prefer to adopt a β-sheet.

2.1.1.3. TERTIARY STRUCTURE

The tertiary structure is the three-dimensional arrangement of a protein, arising from the variety of amino acid side chains it contains. It is largely determined by the sequence of amino acids in the protein and the interactions that occur among their side chains. As a result of these side chain interactions, the protein may have a number of folds, bends, and loops, thus assuming its final three-dimensional structure.

There are four types of side chain bonding interactions: disulfide bonds, hydrogen bonding, salt bridges, and non-polar hydrophobic bonding. Disulfide bonds are the only covalent bonds among these and are formed by oxidation of the sulfhydryl groups on Cysteine (C). Hydrogen bonding between side chains occurs mainly between two alcohols, between an alcohol and an acid, or between two acids. Salt bridges are ionic interactions resulting from the neutralization of an acid and an amine on the side chains. Any combination of acid and amine groups in the side chains can have this interaction. Salt bridges contribute towards the strengthening of the helix. Hydrophobic interactions are the most important factors contributing to the stability of the protein. As discussed under primary structure, these interactions follow the simple solubility rule that like dissolves like. The hydrophobic components are excluded by water or any polar solvent and in turn associate strongly with other hydrophobic elements. In many cases this causes the hydrophobic side chains to be buried in the center of the protein and the hydrophilic residues to be exposed on the surface of the protein.

2.1.1.4. QUATERNARY STRUCTURE

Many large proteins consist of multiple polypeptide chains, sometimes known as protein subunits. In addition to the tertiary structure of these subunits, these large proteins also possess a quaternary structure, which describes how the subunits are arranged relative to one another. Common examples of proteins with quaternary structure are hemoglobin and DNA polymerase. Changes in the quaternary structure can occur through conformational changes in the underlying subunits or through changes in the orientation of the subunits relative to each other. The forces that affect the tertiary structure of the protein also affect the quaternary structure. The different protein structures discussed above are pictorially represented in Figure 2-8.

Figure 2-8: Protein structure, from primary to quaternary. Figure obtained from [26].

Now that we understand the different levels of hierarchy in the structural formation of a protein, we can better understand coiled-coils, which form the basis of this Master’s Project. The Stable Coil database is built for predicting which α-helical sequences are able to form α-helical coiled-coil motifs, and here we take a deeper look into coiled-coils and the importance of studying them.

2.1.2 COILED-COILS

Many proteins are involved in important biological functions. Kinesin is a protein that transports cellular components within cells, while myosin is a fundamental protein used in muscle contraction. Both of these proteins perform these functions thanks to the ability of the coiled-coil to uncoil, allowing the unattached heads to move.

A coiled-coil is a structural motif in which two or more α-helices are coiled together like strands of a rope. α-helical structures are abundant in proteins. This project focuses on what is perhaps one of the most commonly occurring dimerization motifs in nature, the two-stranded α-helical coiled-coil. This structure consists of two amphipathic, right-handed α-helices that adopt a left-handed supercoil analogous to a two-stranded rope, where the non-polar face of the first α-helix is continually adjacent to that of the other helix [16], as shown in Figure 2-9.

Figure 2-9: Classic example of Coiled-coil GCN4 leucine zipper. Figure obtained from [1].

The two-stranded coiled-coil is an ideal model for coiled-coil studies because of its rod-like structure, which makes protein folding a one-dimensional problem, thereby removing much of the complexity found in globular proteins. Coiled-coils are characterized by hydrophobic amino acids recurring alternately every third and fourth residue within their sequence. They are distinguished by a heptad repeat, written abcdefg, where positions a and d hold the hydrophobic amino acids responsible for the formation and stability of the coiled-coil. Shown below is an example of a coiled-coil alongside its heptad repeat.

Figure 2-10: Positions of amino acids in the coiled-coil. Figure obtained from [3].

The hydrophobic residues occur at positions a, d, a’, and d’ and are indicated in red. Because the a and d positions are spaced alternately three and four residues apart, a hydrophobic residue recurs on average every 3.5 residues along the chain; thus each heptad corresponds to two turns of the helix, as indicated in Figure 2-10. The hydrophobic residues are buried in the center, away from the surrounding aqueous solution, while the hydrophilic residues are exposed on the surface. These hydrophobic interactions provide stability to the coiled-coil by aiding in inter- and intra-helical interactions.
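The heptad bookkeeping described here can be sketched in a few lines. The example below is illustrative only; `heptad_register` and `core_indices` are hypothetical helper names. It labels each residue index with a heptad position and then shows that the a and d positions are separated alternately by three and four residues.

```python
HEPTAD = "abcdefg"


def heptad_register(length, start="a"):
    """Label `length` consecutive residues with heptad positions, beginning at `start`."""
    first = HEPTAD.index(start)
    return "".join(HEPTAD[(first + k) % 7] for k in range(length))


def core_indices(register):
    """Indices of the residues that sit at the hydrophobic a and d positions."""
    return [i for i, position in enumerate(register) if position in "ad"]


register = heptad_register(14)                    # two heptads starting at a
core = core_indices(register)                     # a and d positions
gaps = [b - a for a, b in zip(core, core[1:])]    # spacing between core residues
```

For two heptads starting at position a, the core indices are 0, 3, 7 and 10, so the gaps alternate 3, 4, 3, which averages to the 3.5-residue periodicity mentioned above.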

Various researchers over the years, including [7] [12] [18], have shown that not only the hydrophobic heptad repeats but also the inter-helical and intra-helical electrostatic interactions between amino acids contribute to the formation and the stability of the coiled-coil structure. A schematic representation of two-stranded, α-helical coiled-coils, with all the hydrophobic and electrostatic interactions, is shown below in Figure 2-11.

Figure 2-11: Cross-sectional view of a two-stranded coiled-coil. Hydrophobic and electrostatic interactions between two-stranded α-helical coiled-coils formed by the homodimerization of 35-residue polypeptide chains. Adapted from [7].

Figure 2-11 uses the letters a to g and a’ to g’ to designate the positions of the heptad repeat. As discussed earlier, the hydrophobic residues interact at a and a’ and at d and d’, indicated by open arrows. Electrostatic interactions can occur between b and e (b’ and e’), indicating intrachain i to i + 3 interactions; between e and b (e’ and b’), indicating intrachain i to i + 4 interactions (dashed arrows); or between g and e’ (g’ and e), indicating interchain i to i’ + 5 interactions (solid arrows) [7]. These interactions can be attractions between the amino acid residues at these positions (salt bridges), or they can be repulsions, which respectively add to or subtract from the overall stability of the coiled-coil.
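The three interaction types listed above can be enumerated mechanically once a heptad register is assigned. The following is a minimal sketch, not the project's implementation. It assumes a parallel, in-register homodimer, so that residue i’ + 5 on the partner chain is the same amino acid as residue i + 5 of the chain itself. The charge assignments follow Table 2-1 (Asp and Glu negative; Lys, Arg and His positive), and all function names are hypothetical.

```python
HEPTAD = "abcdefg"
# Side chain charges as classified in Table 2-1 (an assumption of this sketch).
CHARGE = {"D": -1, "E": -1, "K": +1, "R": +1, "H": +1}


def classify(res1, res2):
    """Label a residue pair as an attraction, a repulsion, or neutral."""
    product = CHARGE.get(res1, 0) * CHARGE.get(res2, 0)
    if product < 0:
        return "attraction"
    if product > 0:
        return "repulsion"
    return "neutral"


def electrostatic_pairs(chain, start="a"):
    """List (kind, i, j, label) for the b-e, e-b and g-e' pairs of one chain."""
    first = HEPTAD.index(start)
    n = len(chain)
    pairs = []
    for i in range(n):
        position = HEPTAD[(first + i) % 7]
        if position == "b" and i + 3 < n:    # b to e, same chain
            pairs.append(("intrachain i to i+3", i, i + 3,
                          classify(chain[i], chain[i + 3])))
        elif position == "e" and i + 4 < n:  # e to b of the next heptad, same chain
            pairs.append(("intrachain i to i+4", i, i + 4,
                          classify(chain[i], chain[i + 4])))
        elif position == "g" and i + 5 < n:  # g to e' on the partner chain
            pairs.append(("interchain i to i'+5", i, i + 5,
                          classify(chain[i], chain[i + 5])))
    return pairs
```

Counting the attractions and repulsions returned for each coiled-coil is one simple way to summarize its electrostatic contribution; a heterodimer would need the partner sequence passed in separately.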

Coiled-coil prediction is an important goal pursued in bioinformatics and theoretical chemistry. Its aim is the prediction of the three-dimensional structure of proteins from their amino acid sequences, sometimes including additional relevant information such as the structures of related proteins. In other words, the goal is to predict a protein's tertiary/quaternary structure from its primary structure. A number of algorithms have been created to predict coiled-coils. Most current algorithms use a statistical approach, in which they compare newly discovered proteins to existing ones and determine the probability of a coiled-coil being present. The University of Colorado at Colorado Springs, in conjunction with the Department of Biochemistry and Molecular Genetics at the University of Colorado Health Sciences Center, has built a protein database that lists proteins that contain coiled-coil motifs and their stability clusters as determined by the Stable Coil Algorithm [3][4]. This algorithm is based on the stability of the structure determined by the amino acids present in the protein sequence and the structural position of those amino acids. The algorithms mentioned above are described in detail in the following section.

2.2 BACKGROUND RESEARCH IN UNDERSTANDING COILED-COIL PREDICTION ALGORITHMS

From previous discussions, it can be concluded that it is not only essential to determine the existence of coiled-coils in proteins; it is also essential to determine how the stability of the coiled-coil can be affected by certain amino acid residues at certain positions along the coiled-coil. Traditionally, the three dimensional structure of a coiled-coil has been determined by X-ray crystallography and NMR spectroscopy. Not only are these methods very expensive but they can also be very time consuming. Furthermore, it is highly improbable for a single group of researchers to apply these methods to all the naturally occurring proteins known to man. A better way to approach this problem is through the use of predictive algorithms, which provide the researchers with answers to questions like: Which proteins are more likely to contain coiled-coils? Which proteins are more/less stable due to the presence/absence of a hydrophobic residue or an electrostatic attraction/repulsion, etc.?

Protein structure analysis was thus born out of the desire to determine protein characteristics without conducting laboratory experiments or using crystallography. Processes based on protein statistics and past experiments were generalized to create methods and algorithms that provide insights into a given protein’s structure and/or stability. This chapter describes some of the predictive algorithms created to catalog coiled-coils present in proteins.

2.2.1. COILS ALGORITHM

The COILS Algorithm [21], an enhanced version of the Lupas Algorithm [20], was developed by Andrei Lupas in 1996. COILS is a program that compares a given amino acid sequence to a database of known parallel two-stranded coiled-coils. The comparison yields a similarity score, which is then compared with the distribution of scores in coiled-coil proteins. From this, the program calculates the probability that a sequence will adopt a coiled-coil conformation.

The similarity scores are calculated by comparing against two different matrices:

MTK is a matrix derived from the sequences of myosins, tropomyosins and keratins (intermediate filaments type I and II).

MTIDK is a new matrix derived from myosins, paramyosins, tropomyosins, intermediate filaments types I - V, desmosomal proteins, and kinesins, calculated by weighting the residue frequencies of different protein families.

Although using the MTIDK matrix results in a 20-30% drop in the generation of false-positives in the prediction algorithm, the results are still biased towards hydrophobic, hydrophilic charged residues. The program produces a fair amount of statistical noise as the window width decreases.

2.2.2. PAIRCOILS ALGORITHM

PAIRCOILS [15] classifies coiled-coils using a statistical approach and utilizes a matrix similar to the MTIDK matrix used in the COILS Algorithm. The matrix used in PAIRCOILS contains all known coiled-coil sequences, extracted from the GENpept database [13]. Instead of comparing the entire sequence to the database, as is the case with COILS, PAIRCOILS determines conditional probabilities that two amino acids are found in any two heptad positions. These frequencies are then normalized and used to determine the probability that a certain pair of amino acids appears at a given heptad repeat. The probability cut off determines how stringently the data will be scrutinized in detecting the existence of a coiled-coil domain.

Although the PAIRCOILS Algorithm successfully predicts coiled-coils and reduces the number of false positives by using a scoring method based on “pairwise probabilities”, it is marred by the same problem as COILS: a large amount of statistical noise is present in the data as the probability cut off increases.

The PAIRCOILS Algorithm was extended to become what is known as the MULTICOIL Algorithm [22], in order to identify three-stranded coiled-coils as well. The accuracy of the statistical results was limited, as the MULTICOIL Algorithm was run against only a small subset of proteins.

2.2.3. SOCKET ALGORITHM

The SOCKET [19] program finds the Knobs-into-Holes mode of packing between alpha-helices which is characteristic of coiled-coils. It unambiguously defines the beginning and end of coiled-coil motifs in protein structures and assigns a heptad register to the sequence.

Specifically, the purposes of SOCKET are:

To objectively and unambiguously define the location of a coiled-coil motif in a protein structure, so that its sequence can be used to test new coiled-coil prediction algorithms and benchmark existing ones.

To automatically collect statistics on frequencies of amino acids at each of the heptad positions (abcdefg) of the sequence/structure motif. Such data are useful for training computer programs that predict coiled-coils from primary structure, and for providing insights into new design rules.

To highlight unusual assemblies of alpha-helices that go beyond the traditional coiled-coil. Again, it is hoped that design principles founded on knobs-into-holes packing between alpha-helices will enable the creation of novel and useful protein assemblies.

2.2.4. 2ZIP ALGORITHM - IDENTIFYING LEUCINE ZIPPERS

In order to implement the 2ZIP Algorithm, the TRESPASSER Algorithm is first used to extract from the SWISS-PROT [10] database only those sequences that contain annotated leucine zippers, leucine-like zippers, and non-leucine zippers. TRESPASSER is the algorithm of choice for this extraction, as it has been reported to predict leucine zippers with high reliability. Once this extraction is complete, the 2ZIP Algorithm [6] is used to distinguish the two general classes of leucine zipper, strict and relaxed. The strict zipper is distinguished by the occurrence of at least five leucine residues at four heptad repeats. A relaxed zipper occurs where, in any of the five positions, Leu is replaced with Met, Val or Ile.
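To make the strict and relaxed definitions concrete, the sketch below expresses them as regular expressions. It is an interpretation rather than the 2ZIP implementation: it reads "five leucine residues at four heptad repeats" as five leucines each spaced exactly seven residues apart, and it treats a zipper as relaxed when any of those five positions holds Met, Val or Ile instead of Leu. The names `STRICT`, `RELAXED` and `zipper_class` are hypothetical.

```python
import re

# Five leucines, each followed seven residues later by the next one
# (assumed reading of "five leucine residues at four heptad repeats").
STRICT = re.compile(r"L.{6}L.{6}L.{6}L.{6}L")

# Same spacing, but every position may hold Leu, Met, Val or Ile.
RELAXED = re.compile(r"[LMVI].{6}[LMVI].{6}[LMVI].{6}[LMVI].{6}[LMVI]")


def zipper_class(sequence):
    """Return 'strict', 'relaxed', or None for a one-letter protein sequence."""
    if STRICT.search(sequence):
        return "strict"
    if RELAXED.search(sequence):
        return "relaxed"
    return None
```

The strict pattern is tested first because every strict zipper also satisfies the relaxed pattern. A pattern-only test like this says nothing about helical propensity, which is one reason such searches can produce false positives.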

The results of the 2ZIP Algorithm show that the annotated proteins in the SWISS-PROT database do not really follow either the strict or the relaxed definition of the leucine zipper as had been hypothesized, since many false positives are generated. However, the algorithm does demonstrate, based on the appearance of the leucine zippers in DNA binding basic region (bZIP) and helix-loop-helix (bHLH-ZIP), both of which have coiled-coil characteristics, that the presence of a leucine zipper is the hallmark of the coiled-coil itself rather than the leucine repeat.

2.2.5. STABLE INPUT ALGORITHM

All coiled-coil prediction algorithms mentioned thus far are based on statistical probability. The Stable Input Algorithm [3] is the first algorithm created to determine the presence of coiled-coils using the experimentally determined stability and helical propensity values of various amino acids present in the protein sequence. This algorithm also provides stability clusters of amino acids based on the varying amount of residues at a and d positions in the coiled-coil. The SWISS-PROT database was used as the source data for this algorithm. This algorithm is the precursor to the Stable Coil Algorithm implemented in this project.

Once coiled-coils are extracted from the SWISS-PROT proteins, the Stable Input Algorithm uses a windowing function over which to calculate the relative stability of the coiled-coil. When researchers tested this algorithm using window widths of 7 and 11, the results yielded some interesting observations concerning the stability of coiled-coils. According to these results, hydrophobic amino acids occupy on average 65% of the a and d positions in the SWISS-PROT dataset. As each hydrophobic core is added to the sequence length, the number of hydrophobic clusters decreases by a factor of 2, while the number of non-hydrophobic clusters decreases by a factor of 8. Also, the cluster frequency decreases as the heptad length increases. The Stable Input Algorithm does not evaluate the intermediate positions in the coiled-coil as strictly as it does the start and end positions. The result is that clusters are missed about 70% of the time. Also, researchers found it difficult to compare results from different sequences or to perform quantitative queries, as the results were not stored in a database. Furthermore, they found that the algorithm is more susceptible to false positives; this can be attributed to the short window lengths and to the method by which the stability values were assigned to each amino acid.

It is interesting to observe that although all of these algorithms predict coiled-coils in proteins, either by using statistical approaches or by using stability values, none of them stores this data to allow users to perform customized searches. All of the above algorithms require the user to enter a protein sequence or a file in a certain format to produce the desired results. Not only is this approach inconvenient, it is also time consuming, particularly if the users want to run a large set of data.

The initial emphasis of this project is to retrieve additional information from the SWISS-PROT database concerning the electrostatic interactions among the amino acids within these coiled-coils. The scope of this project also entails helping researchers study in detail the role various amino acid residues play in the hydrophobic core of the coiled-coils. Finally, the project will try to improve the performance, accuracy, and user friendliness of the first rendition of the Stable Coil Algorithm, described in the next chapter. Hence, this project was undertaken to provide the researchers at UCHSC with a bigger, more readily accessible dataset of coiled-coils. The architecture and implementation of the database and the website are described in detail later in this document.

Chapter 3

STABLE COIL ALGORITHM

The researchers at UCHSC have used a model protein, consisting of two identical 38 residue polypeptide chains covalently linked at their N termini via a disulfide bridge, to determine the effects that substituting different amino acids in a coiled-coil sequence may have on the coiled-coil stability. This work forms the basis for the design of new coiled-coil structures, allows better understanding of the structural relationships between amino acids in a protein sequence, and also provides impetus to the design of new algorithms to predict the presence of coiled-coils within native protein sequences. The study of the coiled-coil domain has a number of advantages. These advantages are best detailed by [5]:

Abundant motif in proteins

Only one type of secondary structure is present, i.e., the α-helix

Only two interacting α-helices are required to introduce tertiary and quaternary structure

Diversity in length makes it an ideal system to test predictions

All non-covalent interactions that stabilize the three-dimensional structure of the proteins are found in the coiled-coil domain

Experimentally easy to analyze structure and stability.

To understand proteins and the functions they perform, it is necessary to predict the occurrence of a coiled-coil before performing expensive and time consuming experiments. Hence the researchers at UCHSC have experimentally derived stability values for the twenty amino acids in their different heptad positions, as listed in Table 3-2.

Amino Acid Name   One Letter Code   Three Letter Code   Stability Value at Offset A   Stability Value at Offset D   Stability Value at Other Positions

Alanine           A                 Ala                 1.245                        1.8                          0.528
Cysteine          C                 Cys                 1.245                        1.8                          0.237
Aspartic Acid     D                 Asp                 -0.75                        0.9                          0.116
Glutamic Acid     E                 Glu                 0.255                        0.45                         0.176
Phenylalanine     F                 Phe                 2.75                         2.4                          0.264
Glycine           G                 Gly                 0                            0                            0
Histidine         H                 His                 0.67                         1.4                          0.182
Isoleucine        I                 Ile                 3.185                        3.3                          0.325
Lysine            K                 Lys                 1.045                        0.9                          0.385
Leucine           L                 Leu                 2.985                        3.7                          0.446
Methionine        M                 Met                 2.96                         3.4                          0.369
Asparagine        N                 Asn                 1.67                         1.5                          0.182
Proline           P                 Pro                 -10                          -10                          -5
Glutamine         Q                 Gln                 1.18                         2.05                         0.336
Arginine          R                 Arg                 0.86                         0.35                         0.495
Serine            S                 Ser                 0.605                        0.9                          0.182
Threonine         T                 Thr                 1.345                        1.2                          0.154
Valine            V                 Val                 3.295                        2.35                         0.231
Tryptophan        W                 Trp                 1.635                        1.75                         0.27
Tyrosine          Y                 Tyr                 2.285                        2.5                          0.237

Table 3-2: Helical Propensity and Stability Values of the 20 standard amino acids at various positions in the heptad

Using these stability values as its inputs, the Stable Coil Algorithm determines the presence of a coiled-coil and the offset at which this coiled-coil occurs in a given protein. Thus, it can be said that the Stable Coil Algorithm is based on the structural stability of the coiled-coil region. The goal of this project is to provide researchers with enough data to perform quantitative analysis on a set of proteins and coiled-coils. The section below describes the workings of the Stable Coil Algorithm.

3.1 STABLE COIL ALGORITHM: PART I

PROBLEM: Calculate the stability arrays for each protein in the Stable Coil database.

INPUTS: The protein sequence (protein_array), The window size (window_size), The array of stability coefficients of amino acids depending on their heptad locations (stability_coefficients)

OUTPUTS: The seven scoring arrays containing the stability scores for an individual protein, where each array starts at a different heptad offset a thru g (score_array)

ALGORITHM:

1 FOR heptad_offset ← a to g
2     DO local_offset ← heptad_offset
3     FOR i ← 1 to length(protein_array)
4         DO stability_array[heptad_offset][i] ← stability_coefficients[local_offset][protein_array[i]]
5         IF local_offset = g
6             THEN local_offset ← a
7             ELSE local_offset ← local_offset + 1

9     DO FOR i ← 1 to length(protein_array)
10        IF length(protein_array) – i > window_size
11            THEN score_array[heptad_offset][i] ← Σ_{j = i}^{i + window_size} stability_array[heptad_offset][j]
12            ELSE score_array[heptad_offset][i] ← Σ_{j = i}^{length(protein_array)} stability_array[heptad_offset][j]
13 RETURN score_array
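Part I can be sketched in Python as follows. This is an illustrative reading of the pseudocode, not the project's actual source. The coefficients are transcribed from the stability table earlier in this chapter. Two points are assumptions of the sketch: residues without a table entry (such as X) are scored 0, and each window sums exactly window_size residues, following the description of aggregating 42 residues at a time (read literally, the upper bound i + window_size in the pseudocode would include one more residue). All function names are hypothetical.

```python
HEPTAD = "abcdefg"

# residue -> (value at a, value at d, value at other positions),
# transcribed from the stability table in this chapter.
STABILITY = {
    "A": (1.245, 1.8, 0.528),  "C": (1.245, 1.8, 0.237),
    "D": (-0.75, 0.9, 0.116),  "E": (0.255, 0.45, 0.176),
    "F": (2.75, 2.4, 0.264),   "G": (0.0, 0.0, 0.0),
    "H": (0.67, 1.4, 0.182),   "I": (3.185, 3.3, 0.325),
    "K": (1.045, 0.9, 0.385),  "L": (2.985, 3.7, 0.446),
    "M": (2.96, 3.4, 0.369),   "N": (1.67, 1.5, 0.182),
    "P": (-10.0, -10.0, -5.0), "Q": (1.18, 2.05, 0.336),
    "R": (0.86, 0.35, 0.495),  "S": (0.605, 0.9, 0.182),
    "T": (1.345, 1.2, 0.154),  "V": (3.295, 2.35, 0.231),
    "W": (1.635, 1.75, 0.27),  "Y": (2.285, 2.5, 0.237),
}


def coefficient(residue, position):
    """Stability value of a residue at a heptad position (0.0 if not in the table)."""
    at_a, at_d, elsewhere = STABILITY.get(residue, (0.0, 0.0, 0.0))
    if position == "a":
        return at_a
    if position == "d":
        return at_d
    return elsewhere


def stability_array(protein, heptad_offset):
    """Steps 1 to 7: per-residue stability values for one starting offset."""
    first = HEPTAD.index(heptad_offset)
    return [coefficient(residue, HEPTAD[(first + k) % 7])
            for k, residue in enumerate(protein)]


def score_array(stability, window_size=42):
    """Steps 9 to 12: sum of up to window_size values starting at each index."""
    prefix = [0.0]  # prefix sums turn every windowed sum into one subtraction
    for value in stability:
        prefix.append(prefix[-1] + value)
    n = len(stability)
    return [prefix[min(i + window_size, n)] - prefix[i] for i in range(n)]


def stable_coil_part1(protein, window_size=42):
    """Return the seven scoring arrays, keyed by starting heptad offset."""
    return {offset: score_array(stability_array(protein, offset), window_size)
            for offset in HEPTAD}
```

Calling `stable_coil_part1(sequence)` on a one-letter protein sequence returns a dictionary with one scoring array per starting offset a through g. The prefix-sum approach keeps the cost linear in the sequence length regardless of the window size.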

3.2 STABLE COIL ALGORITHM: PART II

PROBLEM: Determine the presence of a coiled-coil in a protein.

INPUTS: The cut off value (cutoff_value), The seven scoring arrays determined in Part I (score_array)

OUTPUTS: The coiled-coil count array containing the number of coiled-coils present in the protein sequence (coiled_coil_array)

ALGORITHM:


2     DO local_offset ← heptad_offset
3     IF score_array[heptad_offset][i] >= cutoff_value
4         THEN marker_array[heptad_offset][i] ← 1
5         ELSE marker_array[heptad_offset][i] ← 0
6     counter ← 0
7     IF marker_array contains 42 or more consecutive ones
8         THEN coiled_coil_array[counter] ← amino acids corresponding to the marker sequence
9         counter ← counter + 1
10 RETURN coiled_coil_array
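A possible Python rendering of Part II for a single heptad offset is sketched below. It is not the project's code; the function names are hypothetical, and reporting each hit as a (start index, residues) pair is a choice made for this illustration. The scoring array is assumed to come from Part I.

```python
def marker_array(scores, cutoff_value=38.0):
    """Steps 3 to 5: mark each position 1 if its score reaches the cutoff, else 0."""
    return [1 if score >= cutoff_value else 0 for score in scores]


def find_coiled_coils(protein, scores, cutoff_value=38.0, min_run=42):
    """Steps 6 to 10: collect every run of at least min_run consecutive ones.

    Returns a list of (start_index, residues) pairs for one heptad offset.
    """
    markers = marker_array(scores, cutoff_value)
    found = []
    run_start = None
    # A trailing 0 acts as a sentinel so that a run reaching the end is closed.
    for i, flag in enumerate(markers + [0]):
        if flag and run_start is None:
            run_start = i
        elif not flag and run_start is not None:
            if i - run_start >= min_run:
                found.append((run_start, protein[run_start:i]))
            run_start = None
    return found
```

Running `find_coiled_coils` once per heptad offset, with the matching scoring array, gives the predicted coiled-coil segments for each of the seven registers.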

In steps 1 to 8 of Part I of the algorithm, we produce seven permutations of the stability values, each starting at a different heptad offset. Initially the protein sequence is assumed to start at heptad offset a. Then we assign a stability value to each amino acid in the protein sequence depending on its heptad position. The stability coefficients required for this are obtained from Table 3-2, which was determined experimentally by researchers at UCHSC. This process is repeated seven times, each time with the protein sequence starting at a different heptad offset (a, b, c, d, e, f and g). The output of these steps is seven stability arrays (stability_array), each starting at a different heptad offset.

Example:

Heptad Offset Position a b c d e f g

Sequence Of Amino Acids M D Y L D L G

Stability Values 2.96 0.116 0.237 3.7 0.116 0.446 0.000

Table 3-3: Coiled-coil Sequence starting at offset a

Heptad Offset Position b c d e f g a



Table 3-4: Coiled-coil Sequence starting at offset b

To detect the presence of a coiled-coil, we use two experimentally determined values, the cutoff value of 38 and the window length of 42. These values were found experimentally at UCHSC to give the best coiled-coil predictions. The next step is to calculate seven scoring arrays by aggregating the stability arrays. This is where we use the window length. The aggregation is performed for 42 residues at a time. If fewer than 42 residues remain, we aggregate the values up to the end of the sequence. This provides us with seven arrays, known as the scoring arrays (score_array).

Example:

Heptad Offset Position a b c d e f g



Scoring Arrays 7.575 4.615 4.499 4.262 0.562 0.446 0.000

Table 3-5: An aggregation of stability values 42 amino acids at a time

The next experimentally determined value, the cutoff value of 38, is used here. If the aggregate score at a position in the protein sequence is greater than or equal to 38, we mark that position as 1; otherwise we mark it as 0, thus generating a marker_array. Then we look for the occurrence of 42 or more consecutive 1’s in the marker_array. If we find this pattern, we predict the presence of a coiled-coil, with the starting location of the pattern as the starting heptad offset of the coiled-coil. Only coiled-coils with 42 or more residues are considered for this project, as researchers at UCHSC are interested in cluster patterns found in large coiled-coils.

Example:

Heptad Offset Position    a       b        c        d        e        f    g    …

Sequence Of Amino Acids   M       D        Y        L        D        L    G    …

Scoring Arrays            52.75   50.756   40.756   38.254   37.656   …    …

Coiled-coil Arrays        1       1        1        1        0        …    …

Table 3-6: Determining the presence of a coiled-coil in the protein sequence
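The thresholding and run detection described above can be sketched as follows. The function names and the 0-based, end-exclusive index convention are assumptions made for this illustration.

```python
CUTOFF = 38.0   # cutoff value stated in the text
MIN_RUN = 42    # minimum number of consecutive 1's for a coiled-coil

def marker_array(scores, cutoff=CUTOFF):
    """Mark each position 1 if its score reaches the cutoff, else 0."""
    return [1 if score >= cutoff else 0 for score in scores]

def find_coiled_coils(markers, min_run=MIN_RUN):
    """Return (start, end) pairs, 0-based with end exclusive, for every run
    of consecutive 1's that is at least min_run long."""
    runs, start = [], None
    for i, mark in enumerate(markers):
        if mark == 1 and start is None:
            start = i                      # a new run of 1's begins
        elif mark == 0 and start is not None:
            if i - start >= min_run:
                runs.append((start, i))    # run ended and is long enough
            start = None
    # A run may extend to the very end of the sequence.
    if start is not None and len(markers) - start >= min_run:
        runs.append((start, len(markers)))
    return runs

# The five scores shown in Table 3-6.
print(marker_array([52.75, 50.756, 40.756, 38.254, 37.656]))
```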

3.3 CLUSTER PATTERNS IN COILED-COILS

Once a coiled-coil has been predicted, the presence of a cluster is determined by the hydrophobic residues occurring at the a and d positions. If one of the hydrophobic amino acids Phenylalanine, Isoleucine, Leucine, Methionine, Valine, or Tyrosine is found at an a or d position, that position contributes a 1 to the cluster sequence; otherwise it contributes a 0.

Example:

Heptad Offset Position     d  e  f  g  a  b  c  d  e  f  g  a  b  c
Sequence Of Amino Acids    L  S  T  R  I  Y  M  V  Q  P  N  L  G  P
Cluster Numbering          1           1        1           1

Table 3-7: Determining the presence of a cluster (stabilizing or de-stabilizing) in the coiled-coil sequence

For this sequence, the cluster pattern is 1111, as all the amino acids in the a and d positions are hydrophobic. A run of three or more consecutive 1’s is classified as a stabilizing cluster, while a run of three or more consecutive 0’s is termed a destabilizing cluster, indicating a flexible region of lower stability. A destabilizing cluster in a coiled-coil consists of the following amino acid residues: Alanine, Cysteine, Aspartic Acid, Glutamic Acid, Glycine, Histidine, Lysine, Asparagine, Proline, Glutamine, Arginine, Serine, Threonine, and Tryptophan. The researchers at UCHSC are interested in cluster patterns of coiled-coils because, as the names suggest, stabilizing clusters aid the stability of the coiled-coil and destabilizing clusters hinder it. The Stable Coil database currently lists clusters and de-clusters of varying lengths (3, 4, 5, 6, and 6plus). A number of summary queries based on these results are also available to help the researchers further understand the relationships between clusters and the stability of the coiled-coil.
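A minimal sketch of how the cluster sequence could be derived from a coiled-coil and its starting heptad offset is given below. The function name is hypothetical; the hydrophobic set uses the one-letter codes of the six residues listed above.

```python
HEPTAD = "abcdefg"
# One-letter codes for Phe, Ile, Leu, Met, Val, and Tyr, as listed in the text.
HYDROPHOBIC = set("FILMVY")

def cluster_string(coil, start_offset):
    """Build the 1/0 cluster sequence from the residues at a and d positions;
    residues at all other heptad positions are ignored."""
    first = HEPTAD.index(start_offset)
    bits = []
    for i, residue in enumerate(coil):
        if HEPTAD[(first + i) % 7] in "ad":
            bits.append("1" if residue in HYDROPHOBIC else "0")
    return "".join(bits)

# The sequence of Table 3-7, which starts at heptad offset d.
print(cluster_string("LSTRIYMVQPNLGP", "d"))
```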

Chapter 4

PROJECT ARCHITECTURE

The project began by understanding what enhancements could be made to the first implementation [4] of the Stable Coil Algorithm. The first hurdle was to restore the database to its original working state. A review of the Perl scripts that handle the scraping of the data showed that they did not correctly retrieve updates posted to the SWISS-PROT files on the ExPASy (Expert Protein Analysis System) [10] website. Moreover, the performance of the Perl scripts could be vastly improved, and better error logging could make them less error prone. The website could also be improved by allowing users to chart their results dynamically and to save their results for offline use. These reasons prompted a rewrite of the original implementation.

The initial step is to recreate the database in MySQL 5.0. MySQL 5.0 was selected because it has become the database of choice for a new generation of applications built on the LAMP stack (Linux, Apache, MySQL, PHP / Perl / Python), which the project also uses. Among MySQL 5.0’s many features, the one most useful to this project is its support for stored procedures, which helped improve the performance of the data loads. The Perl scripts are modified to first compare the files on the ExPASy repository with the files on the local system; if the file modification date or file size differs, the scripts retrieve the data and run it through the Stable Coil Algorithm. The rewritten load code then loads the retrieved proteins and coiled-coils into their respective tables. The modified Perl scripts also use table-driven downloads, which give the user the flexibility to turn downloads of particular files on or off as required.

The next step of the project is the extraction of interchain and intrachain electrostatic interactions and heptad offsets from the retrieved coiled-coils. The researchers at UCHSC have provided a list of salt residues as inputs to the data extraction process. The coiled-coils are then split into their heptad offsets, and a table is created that lists the salt residues occurring at particular heptad offsets of the given coiled-coils.

The final step of the project is to produce an HTML/PHP/JavaScript-based, data-driven website, the purpose of which is to highlight the electrostatic and hydrophobic interactions among amino acid residues in a coiled-coil. This enables the researchers at the Department of Biochemistry and Molecular Genetics at the University of Colorado Health Sciences Center (UCHSC) to query the underlying database, built using data from the ExPASy website. In querying the database, researchers can retrieve information regarding the coiled-coils that are present in the proteins, the salt residues that provide the electrostatic interactions for those proteins, and the amino acid residues that occur at a given heptad offset within the proteins. They can also run summary queries to determine information such as the frequency of amino acid residues occurring at certain heptad positions, or the kinds of electrostatic interactions that occur most frequently among the proteins in the database.

4.1 DATABASE ARCHITECTURE

The backbone of this entire project is the Stable Coil Database, which contains tables with information regarding the proteins, their coiled-coils and the salt bridges that provide the electrostatic interactions. The initial step was to recreate the database using MySQL 5.0, as the original rendition of the database had been corrupted. The MySQL database management system has become quite popular in recent years, particularly in the area of web services, where it is used in combination with a web server to construct database-backed web sites that involve dynamic content generation. There are several reasons for the popularity of MySQL: it is fast, and it is easy to set up, use, and administer. Among MySQL 5.0’s many features, the ones most useful to this project are its support for AUTO_INCREMENT and stored procedures.

A useful property of an AUTO_INCREMENT column is that unique values do not need to be assigned manually: MySQL generates a unique primary ID automatically whenever a row is inserted.

Stored routines (procedures and functions) are supported in MySQL 5.0. A stored procedure is a set of SQL statements that can be stored in the server. Once this has been done, programmers don't need to keep reissuing the individual statements but can refer to the stored procedure instead. Situations where stored routines can be particularly useful include:

When multiple client applications are written in different languages or work on different platforms, but need to perform the same database operations.

When security is paramount. Banks, for example, use stored procedures and functions for all common operations. This provides a consistent and secure environment, and routines can ensure that each operation is properly logged. In such a setup, applications and users have no access to the database tables directly, and can only execute specific stored routines.

Stored routines can provide improved performance because less information needs to be sent between the server and the client. The tradeoff is that this does increase the load on the database server because more of the work is done on the server side and less is done on the client (application) side. The performance improvement provided by the stored procedures was deemed more acceptable than the increase of load on the server.

4.1.1 STRUCTURE AND CONTENT OF THE TABLES INVOLVED IN PHASE 1

The Cocolysis Database is currently hosted on the University of Colorado Health Sciences Server (simbio.uchsc.edu). The tables involved in the first phase of the project are listed in the ER diagram in Figure 4-1.

Figure 4-12: E-R Diagram detailing the relationship between tblProtein and tblCoiledCoil

The first part of the database consists of two main tables: a table of proteins containing coiled-coils (tblProtein) and a table of the coiled-coils present in the proteins (tblCoiledCoil). A bridge table called tblProteinCoil connects these two tables and provides information on which coiled-coils are present in which proteins. There is also a source lookup table (tblDataSourceLookup), which lists the sources from which the data was retrieved.

Every table has two audit columns, Record_Status and Change_Date. The Record_Status column distinguishes between records that are inserted or updated. The Change_Date column indicates the last time the record was inserted or updated. These columns help to keep track of how and when individual records have been changed.

The next section describes the structure and content of each of these tables in detail.

4.1.1.1 PROTEIN TABLE

The protein table is used to store information regarding proteins that have a coiled-coil motif. As of July 7th 2008, there are 87,368 proteins with coiled-coil motifs in the database. The table structure is listed in Figure 4-2 below.

Figure 4-13: Structure of Protein Table (tblProtein)

The columns specific to this table are described in detail here:

ProteinID – The ProteinID is an auto increment column. New IDs are created every time a row is inserted. This column is also the primary key for this table. The column also has a referential integrity relationship with tblProteinCoil.ProteinID in the bridge table used to connect the proteins with coiled-coils.

SourceID – The SourceID is looked up against tblDataSourceLookup. This ID indicates the source of the data. Currently there are just two data sources in the table: SWISS-PROT, a curated protein sequence database that strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, and its post-translational modifications), and TREMBL, a computer-annotated supplement of SWISS-PROT that contains all the translations of EMBL nucleotide sequence entries not yet integrated into SWISS-PROT.

EntryName – This field indicates the ExPASy entry name from the SWISS-PROT database.

EntryDataClass – This field indicates the ExPASy data class from the SWISS-PROT database. The data class describes the status of the protein, i.e., whether or not the data has been manually reviewed by the UniProtKB curators.

Accession – This field refers to the Accession Number associated with the protein entry name. The purpose of accession numbers is to provide a stable way of identifying entries from release to release: although protein names can change in future releases, the accession number remains the same [8].

ProteinName – This field holds the actual protein name, as opposed to the ExPASy name. It should be noted that different organisms may have the same protein although the sequence of the protein may be different.

Organism – This field refers to the species of the organism in which the protein is found. In the ExPASy database, the organism names are provided in both Latin genus and English name formats. For viruses, only the English name is provided.

ProteinSeqLength – This field refers to the number of amino acids in the protein sequence.

ProteinSeqMolWeight – This field indicates the molecular weight of the protein.

ProteinSeqCRC64 – This field refers to the CRC 64-bit checksum of the protein sequence.

ProteinSeqCreateDate, ProteinSeqModDate – These fields refer to the creation and modification dates associated with the protein.

Sequence – This field stores the amino acid sequence of the protein.

4.1.1.2 COILED-COIL TABLE

The coiled-coil table is used to store information regarding coiled-coils that have been retrieved from the SWISS-PROT proteins using the Stable Coil Algorithm. Only unique coiled-coils are stored in this table, because the same coiled-coil can be found in multiple proteins and such replications could skew the results of summary queries. To be considered unique, the coiled-coil must have a different amino acid sequence and a different structural offset. As of June 7th 2008, there are 141,204 unique coiled-coils in the database. The table structure is listed in Figure 4-3 below.

Figure 4-14: Structure of Coiled-coil Table (tblCoiledCoil)


CoilID – The CoilID is an auto increment column. New IDs are created every time a row is inserted. This column is also the primary key for this table. The column also has a referential integrity relationship with tblProteinCoil.CoilID in the bridge table that connects the coiled-coils with their corresponding proteins.

CoilSequence – This field stores the amino acid sequence of the coiled-coil. The coil sequence is a substring of the protein sequence and is retrieved using the Stable Coil Algorithm. As researchers are interested in long coiled-coils, the sequences in this table are 42 or more amino acids in length.

Cluster – This field stores the clusters found in the coiled-coils.

CoilLength – This field indicates the number of amino acids in the coil sequence.

Offset – This field refers to the starting heptad offset of the coiled-coil. The values range from a to g.

Cluster3, Cluster4, Cluster5, Cluster6, and Cluster6p – These fields store the number of instances of 3, 4, 5, 6, or 7 or more 1’s occurring sequentially in a cluster within a coiled-coil. For example, if the cluster sequence is 0111110101110010, the field Cluster3 would have a value of 1 and Cluster5 would have a value of 1.

De-cluster3, De-cluster4, De-cluster5, De-cluster6, and De-cluster6p – These fields store the number of instances of 3, 4, 5, 6, or 7 or more 0’s occurring sequentially in a cluster within a coiled-coil. For example, if the cluster sequence is 01010001111110, the field De-cluster3 would have a value of 1.
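The counting behind these fields can be sketched by grouping the cluster sequence into runs. The function name and the dictionary keys are hypothetical; runs are bucketed by their exact length, with the "6p" bucket holding runs of 7 or more, as described for Cluster6p and De-cluster6p.

```python
from itertools import groupby

def count_runs(cluster, symbol):
    """Count runs of `symbol` ('1' or '0') by length: exactly 3, 4, 5, or 6,
    and '6p' for runs of 7 or more. Runs shorter than 3 are not counted."""
    counts = {"3": 0, "4": 0, "5": 0, "6": 0, "6p": 0}
    for value, group in groupby(cluster):
        length = len(list(group))
        if value != symbol or length < 3:
            continue
        counts[str(length) if length <= 6 else "6p"] += 1
    return counts

# The two example sequences from the field descriptions above.
print(count_runs("0111110101110010", "1"))
print(count_runs("01010001111110", "0"))
```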

4.1.1.3 PROTEIN COIL TABLE

In nature, we find that there are proteins which contain many coiled-coils; also the same coiled-coil can be found in multiple proteins. In the technical sense, it can be said the protein sequences and coiled-coil sequences share a many-to-many relationship. The protein coil table is a bridge table for tblProtein and tblCoiledCoil, which stores information on the many-to-many relationships between proteins and coiled-coils. To understand this, a scenario is posited below.

We have two proteins: the P1 protein, containing coiled-coils C1 and C2, and the P2 protein, containing the C2 coiled-coil. For the first protein, we insert P1 into tblProtein, C1 and C2 into tblCoiledCoil, and P1-C1 and P1-C2 into tblProteinCoil. For the second protein, we insert P2 into tblProtein and P2-C2 into tblProteinCoil. Thus, we do not insert any coiled-coils for the P2 protein into tblCoiledCoil, as C2 already exists in the table.
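The scenario can be reproduced with a small in-memory sketch. It uses Python's sqlite3 module in place of MySQL, reduces each table to the few columns needed here, and treats a coiled-coil as unique by its sequence alone; the add_protein helper and the placeholder values P1, P2, C1, and C2 are for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL in this sketch
conn.executescript("""
CREATE TABLE tblProtein (
    ProteinID INTEGER PRIMARY KEY AUTOINCREMENT,
    EntryName TEXT UNIQUE
);
CREATE TABLE tblCoiledCoil (
    CoilID INTEGER PRIMARY KEY AUTOINCREMENT,
    CoilSequence TEXT UNIQUE
);
CREATE TABLE tblProteinCoil (
    ProteinCoilID INTEGER PRIMARY KEY AUTOINCREMENT,
    ProteinID INTEGER REFERENCES tblProtein(ProteinID),
    CoilID INTEGER REFERENCES tblCoiledCoil(CoilID)
);
""")

def add_protein(name, coils):
    """Insert a protein, any coils not yet stored, and the bridge rows."""
    protein_id = conn.execute(
        "INSERT INTO tblProtein (EntryName) VALUES (?)", (name,)
    ).lastrowid
    for coil in coils:
        # The coil is inserted only if it is not already in the table.
        conn.execute(
            "INSERT OR IGNORE INTO tblCoiledCoil (CoilSequence) VALUES (?)", (coil,)
        )
        coil_id = conn.execute(
            "SELECT CoilID FROM tblCoiledCoil WHERE CoilSequence = ?", (coil,)
        ).fetchone()[0]
        conn.execute(
            "INSERT INTO tblProteinCoil (ProteinID, CoilID) VALUES (?, ?)",
            (protein_id, coil_id),
        )

add_protein("P1", ["C1", "C2"])
add_protein("P2", ["C2"])
```

After both calls, tblCoiledCoil holds two rows while tblProteinCoil holds three, one per protein and coiled-coil pairing.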

As of July 7th 2008, there are 179,054 protein-to-coiled-coil records in this table. The table structure is listed in Figure 4-4 below.

Figure 4-15: Structure of Protein Coil Table (tblProteinCoil)


ProteinCoilID – The ProteinCoilID is an auto increment column. New IDs are created every time a row is inserted. This column is also the primary key for this table.

ProteinID – This field is used to indicate the proteins that contain coiled-coils. This column is foreign keyed to the ProteinID column in the tblProtein table.

CoilID – This field is used to indicate the coiled-coils in proteins. This column is foreign keyed to the CoilID column in the tblCoiledCoil table.

CoilLocation – This field indicates the location of the coiled-coil sequence in the protein. This is the only place where we can save this location as coiled-coils share a many-to-many relationship with proteins. Also, since MySQL does not support nested tables the way Oracle does, it is not possible to save this data in either tblProtein or tblCoiledCoil tables.

4.1.2 STRUCTURE AND CONTENT OF THE TABLES INVOLVED IN PHASE 2

The second part of the Stable Coil database consists of two additional main tables and one other bridge table. The tblSaltBridge table contains all the i to i + 3, i to i + 4, and i to i’ + 5 electrostatic interactions, which include attractions and repulsions between the Lys, Glu, Asp, and Arg amino acids. The lookup table (tblSaltResiduesLookup) provides information on the different salt residue interactions that interest researchers. The researchers can add new residues to the database, and the Perl programs will automatically retrieve these residues from the coiled-coils. The tblSplitHeptadCoils table consists of all the coiled-coils split into their individual heptads (gabcdef). A bridge table, tblHeptadSalt, provides information about which heptad of the coiled-coil contains which salt bridge. The tables involved in the second phase of this project are listed in the ER diagram in Figure 4-5. The next section describes the structure and content of each of these tables in detail.

Figure 4-16: E-R Diagram detailing the relationship between tblSaltBridge and tblSplitHeptadCoils

4.1.2.1 SALT RESIDUES LOOKUP TABLE

The Salt Residues table is a lookup table that provides information on the interaction type and the salt bridge type. The researchers can add extra records to this table for other salt bridges, and the related Perl programs will automatically retrieve information for those salt bridges from the coiled-coil table.

As of July 7th 2008, there are 48 types of salt bridges present in the table. The table structure is listed in Figure 4-6 below.

Figure 4-17: Structure of Salt Residues Lookup Table (tblSaltResiduesLookup)


SaltBridgeID – The SaltBridgeID is an auto increment column. New IDs are created every time a row is inserted. This column is also the primary key for this table. The column also has a referential integrity to the tblHeptadSalt.SaltBridgeID, which is the bridge table that connects the salt bridges with coiled-coil heptads.

SaltResidueID – The SaltResidueID is looked up against tblSaltResiduesLookup. This ID indicates the type of salt bridge, for example whether an “i to i + 3” salt bridge exists between salt residues Lys and Glu. This column is foreign keyed to tblSaltResiduesLookup.SaltResidueID. Currently there are 48 different attraction and repulsion salt residues in the lookup table.

CoilID – This field is foreign keyed to tblCoiledCoil.CoilID. This field tells us which salt bridges are located in which coiled-coils.

SaltBridgeMatch – This field refers to all the residues contained within the said salt bridge. These intermediate amino acids can help the researchers understand which residues occur most commonly in a particular type of salt bridge.

SaltStartLoc – This field refers to start location of the salt bridge in the coiled-coil.

SaltEndLoc – This field refers to end location of the salt bridge in the coiled-coil.

SaltStartOff – This field refers to the starting offset of the salt bridge.

SaltEndOff – This field refers to the ending offset of the salt bridge. The offset fields help researchers identify what salt residues commonly occur at what heptad offsets in a coiled-coil.

4.1.2.2 SALT BRIDGE TABLE

The salt bridge table is used to store information regarding the electrostatic interactions (i to i + 3, i to i + 4, and i to i’ + 5) between the amino acids in a given coiled-coil. The researchers are interested in the following electrostatic interactions:

Attractions    Repulsions
Lys / Glu      Lys / Lys
Glu / Lys      Lys / Arg
Lys / Asp      Arg / Arg
Asp / Lys      Arg / Lys
Arg / Glu      Glu / Glu
Glu / Arg      Glu / Asp
Arg / Asp      Asp / Glu
Asp / Arg      Asp / Asp

Table 4-8: Amino acid electrostatic interactions which the researchers at UCHSC are interested in

This table provides information on the position of the salt bridge in the coiled-coil, as well as the start and end heptad offsets of the salt bridge. These residues are searched using a regular expression parser.
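A regular-expression search of this kind might look like the sketch below. It is not the project's Perl parser: the function name and the 0-based inclusive locations are assumptions, and only spacings within a single chain (such as i to i + 3 and i to i + 4) are covered, since the interchain i to i’ + 5 case involves the partner chain.

```python
import re

def find_pairs(coil, first, second, spacing):
    """Find every `first` residue followed `spacing` positions later by
    `second`. Returns (start, end, match) tuples with 0-based inclusive
    locations; `match` includes the intermediate residues. A lookahead is
    used so that overlapping matches are all reported."""
    pattern = re.compile(
        "(?=(%s.{%d}%s))" % (re.escape(first), spacing - 1, re.escape(second))
    )
    return [
        (m.start(), m.start() + spacing, m.group(1))
        for m in pattern.finditer(coil)
    ]

# Lys/Glu "i to i + 3" pairs in a made-up sequence (K = Lys, E = Glu).
print(find_pairs("KAAEKAAE", "K", "E", 3))
```

The returned start and end locations and the matched substring correspond in spirit to the SaltStartLoc, SaltEndLoc, and SaltBridgeMatch fields of the salt bridge table.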

Using the salt bridge table, the researchers are able to answer questions such as: what is the total number of Lys/Glu i to i + 3 salt bridges, and where are these salt bridges distributed within the coiled-coil? As of July 7th 2008, there are 1,017,241 salt bridges present in the 141,204 unique coiled-coils. The table structure is listed in Figure 4-7 below.

Figure 4-18: Structure of Salt Bridge Table (tblSaltBridge)

SaltBridgeID – The SaltBridgeID is an auto increment column. New IDs are created every time a row is inserted. This column is also the primary key for this table. The column also has a referential integrity to tblHeptadSalt.SaltBridgeID, which is the bridge table that connects the salt bridges with coiled-coil heptads.

SaltResidueID – The salt residue id is looked up against tblSaltResiduesLookup. This id indicates the type of salt bridge, for example whether it is an “i to i + 3” salt bridge between salt residues Lys and Glu. This column is foreign keyed to tblSaltResiduesLookup.SaltResidueID. Currently there are 48 different attraction and repulsion salt residues in the lookup table.

CoilID – This field is foreign keyed to tblCoiledCoil.CoilID. This field tells us which salt bridges are located in which coiled-coils.

SaltBridgeMatch – This field refers to all the residues contained within the said salt bridge. These intermediate amino acids can help the researchers determine which residues occur most commonly in a particular type of salt bridge.

SaltStartLoc – This field refers to start location of the salt bridge in the coiled-coil.

SaltEndLoc – This field refers to end location of the salt bridge in the coiled-coil.

SaltStartOff – This field refers to the starting offset of the salt bridge.

SaltEndOff – This field refers to the ending offset of the salt bridge. The offset fields help us identify what salt residues commonly occur at what heptad offsets in a coiled-coil.

4.1.2.3 COILED-COIL HEPTAD TABLE

The coiled-coil heptad table is generated by splitting the coiled-coils into their respective heptad sequences. A heptad is defined as the sequence of offsets g, a, b, c, d, e, f. The heptad starts with g here to capture all i to i + 5 interactions in a given heptad of a coiled-coil. The table contains the residues occurring at each of these offsets for each heptad of the coiled-coils in tblCoiledCoil. This table helps build queries that determine whether certain residues or certain pairs of residues occur more frequently than others.
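The split into g-to-f heptads can be sketched as follows. The function name, the 0-based inclusive locations, and the decision to keep partial heptads at either end of the coil are assumptions of this illustration; the text does not state how incomplete heptads are handled.

```python
HEPTAD = "abcdefg"

def split_heptads(coil, start_offset):
    """Split a coil into heptads that each begin at offset g.

    Returns (start_loc, end_loc, residues, start_off, end_off) tuples with
    0-based inclusive locations. Partial heptads at either end are kept
    (an assumption of this sketch)."""
    first = HEPTAD.index(start_offset)
    offsets = [HEPTAD[(first + i) % 7] for i in range(len(coil))]
    heptads, start = [], 0
    for i in range(1, len(coil) + 1):
        # Close the current heptad at the end of the coil or just before the next g.
        if i == len(coil) or offsets[i] == "g":
            heptads.append(
                (start, i - 1, coil[start:i], offsets[start], offsets[i - 1])
            )
            start = i
    return heptads

# The 14-residue sequence of Table 3-7, which starts at offset d.
for heptad in split_heptads("LSTRIYMVQPNLGP", "d"):
    print(heptad)
```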

As of June 7th 2008, there are 1,186,214 heptads for 141,204 unique coiled-coils in the database. The table structure is listed in Figure 4-8 below.

Figure 4-19: Structure of Coiled-coil Heptad Table (tblSplitHeptadCoil)


HeptadOffsetID – The HeptadOffsetID is an auto increment column. New IDs are created every time a row is inserted. This column is also the primary key for this table. The column also has a referential integrity relationship with tblHeptadSalt.HeptadOffsetID in the bridge table that connects the coiled-coil heptads with the corresponding salt bridges.

CoilID – This field is foreign keyed to tblCoiledCoil.CoilID. This field will tell us which heptads belong to which coiled-coils.

OffsetG, OffsetA, OffsetB, OffsetC, OffsetD, OffsetE, and OffsetF – These fields store the amino acid residues of the coiled-coils at the corresponding offsets.

HeptadStartLoc – This field refers to start location of the corresponding heptad in the coiled-coil.

HeptadEndLoc – This field refers to the end location of the corresponding heptad in the coiled-coil.

HeptadStartOff – This field refers to the starting offset of the heptad.

HeptadEndOff – This field refers to the ending offset of the heptad.

4.1.2.4 HEPTAD SALT TABLE

The table tblHeptadSalt is a bridge table between the tblSaltBridge and tblSplitHeptadCoils. It stores the salt bridge IDs and heptad offset IDs, where the salt bridge is present in the heptad offset for the given coiled-coil. The table does not store any salt bridges that overlap two heptads.

As of July 7th 2008, there are 549,440 heptad-to-salt-bridge records in this table. The table structure is listed in Figure 4-9 below.

Figure 4-20: Structure of Heptad Salt Bridge Table (tblHeptadSalt)


HeptadSaltID – The heptad salt id is an auto increment column. New ids are created every time a row is inserted. This column is also the primary key for this table.

HeptadOffsetID – This field is foreign keyed to the tblSplitHeptadCoils.HeptadOffsetID. This ID is used to indicate all the heptads which contain salt bridges.

SaltBridgeID – This field is foreign keyed to the tblSaltBridge.SaltBridgeID.

4.1.2.5 SCRAPE FILE TABLE

Finally, there is a table that drives the data scrape process (tblScrapeFile). It contains information about the location of the file on an FTP server, the last time the file was updated on the source, and the size of the file. This information is used to check whether the file has changed on the host and, if so, to retrieve it. Once the new version of the file has been retrieved, the file size and the modification date are updated so as to store the most current attributes of the file.

As of July 7th 2008, there are 3 entries in the tblScrapeFile in the database. The first entry is used to retrieve the SWISS-PROT file which gets updated on the source site monthly. The second entry is used to retrieve the TREMBL file. The final entry is used to retrieve the updates to the SWISS-PROT database. If the researchers would like to add more datasets, they simply need to add an entry to this table. The only factor they must take into account is the file format of the dataset. In order for the data in the new file to successfully load into the database, it must be of the same format as the SWISS-PROT file, a format in which protein sequences are commonly represented. The table structure is listed in Figure 4-10 below.

Figure 4-21: Structure of Scrape File table (tblScrapeFile)


ScrapeID – The ScrapeID is an auto increment column. New IDs are created every time a row is inserted. This column is also the primary key for this table.

FtpSiteUrl – This field stores the main URL of the FTP site that hosts the data. The Perl program has been designed specifically to get data from an FTP server, as most protein distribution files are large and are almost always accessible via FTP.

FtpDirPath – This field refers to the directory path of the file on the FTP server.

FtpFileName – This field is the actual file name retrieved from the FTP site.

LocalDirPath – This field refers to the directory path on the local machine where we plan to save the data obtained from the FTP site.

LastModDate – This field refers to the last time the file on the website was modified. If this date and the date on the file do not match, the Perl programs mark the file as changed and scrape the new version of the file.

SizeInBytes – This field refers to the file size in bytes as retrieved from the file during the most recent scrape. If this file size and the size of the file on the FTP site do not match, the Perl programs mark the file as changed and scrape the new version of the file.

PullWeeklyFlag – This field acts as a flag which tells the main scrape program whether or not to scrape a file every week. A Perl program in turn calls a stored procedure, which changes the status of the PullWeeklyFlag field based on the last time a file was scraped. The field has just two values ‘Y’ or ‘N’.

ScrapeFileFlag – This field indicates whether or not the scrape program requires a restart. Because the Perl programs are scraping huge files, it is entirely possible that the data transfer might end before the download is complete. If this happens while the researchers are trying to retrieve multiple files, this field determines which file was the last one successfully scraped. Every time the program starts, it looks at which files need to be retrieved and marks their ScrapeFileFlag field as ‘N’. Once the scrape of a particular file is completed, its ScrapeFileFlag field is marked as ‘Y’.
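The change check driven by LastModDate and SizeInBytes, together with the ScrapeFileFlag bookkeeping, can be sketched as follows. The ScrapeEntry class and both function names are hypothetical, and the remote date and size are passed in directly rather than fetched from an FTP server.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ScrapeEntry:
    """Hypothetical in-memory mirror of a few tblScrapeFile columns."""
    ftp_file_name: str
    last_mod_date: datetime
    size_in_bytes: int
    scrape_file_flag: str = "N"

def needs_download(entry, remote_mod_date, remote_size):
    """A file counts as changed if either its date or its size differs."""
    return (entry.last_mod_date != remote_mod_date
            or entry.size_in_bytes != remote_size)

def record_download(entry, remote_mod_date, remote_size):
    """Store the current attributes and mark the scrape as completed."""
    entry.last_mod_date = remote_mod_date
    entry.size_in_bytes = remote_size
    entry.scrape_file_flag = "Y"

entry = ScrapeEntry("example.dat", datetime(2008, 6, 1), 1000)
if needs_download(entry, datetime(2008, 7, 1), 1200):
    # The actual FTP retrieval would happen here.
    record_download(entry, datetime(2008, 7, 1), 1200)
```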

4.1.3 MATERIALIZED VIEWS

To provide faster access from the user interface, the database includes materialized views. Like standard views, materialized views are defined by select queries. A materialized view, however, takes a different approach: the query result is cached as a concrete table that may be updated from the original base tables from time to time. This enables much more efficient access, at the cost of some data being potentially out of date. It is most useful in data warehousing scenarios, where frequent queries of the actual base tables can be extremely expensive. In addition, because the view is manifested as a real table, anything that can be done to a real table can be done to the view, the most important being the ability to build indexes on any column, which enables drastic speedups in query time. In a normal view, it is typically only possible to exploit indexes on columns that come directly from (or have a mapping to) indexed columns in the base tables. MySQL 5.0, however, does not support materialized views. Hence, a new approach was designed to create tables that are automatically updated, just like a materialized view.

MySQL provides a “Create table as” (CTAS) syntax, which allows users to create tables using a select statement. This property was used to create a procedure that takes a select statement and a table name as inputs and creates a table on the fly. The procedure also assigns primary keys and any indexes that are specified. The details of the stored procedure that creates these materialized views are described in Appendix A. The updates to these materialized views are scheduled using the UNIX crontab. Each of these queries takes from 1.2 to 5 seconds to refresh, which keeps any down time minimal.
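The refresh idea can be sketched as below. This is not the stored procedure from Appendix A: it uses Python's sqlite3 module in place of MySQL, and the function name, the matview_CoilLengthCount table, and the reduced tblCoiledCoil schema are hypothetical.

```python
import sqlite3

def refresh_matview(conn, table_name, select_sql, index_columns=()):
    """Rebuild `table_name` from `select_sql`, then index the given columns.

    The names and the query are interpolated directly into the SQL, so they
    must come from trusted configuration, never from user input."""
    conn.execute(f"DROP TABLE IF EXISTS {table_name}")
    conn.execute(f"CREATE TABLE {table_name} AS {select_sql}")
    for column in index_columns:
        conn.execute(
            f"CREATE INDEX idx_{table_name}_{column} ON {table_name} ({column})"
        )

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL in this sketch
conn.execute(
    "CREATE TABLE tblCoiledCoil (CoilID INTEGER PRIMARY KEY, CoilLength INTEGER)"
)
conn.executemany(
    "INSERT INTO tblCoiledCoil (CoilLength) VALUES (?)", [(42,), (42,), (60,)]
)

SELECT_SQL = (
    "SELECT CoilLength, COUNT(*) AS CoilCount "
    "FROM tblCoiledCoil GROUP BY CoilLength"
)
refresh_matview(conn, "matview_CoilLengthCount", SELECT_SQL, ["CoilLength"])
```

Dropping and recreating the table keeps the refresh simple, at the cost of the table being briefly unavailable while it is rebuilt.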

The materialized views were created to provide answers to questions such as what roles different amino acid residues play in the hydrophobic core (a and d positions), or what the frequency of occurrence of pairs of residues in the coiled-coils is. Some of the important materialized views are described here in detail.

4.1.3.1 AMINO ACID OCCURRENCES

This materialized view provides the frequency of occurrence of a pair of amino acids at a certain heptad offset in a coiled-coil. This view, in turn, allows users to find out which amino acid pair occurs most frequently at a given heptad position and which occur less frequently. This table is based on a select query performed on the tblSplitHeptadCoils.

As of July 7th 2008, there are 2,010 distinct amino acid pair occurrences out of a total possible 2,800. The most commonly occurring pair of amino acids is ‘L-L’ at offsets a and d respectively, which occurs 57,854 times. The table structure is listed in Figure 4-11 below.

Figure 4-22: Materialized view of Amino Acid Occurrences (matview_AminoAcidOcurrences)

The columns specific to this materialized view (table) are described in detail here:

Amino Acid Pair – This column defines distinct pairs of amino acids which are found within the coiled-coils in the database.

Offset Location 1 – This field stores the heptad offset of the first amino acid for the specific coiled-coil.

Offset Location 2 – This field stores the heptad offset of the second amino acid for the amino acid pair we have found.

Offset Pair Occurrence – This field stores the number of occurrences of the amino acid pair in the different heptad offsets (a, b, c, d, e, f, and g).

4.1.3.2 COIL LENGTH VS. CLUSTER PER COIL

This materialized view provides data on the frequency of occurrence of coiled-coils of a certain length. It also provides the normalized value of the number of clusters occurring in coiled-coils of a given length. This view allows users to see how cluster distribution varies as coil length increases. The coils have been divided into seven subgroups by coiled-coil length:

1. Coiled-coils with coil length less than 50 amino acids2. Coiled-coils with coil length between 50 and 59 amino acids3. Coiled-coils with coil length between 60 and 69 amino acids4. Coiled-coils with coil length between 70 and 79 amino acids5. Coiled-coils with coil length between 80 and 89 amino acids6. Coiled-coils with coil length between 90 and 99 amino acids7. Coiled-coils with coil length greater than 100 amino acids

The table structure is listed in Figure 4-12 below.

Figure 4-23: Materialized view of Coiled-coil Length vs. the Cluster Count (matview_CoilClusterCount)

The columns specific to this materialized view (table) are described in detail here:

Coiled-coils By Length – This field splits coiled-coils into different categories by length.

Coiled-coil Count – This field provides information on the number of coiled-coils in each group once they have been split by length.



Stabilizing clusters per coil – This field stores a normalized value: the total number of stabilizing clusters divided by the total number of coiled-coils.

Stabilizing Clusters per coil in Coiled-coils with Clusters – This field stores a normalized value: the total number of stabilizing clusters divided by the total number of coiled-coils that actually contain stabilizing clusters.

Destabilizing clusters per coil – This field stores a normalized value: the total number of destabilizing clusters divided by the total number of coiled-coils.

Destabilizing Clusters per coil in Coiled-coils with Clusters – This field stores a normalized value: the total number of destabilizing clusters divided by the total number of coiled-coils which actually contain destabilizing clusters.
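The two normalizations can be sketched in Python as follows. This is only an illustration under stated assumptions: the group labels and function names are hypothetical, and a coil of exactly 100 residues is placed in the last group, which the text above does not specify.

```python
from collections import defaultdict


def length_group(length):
    """Map a coil length to one of the seven length groups."""
    if length < 50:
        return "<50"
    if length >= 100:  # assumption: exactly 100 falls in the last group
        return "100+"
    low = (length // 10) * 10
    return f"{low}-{low + 9}"


def summarize(coils):
    """coils: iterable of (coil_length, cluster_count) tuples.

    Returns {group: (coil_count, clusters_per_coil,
                     clusters_per_coil_among_coils_with_clusters)}.
    """
    groups = defaultdict(lambda: {"coils": 0, "clusters": 0, "with": 0})
    for length, clusters in coils:
        g = groups[length_group(length)]
        g["coils"] += 1
        g["clusters"] += clusters
        if clusters > 0:
            g["with"] += 1
    report = {}
    for name, g in groups.items():
        per_coil = g["clusters"] / g["coils"]
        per_clustered = g["clusters"] / g["with"] if g["with"] else 0.0
        report[name] = (g["coils"], per_coil, per_clustered)
    return report
```

The same function serves for stabilizing and destabilizing clusters; only the cluster counts passed in differ.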

These are a couple of the materialized views created to accelerate the execution of the searches. The details of their creation and the SQL statements used to create them are covered in Appendix B.



4.2 PERL CODE DESIGN

One of the most important and complex parts of the Stable Coil project was the process of loading the data from the source. The raw data for the scrape is obtained from the ExPASy (Expert Protein Analysis System)¹ server. The ExPASy database is an open source database developed to help researchers by providing the latest annotated protein sequences. The entire database is reposted monthly, while updates for proteins that have been sequenced as a result of various genome projects are added to the database weekly. The database can be downloaded at ftp://ftp.expasy.org/databases/swiss-prot/release. It is available in XML and DAT formats, compressed or uncompressed. The updates to the protein sequences are available at ftp://ftp.expasy.org/databases/swiss-prot/updates_compressed in the DAT format. This project uses the DAT format in order to keep the data type consistent across the entire process. The main database is currently 2.9 gigabytes in size. The weekly updates range from 30 to 40 megabytes.

There are four Perl programs which are used to retrieve source data from the website, parse the data using the Stable Coil Algorithm and load the data into the MySQL database. They are as follows:

1. StableCoil_Algorithm_setpullweekly_call.pl
2. StableCoil_Algorithm_scrape_parse_load.pl
3. StableCoil_Algorithm_saltresidues_heptadoffsets_extract.pl
4. Older_Files_Archiving_Removal.pl

The first program sets the PULLWEEKLYFLAG in the tblScrapeFile table, depending on how long it has been since the last scrape of the file. For the weekly updates the PULLWEEKLYFLAG is set every seven days, while for the entire database it is set every 180 days. The second program is the main program that actually scrapes and loads the data. It first checks whether the PULLWEEKLYFLAG has been set for the specified file. If it is set, the program compares the LASTMODDATE and SIZEINBYTES in tblScrapeFile against the modification date and the size of the file on the FTP site; if they differ, the file is scraped, and the LASTMODDATE and SIZEINBYTES for that file are updated. Once the file has been scraped, the sequences, their molecular weights, their create dates, and any other relevant information are extracted from the file. This information is then run through the Stable Coil Algorithm, which predicts the presence and the location of the coiled-coils in the protein sequences. If the protein sequence length is less than 42, the program ignores it, as the researchers are interested in coiled-coils with 42 or more amino acids.
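The scrape decision described above can be summarized in a short Python sketch. The project's programs are written in Perl; the function and parameter names here are hypothetical, and treating a change in either the modification date or the size as sufficient is an assumption.

```python
from datetime import date, timedelta


def pull_flag_due(last_scrape, today, is_full_database):
    """Decide whether PULLWEEKLYFLAG should be set for a file.

    Weekly update files become due after 7 days, the full database
    after 180 days.
    """
    interval = timedelta(days=180 if is_full_database else 7)
    return today - last_scrape >= interval


def needs_scrape(flag_set, stored_mod_date, stored_size,
                 remote_mod_date, remote_size):
    """Scrape only if the flag is set and the remote file has changed."""
    if not flag_set:
        return False
    # Assumption: a difference in either value triggers a scrape.
    return stored_mod_date != remote_mod_date or stored_size != remote_size
```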

Next, the program checks whether the coiled-coil already exists in the tblCoiledCoil table. If the coiled-coil does exist, the program retrieves the CoilID for that coiled-coil. If it does not exist in the table, the program inserts a new record into the table and retrieves its CoilID. Similarly, the program either inserts a new protein sequence into tblProtein or retrieves the existing ProteinID, depending on whether the protein already exists in the database. The next step is to determine the clusters in the coiled-coils using Perl’s pattern matching operators. The pattern matches are done as follows.

1 The ExPASy website is located at http://expasy.org




Cluster3: /(?:(?<=[^1])|(?<=^))111(?=[^1]|$)/g
Cluster4: /(?:(?<=[^1])|(?<=^))1111(?=[^1]|$)/g
Cluster5: /(?:(?<=[^1])|(?<=^))11111(?=[^1]|$)/g
Cluster6: /(?:(?<=[^1])|(?<=^))111111(?=[^1]|$)/g
Cluster6p: /1{7,}/g

De-cluster3: /(?:(?<=[^0])|(?<=^))000(?=[^0]|$)/g
De-cluster4: /(?:(?<=[^0])|(?<=^))0000(?=[^0]|$)/g
De-cluster5: /(?:(?<=[^0])|(?<=^))00000(?=[^0]|$)/g
De-cluster6: /(?:(?<=[^0])|(?<=^))000000(?=[^0]|$)/g
De-cluster6p: /0{7,}/g
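The effect of these patterns can be reproduced with a small Python sketch. It is not the project's Perl code: it uses negative lookarounds, `(?<!1)` and `(?!1)`, which accept the same runs as the lookbehind and lookahead alternations above, and the function name is hypothetical.

```python
import re


def count_runs(bits, symbol):
    """Count maximal runs of symbol of length 3, 4, 5, 6 and 7 or more.

    bits is a string of '1' and '0' characters; symbol is '1' for
    clusters or '0' for de-clusters.
    """
    counts = {"3": 0, "4": 0, "5": 0, "6": 0, "6p": 0}
    for n in (3, 4, 5, 6):
        # The run must not be preceded or followed by the same symbol,
        # so a run of four is never also counted as a run of three.
        pattern = rf"(?<!{symbol}){symbol}{{{n}}}(?!{symbol})"
        counts[str(n)] = len(re.findall(pattern, bits))
    # Runs of seven or more are matched greedily, one match per run.
    counts["6p"] = len(re.findall(rf"{symbol}{{7,}}", bits))
    return counts
```

For the string `1110111101111111000`, the sketch reports one cluster each of length 3, 4 and 7 or more, and one de-cluster of length 3.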

Once the pattern matches are complete, the program inserts the ProteinID and the CoilID in the bridge table. Thus only distinct proteins and distinct coiled-coils are stored in the database.
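The look-up-or-insert logic that keeps proteins and coiled-coils distinct can be sketched as follows. This is an illustrative Python example using an in-memory SQLite database in place of the project's MySQL database; the table names come from the text, while the column name Sequence and the function names are assumptions.

```python
import sqlite3


def make_db():
    """Create a minimal in-memory schema (column names are hypothetical)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE tblProtein (ProteinID INTEGER PRIMARY KEY,
                                 Sequence TEXT UNIQUE);
        CREATE TABLE tblCoiledCoil (CoilID INTEGER PRIMARY KEY,
                                    Sequence TEXT UNIQUE);
        CREATE TABLE tblProteinCoil (ProteinID INTEGER, CoilID INTEGER,
                                     PRIMARY KEY (ProteinID, CoilID));
    """)
    return conn


def get_or_create_id(conn, table, id_col, sequence):
    """Return the ID of an existing row, inserting the row if needed."""
    row = conn.execute(
        f"SELECT {id_col} FROM {table} WHERE Sequence = ?", (sequence,)
    ).fetchone()
    if row:
        return row[0]
    cur = conn.execute(
        f"INSERT INTO {table} (Sequence) VALUES (?)", (sequence,)
    )
    return cur.lastrowid


def load(conn, protein_seq, coil_seq):
    """Store one protein/coiled-coil relationship in the bridge table."""
    protein_id = get_or_create_id(conn, "tblProtein", "ProteinID", protein_seq)
    coil_id = get_or_create_id(conn, "tblCoiledCoil", "CoilID", coil_seq)
    conn.execute("INSERT OR IGNORE INTO tblProteinCoil VALUES (?, ?)",
                 (protein_id, coil_id))
    return protein_id, coil_id
```

Loading the same coiled-coil for two different proteins yields two protein rows, one coiled-coil row and two bridge rows.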

The third program, StableCoil_Algorithm_saltresidues_heptadoffsets_extract.pl, is used to retrieve salt residues and their heptad offsets from the coiled-coils. The results retrieved by this program provide insights into the relationship between the salt residues and the various heptads of the coiled-coils. There are three types of electrostatic interactions, i to i + 3, i to i + 4 and i to i’ + 5, which can be searched as i..i + 3, i…i + 4 and i….i + 5, where each dot stands for one intervening residue. The salt residues that the researchers are interested in are as follows.

Start Amino Acid   End Amino Acid   Interaction Type   Salt Bridges Type
(Interaction Type: A = attraction, R = repulsion)

K   E   A   Intrachain i to i + 3
K   E   A   Intrachain i to i + 4
K   E   A   Interchain i to i' + 5
E   K   A   Intrachain i to i + 3
E   K   A   Intrachain i to i + 4
E   K   A   Interchain i to i' + 5
K   D   A   Intrachain i to i + 3
K   D   A   Intrachain i to i + 4
K   D   A   Interchain i to i' + 5
D   K   A   Intrachain i to i + 3
D   K   A   Intrachain i to i + 4
D   K   A   Interchain i to i' + 5
R   E   A   Intrachain i to i + 3
R   E   A   Intrachain i to i + 4
R   E   A   Interchain i to i' + 5
E   R   A   Intrachain i to i + 3
E   R   A   Intrachain i to i + 4
E   R   A   Interchain i to i' + 5
R   D   A   Intrachain i to i + 3
R   D   A   Intrachain i to i + 4
R   D   A   Interchain i to i' + 5
D   R   A   Intrachain i to i + 3
D   R   A   Intrachain i to i + 4
D   R   A   Interchain i to i' + 5
K   K   R   Intrachain i to i + 3
K   K   R   Intrachain i to i + 4
K   K   R   Interchain i to i' + 5
K   R   R   Intrachain i to i + 3
K   R   R   Intrachain i to i + 4
K   R   R   Interchain i to i' + 5
R   R   R   Intrachain i to i + 3
R   R   R   Intrachain i to i + 4
R   R   R   Interchain i to i' + 5
R   K   R   Intrachain i to i + 3
R   K   R   Intrachain i to i + 4
R   K   R   Interchain i to i' + 5
E   E   R   Intrachain i to i + 3
E   E   R   Intrachain i to i + 4
E   E   R   Interchain i to i' + 5
E   D   R   Intrachain i to i + 3
E   D   R   Intrachain i to i + 4
E   D   R   Interchain i to i' + 5
D   E   R   Intrachain i to i + 3
D   E   R   Intrachain i to i + 4
D   E   R   Interchain i to i' + 5
D   D   R   Intrachain i to i + 3
D   D   R   Intrachain i to i + 4
D   D   R   Interchain i to i' + 5

Table 4-9: List of Salt Bridges which provide i to i + 3, i to i + 4 and i to i’ + 5 electrostatic interactions

These residues are determined by using regular expressions along with the Regexp::Exhaustive Perl module. This module finds all possible matches of a pattern, including overlapping matches and matches within matches. In addition to the occurrence of salt bridges, the third program also stores the positions of the salt bridges in tblSaltBridge. The heptad offsets are calculated by splitting the coiled-coils into seven residues at a time and storing the respective amino acids in the columns Offsetg, Offseta, Offsetb, Offsetc, Offsetd, Offsete and Offsetf of the table tblSplitHeptadCoils, depending upon each amino acid’s offset. If a salt bridge occurs within a heptad, the HeptadOffsetID and SaltBridgeID are added to the bridge table tblHeptadSalt. If the researchers are interested in finding any new salt bridges, they need only add new entries to the tblSaltResiduesLookup table.

Thus, using the heptad repeat designations abcdefg, it is possible to know the number of types of salt residues occurring in the coiled-coils and their distribution relative to their position. It is also possible to identify the frequency of occurrence of amino acids in certain heptad positions.
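As a rough illustration of the salt bridge search, the Python sketch below finds every charged residue pair at the three spacings, including overlapping candidates. It is not the project's Perl code and does not use Regexp::Exhaustive; a zero-width lookahead is used instead to obtain overlapping matches, and all names are hypothetical. The attraction and repulsion sets are taken from the table of salt residues above.

```python
import re

ATTRACTIONS = {"KE", "EK", "KD", "DK", "RE", "ER", "RD", "DR"}
SPACINGS = {3: "i to i + 3", 4: "i to i + 4", 5: "i to i' + 5"}


def find_salt_bridges(coil):
    """Return (position, pair, interaction, type) for each candidate.

    position is the zero-based index of the first residue; interaction
    is 'A' for attraction and 'R' for repulsion.
    """
    hits = []
    for gap, label in SPACINGS.items():
        # The lookahead consumes no characters, so candidates that
        # overlap one another are all reported.
        pattern = re.compile(rf"(?=([KRDE]).{{{gap - 1}}}([KRDE]))")
        for m in pattern.finditer(coil):
            pair = m.group(1) + m.group(2)
            kind = "A" if pair in ATTRACTIONS else "R"
            hits.append((m.start(), pair, kind, label))
    return hits
```

Every pair of charged residues from K, R, D and E that is not an attraction appears in the table as a repulsion, which is why the sketch needs only the attraction set.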



Each of the above programs uses the in-house Log and Time modules. The Log module creates log files which record every step of the program, and the Time module benchmarks the time taken for each step. The fourth program, Older_Files_Archiving_Removal.pl, is used to clean out the log files every 30 days, so as to prevent an inode failure on the server. The process flow is summarized by the flow chart in Figure 4-13 below.



Figure 4-24: Process Flow Diagram for the Stable Coil Algorithm

4.3 WEBSITE ARCHITECTURE

Another component of the project is the data-driven website, through which users access the information in the Stable Coil database. This front end was developed in PHP and uses Fusion Charts to create dynamic Flash charts driven by XML data; the MySQL database serves as the backend. The website was created based on search parameters provided by the researchers at UCHSC. As of this writing, the website is located at http://simbio.uchsc.edu/StableCoil/. There are four search pages, each of which provides options to export the data in HTML, Microsoft® Word or Microsoft® Excel formats. Each page also includes a help option describing what the page does. The search results are highlighted to provide easy visual recognition of the data. The searches are stored per user between different sessions using the $_SESSION variable in PHP. Also, each search page contains a link allowing users to reset the search criteria.

The users can search variations in proteins, coiled-coils and salt bridges, and can also run summary queries that describe certain anomalies or points of interest in the dataset. The summary queries are served by materialized views, which are based on SELECT statements and use the JOIN statement to combine two or more tables. The following sections describe the different pages of the website and the search parameters associated with these pages. The searches are based on the requirements provided by researchers at UCHSC.



4.3.1 INDEX PAGE

Figure 4-25: Index Page of the Stable Coil website

The index page describes what the Stable Coil project is about and gives users the necessary background on coiled-coils. It also provides a summary of the number of proteins, coiled-coils, and salt bridges in the database and displays the date on which the database was last updated. This webpage is intended to be a short introduction to the steps that the Stable Coil Algorithm performs.



4.3.2 PROTEIN SEARCH PAGE

Figure 4-26: Protein Related Search Page

This page searches both the tblProtein table and the tblProteinCoil table to retrieve information on the coiled-coils related to the proteins. It also provides the salt bridges associated with the proteins using tblSaltBridge. The results include the ExPASy protein name, the protein sequence, the organism in which the protein is found and the date the protein was annotated in the ExPASy database. Users should use the accession number, rather than the protein entry name, to compare the results to the ExPASy database. The following parameters determine which fields the users can search.



Description of the search fields for the Protein Related Search Page

#. Search Parameter Name – Explanation of the Search Parameter

1. Entry Name – Users can use the SWISS-PROT ExPASy Entry Name to search for a protein. As per the ExPASy web site, the ExPASy Entry Name is not unique across different releases of their protein database. If you are looking for a particular protein, the Accession Number is the best field to search.

2. Entry Data Class – The proteins are divided into two Entry Data Classes: Reviewed and Non-Reviewed. A protein is marked as reviewed if it has been checked by the analysts at ExPASy; otherwise it is marked as non-reviewed.

3. Accession Number – The Accession Number is a unique value used to distinguish between different proteins in the same release and also to distinguish proteins with the same ExPASy names in different releases.

4. Protein Name – When you enter the name of a protein, for example Myosin, the program returns the different Myosin proteins in the database, such as Myosin-1, Myosin-2, Myosin Va, Myosin Vb, etc. This is a case-insensitive search, i.e. Myosin, mYosin and myosin all return the same results.

5. Organism – This field indicates the type of organism in which the protein occurs. The underlying database stores the organism’s genus name as well as its common name, for example Homo sapiens (Human). This search is case insensitive as well.

6. Protein Sequence Length – A search on this field can be performed if you want to restrict your result set to proteins of a certain length. The values in these search fields must be valid integers.

7. Protein Sequence Create Date – A search on this field can be performed if you want to restrict your result set to proteins created in a certain time frame. The values in these fields must be in the format MM/DD/YYYY.

8. Sequence – This field allows you to search the protein sequence. Protein sequences are always represented in upper case and hence the search is case sensitive. For example, 'LKLL' will yield results while 'lkll' will not yield any results. You can also search for a sequence as 'L_LL', where the underscore represents any amino acid.

Table 4-10: Search Parameters used on the Protein Related Search Page
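The behaviour of the Sequence field, a case-sensitive match in which an underscore stands for any single amino acid, can be illustrated with a short Python sketch. The website itself is written in PHP against MySQL; the function names below are hypothetical and the translation to a regular expression is only one possible way to model the search.

```python
import re


def wildcard_to_regex(query):
    """Translate a query such as 'L_LL' into a compiled regular expression.

    Each underscore matches exactly one residue; every other character
    is matched literally and case-sensitively.
    """
    parts = ("." if ch == "_" else re.escape(ch) for ch in query)
    return re.compile("".join(parts))


def sequence_matches(query, sequence):
    """Return True if the query occurs anywhere in the sequence."""
    return wildcard_to_regex(query).search(sequence) is not None
```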



4.3.3 COILED-COIL SEARCH PAGE

Figure 4-27: Coiled-coil Related Search Page

This webpage searches through the coiled-coils which have been extracted from the proteins using the Stable Coil Algorithm. The search results include the coiled-coil sequence, the heptad offset at which the coiled-coil occurs in the protein, and the clusters that are found in this coiled-coil. In the results section, the 'a' and 'd' residues for every coiled-coil are highlighted in red. Every coiled-coil record contains links to the proteins and salt bridges related to that coiled-coil. The following parameters determine which fields the users can search.



Description of the search fields for the Coiled-coil Related Search Page

#. Search Parameter Name – Explanation of the Search Parameter

1. Coil Sequence – This field allows you to search the coiled-coil sequence. Coiled-coil sequences are always represented in upper case; hence the search is case sensitive. For example, 'LKLL' will yield results while 'lkll' will not yield any results. You can also search for a sequence as 'L_LL', where the underscore represents any amino acid.

2. Cluster – The clusters in the coiled-coils are determined by the presence or absence of the following hydrophobic amino acids: Phenylalanine (F), Isoleucine (I), Leucine (L), Methionine (M), Valine (V) and Tyrosine (Y) at the a and d positions. If any of these hydrophobic amino acids appears at an a or d position in the coiled-coil, the amino acid is represented by 1. If not, the value is 0. A run of three or more consecutive 1's or 0's in the coiled-coil is denoted a cluster or a de-cluster, respectively. If the first residue in the cluster maps to a 1, the cluster starts with a 0, so as to group all 1's together.

3. Coil Length – The coil length field allows you to restrict your search results to coiled-coils between certain lengths. The Stable Coil database only retrieves coiled-coils which contain 42 or more amino acids.

4. Offset – This field allows you to search for coiled-coils with certain starting heptad offsets. The heptad offsets are a, b, c, d, e, f, and g. This field is case sensitive, as offsets are always indicated in lower case.

5. Organism – This field indicates the type of organism in which the protein occurs. The underlying database stores the organism’s genus name as well as the common name, for example Homo sapiens (Human). This search is case insensitive as well.

6. Cluster 3, 4, 5, 6, 6 plus – As previously described, these fields provide information on the number of clusters in the coiled-coils. If a cluster has three consecutive 1’s, the cluster3 count is increased by one. If a cluster has four consecutive 1’s, the cluster4 count is increased by one, and so on.

7. De-cluster 3, 4, 5, 6, 6 plus – These fields provide information on the number of de-clusters in the coiled-coils. If a de-cluster has three consecutive 0’s, the de-cluster3 count is increased by one. If a de-cluster has four consecutive 0’s, the de-cluster4 count is increased by one, and so on.

Table 4-11: Search parameters used on the Coiled-coil Related Search Page
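The mapping of a and d residues to 1's and 0's described for the Cluster field can be sketched in Python as follows. The function name is hypothetical, and the special handling of the first residue mentioned in the Cluster description is deliberately not modelled here.

```python
HYDROPHOBIC = set("FILMVY")  # F, I, L, M, V and Y, as listed above
HEPTAD = "abcdefg"


def core_bits(coil, start_offset):
    """Return one '1' or '0' per a/d position of the coiled-coil.

    A position yields '1' if its residue is one of the listed
    hydrophobic amino acids and '0' otherwise. Residues at the other
    heptad offsets are skipped.
    """
    start = HEPTAD.index(start_offset)
    bits = []
    for i, residue in enumerate(coil):
        if HEPTAD[(start + i) % 7] in "ad":
            bits.append("1" if residue in HYDROPHOBIC else "0")
    return "".join(bits)
```

The resulting string is the kind of input on which runs of three or more identical symbols are counted as clusters or de-clusters.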

This page does not perform motif searching. For motif searching the users can go to the Coiled-coil Motif Searching page.



4.3.4 COILED-COIL MOTIF SEARCHING

Figure 4-28: Coiled-coil Motif Search web page

This webpage allows users to perform motif searches on coiled-coils. Given part of a coiled-coil sequence and a starting heptad offset, a motif search finds the matching sequences within the coiled-coils that start at the heptad offset provided. In other words, the program searches for the occurrence of certain amino acid residues at given heptad positions. The results include links to the related proteins and salt bridges, the starting coiled-coil offset, the clusters in the coiled-coil, and the total number of matches found. The matches are highlighted in yellow as a visual aid. This search can be slow, as the query needs to go through all the coiled-coils and count the matches. The following parameters determine which fields the users can search.

Description of the search fields for the Coiled-coil Motif Search Page

#. Search Parameter Name – Explanation of the Search Parameter

1. Coil Sequence – This field can be used to search for a string of amino acids occurring in the coiled-coil at a certain heptad offset location. You can also search for a sequence such as 'L__L' occurring at offset 'a' in the coiled-coil, where the underscore represents any amino acid. This field is case sensitive, as coiled-coil sequences are always indicated in upper case.

2. Coil Offsets – This field provides the start offset for the coil sequence match. To execute a query, both of these fields need to have valid inputs; otherwise, you may get an invalid input error.

Table 4-12: Search parameters used on the Coiled-coil Motif Search Page
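A motif search of this kind can be sketched in Python as shown below. This is an illustration only, not the website's PHP and SQL implementation; the function name and arguments are hypothetical.

```python
import re

HEPTAD = "abcdefg"


def motif_matches(coil, coil_start_offset, motif, motif_offset):
    """Return start indices where motif occurs at the given heptad offset.

    An underscore in the motif matches any single residue. A lookahead
    is used so that overlapping occurrences are all considered.
    """
    body = "".join("." if c == "_" else re.escape(c) for c in motif)
    pattern = re.compile("(?=(" + body + "))")
    start = HEPTAD.index(coil_start_offset)
    return [
        m.start()
        for m in pattern.finditer(coil)
        if HEPTAD[(start + m.start()) % 7] == motif_offset
    ]
```

For the coil `LAALAAALAALAAA` starting at offset a, the motif 'L__L' at offset a is found at indices 0 and 7, while the same motif at offset d yields no matches.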



4.3.5 COIL HEPTAD AND SALT BRIDGE SEARCH

Figure 4-29: Coiled Heptad and Salt Bridge Search Page

This webpage provides information on which salt residues occur in which heptad of the coiled-coil. The salt bridges in this table have been identified using the tblSaltResiduesLookup table. The results include the coil sequence, the coil starting offset, the heptad in which the salt bridge occurs, the type of interaction, the type of salt bridge and the location of the heptad in the coiled-coil. The a and d offsets in the coil sequence are highlighted in red, while the heptad offset is highlighted in yellow. The following parameters determine which fields the users can search.



Description of the search fields for the Coiled Heptad Salt Bridge Search Page

#. Search Parameter Name – Explanation of the Search Parameter

1. Coil Offset – This field allows you to search for coiled-coils with certain starting heptad offsets. The heptad offsets are a, b, c, d, e, f, and g. This field is case sensitive, as offsets are always indicated in lower case.

2. Offsetg, Offseta, Offsetb, Offsetc, Offsetd, Offsete, and Offsetf – These fields are used to search for heptads containing certain amino acids at the given offsets. You can search on the occurrence of an amino acid in a heptad.

3. Interaction Type – The Interaction Type determines whether the electrostatic interaction between the salt bridge residues is an Attraction or a Repulsion.

4. Salt Bridges Type – This field allows users to select between the three different types of salt bridges: i to i + 3, i to i + 4 and i to i’ + 5.

Table 4-13: Search parameters used on the Coil Heptad and Salt Bridge Search Page



4.3.6 GENERATED REPORTS

The reports are based on the data obtained from the Stable Coil Database. Most of the generated reports have materialized views as their backend. These materialized views are created using complex SQL statements, some of which join two or more tables. There are various factors affecting the stability of a coiled-coil; the generated reports try to highlight these factors, thus providing insight into the variations in the coiled-coil data. The reports perform similarly because the materialized views are stored as tables rather than as views. Hence, by creating indexes on these materialized views, we can make searches as fast as the hardware and the database permit. The sections below describe the generated reports in detail.
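The idea of a materialized view stored as an indexed table can be illustrated with the following Python sketch, which uses an in-memory SQLite database in place of MySQL. The actual SQL is given in Appendix B; the table name matview_pairs, its column names and the demo rows are hypothetical, while tblSplitHeptadCoils and its Offset columns are named in Section 4.2.

```python
import sqlite3


def make_demo_db():
    """Create tblSplitHeptadCoils with three made-up heptads (gabcdef)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tblSplitHeptadCoils (Offsetg TEXT, Offseta TEXT, "
        "Offsetb TEXT, Offsetc TEXT, Offsetd TEXT, Offsete TEXT, "
        "Offsetf TEXT)"
    )
    conn.executemany(
        "INSERT INTO tblSplitHeptadCoils VALUES (?, ?, ?, ?, ?, ?, ?)",
        [tuple("ELKKLEE"), tuple("KLEELKK"), tuple("EIAALRE")],
    )
    return conn


def build_matview(conn):
    """Materialize a-d pair counts as a real table and index it."""
    conn.executescript("""
        DROP TABLE IF EXISTS matview_pairs;
        CREATE TABLE matview_pairs AS
            SELECT Offseta || '-' || Offsetd AS AminoAcidPair,
                   'a' AS OffsetLocation1,
                   'd' AS OffsetLocation2,
                   COUNT(*) AS OffsetPairOccurrence
            FROM tblSplitHeptadCoils
            WHERE Offseta IS NOT NULL AND Offsetd IS NOT NULL
            GROUP BY Offseta, Offsetd;
        CREATE INDEX idx_matview_pairs ON matview_pairs (AminoAcidPair);
    """)


def top_pairs(conn):
    """Read the report rows from the materialized table."""
    return conn.execute(
        "SELECT * FROM matview_pairs ORDER BY OffsetPairOccurrence DESC"
    ).fetchall()
```

Because the aggregate is computed once when the table is built, a report query only reads the small pre-computed table.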

Amino Acid Pair Occurrences in Coiled-coils

This report provides the frequency of occurrence of pairs of amino acids in the coiled-coils and the heptad positions at which these amino acids occur. For this report the heptad is defined as gabcdef, so that i to i’ + 5 salt bridges, which generally occur at the g and e heptad offsets, can be easily identified. The researchers are interested only in amino acid pairs which involve the following heptad offsets:

1. The d-e pair of residues helps users identify the most frequently occurring amino acid adjacent to the hydrophobic core offset d.

2. The g-a pair of residues helps users identify the most frequently occurring amino acid adjacent to the hydrophobic core offset a. The results returned by the g-a and d-e searches aid in understanding what kind of amino acid residues (hydrophobic, hydrophilic or electrically charged) occur adjoining the hydrophobic core and how they affect the overall stability of the coiled-coil.

3. The a-d pair of residues accounts for the hydrophobic core of the coiled-coil. The amino acids occurring in these positions are usually hydrophobic. The results from this report help users understand which residues occur frequently in the hydrophobic core, thus explaining the significance of a particular amino acid to the stability of the coiled-coil through hydrophobic interactions.

4. The g-e pair of residues relates to i to i’ + 5 electrostatic interactions between coiled-coils.

5. The e-g pair of residues relates to i to i’ + 2 electrostatic interactions between coiled-coils. The difference between these residues and the residues described above is that they are not retrieved from the same heptad. The residue at offset e is retrieved from one heptad and the residue at offset g is retrieved from the following heptad.

The electrostatic interactions occur between two different α-helical coiled-coils due to the difference in the ionic charge on the amino acids involved in the α-helix. This is yet another factor that contributes to the stability of the coiled-coil. Hence, it is important to know the amino acids that frequently participate in these interactions.

These results can be filtered on amino acid pairs, the type of heptad and the number of amino acid pairs. The page also has page aggregate and overall aggregate values at the bottom of the page.



Amino Acid Occurrences in Salt Bridges

A salt bridge is an electrostatic interaction occurring between two different α-helical coiled-coils. The researchers at UCHSC are only interested in intrachain i to i + 3, intrachain i to i + 4 and interchain i to i’ + 5 interactions. The report generated provides the frequency of occurrence of the salt bridges (indicated in Table 4-1) and the residues that occur between the salt bridges. The results can be filtered based on the type of salt bridge, the salt bridge offset, and the offsets that occur within the salt bridge. With these results, the users can determine which salt bridges, be they attractions or repulsions, affect the coiled-coil stability.

Frequency of Occurrence of i to i+n Salt Bridges based on Amino Acids at Heptad Offset a/d

There are six reports under this category, one for each of the i to i + n electrostatic interactions (intrachain i to i + 3, intrachain i to i + 4 and interchain i to i’ + 5) based on amino acids at either heptad offset a or d. Each report provides information on the occurrence of a particular amino acid in a heptad and whether or not there is a salt bridge present in the same heptad. This information is critical to users who want to know the relationship between hydrophobic and electrostatic interactions. From the results, the users are able to interpret how many times a strong hydrophobic interaction occurs in the presence of a salt bridge.

Cluster Count vs. Coiled-coil Length

This report gives the number of clusters found in coiled-coils of certain lengths. The coiled-coils are divided into seven different groups depending on their lengths.

1. Coiled-coils with length less than 50 amino acids
2. Coiled-coils with length between 50 and 59 amino acids
3. Coiled-coils with length between 60 and 69 amino acids
4. Coiled-coils with length between 70 and 79 amino acids
5. Coiled-coils with length between 80 and 89 amino acids
6. Coiled-coils with length between 90 and 99 amino acids
7. Coiled-coils with length greater than 100 amino acids

The report provides information on the number of coiled-coils of a particular length, the number of stabilizing/destabilizing clusters in these coiled-coils and the number of stabilizing/destabilizing clusters in the coiled-coils of that length which actually have clusters. The stability of the coiled-coil is greatly affected by the strength of the hydrophobic interactions present in the coiled-coil. An increase in the number of stabilizing clusters in a coiled-coil is associated with an increase in the stability of the coiled-coil, and vice versa for destabilizing clusters. Hence, it is necessary to understand the distribution of clusters within the coiled-coils.

Destabilizing/Stabilizing Cluster Distribution in Coiled-coils

This report gives the number of destabilizing/stabilizing clusters found in coiled-coils of certain lengths. The coiled-coils are classified by length as follows.

1. Coiled-coils with length less than 50 amino acids
2. Coiled-coils with length between 50 and 59 amino acids
3. Coiled-coils with length between 60 and 69 amino acids
4. Coiled-coils with length between 70 and 79 amino acids
5. Coiled-coils with length between 80 and 89 amino acids
6. Coiled-coils with length between 90 and 99 amino acids
7. Coiled-coils with length greater than 100 amino acids



The destabilizing/stabilizing clusters are also classified by length as follows.

1. De-clusters/clusters of length 3
2. De-clusters/clusters of length 4
3. De-clusters/clusters of length 5
4. De-clusters/clusters of length 6
5. De-clusters/clusters of length 7 or more

The stability of the coiled-coil is affected by the strength of the hydrophobic interactions present in the coiled-coil. If a coiled-coil contains destabilizing clusters of larger lengths, the stability of the coiled-coil is adversely affected. Similarly, the number of stabilizing clusters of higher lengths in a coiled-coil greatly improves the stability of the coiled-coil. Hence, it is necessary to understand the distribution of de-clusters/clusters of each length, within a coiled-coil.

Occurrence of Amino Acid in Offset a/d with respect to the Location of that Occurrence in the Coiled-coil

These reports provide information on how many times amino acids occur in a coiled-coil at heptad offset a/d and at what position they occur in the coiled-coil. The position of the amino acids is divided into three different groups.

1. At the beginning of the coiled-coil
2. In the center of the coiled-coil
3. At the end of the coiled-coil

These reports aid in understanding whether the amino acids involved in the hydrophobic core of the coiled-coil are present closer to the N-terminus or the C-terminus or are buried deep in the coiled-coil.

Frequency of Occurrence of Amino Acids in Coiled-coils

This report provides information on the number of occurrences of amino acids in the coiled-coils. It helps researchers understand which residues occur most commonly in the coiled-coils. As hydrophobic amino acids are important to the stability of the coiled-coil, it has been hypothesized that frequently occurring amino acids would indeed be hydrophobic in nature.



Chapter 5

RESULTS

The major impetus for developing the Stable Coil Algorithm is to determine the presence of coiled-coils in proteins and to provide quantitative results that affirm the known factors affecting the stability of the coiled-coils. The reports based on the Stable Coil Database are intended to do exactly that. There are currently 87,368 proteins in the database, and 141,204 coiled-coils are found in these proteins. As mentioned earlier, the database is updated weekly by the Perl programs, and, as of this writing, the last update was performed on Monday, July 7th, 2008. Because the updates are automated, no manual intervention is needed. The generated reports present the results from the database in which the researchers are interested.

Amino Acid Pair   Offset Location 1   Offset Location 2   Offset Pair Occurrence
L-L               A                   D                   57,854
I-L               A                   D                   41,768
V-L               A                   D                   38,581
L-I               A                   D                   26,098
I-I               A                   D                   19,842
F-L               A                   D                   19,830
V-I               A                   D                   17,053
L-V               A                   D                   16,359
L-A               A                   D                   15,784
N-L               A                   D                   15,170

Table 5-14: Top 10 amino acid pairs occurring in heptad offsets a and d which form the hydrophobic core

Amino Acid Pair   Offset Location 1   Offset Location 2   Offset Pair Occurrence
L-L               D                   E                   33,044
L-E               D                   E                   26,198
L-A               D                   E                   23,592
L-K               D                   E                   21,898
L-S               D                   E                   20,607
L-Q               D                   E                   19,023
L-R               D                   E                   18,939
L-I               D                   E                   17,561
L-V               D                   E                   17,339
L-G               D                   E                   16,560

Table 5-15: Top 10 amino acid pairs occurring in heptad offsets d and e



Amino Acid Pair   Offset Location 1   Offset Location 2   Offset Pair Occurrence
L-L               G                   A                   25,747
A-L               G                   A                   19,562
E-L               G                   A                   17,572
L-V               G                   A                   16,772
L-I               G                   A                   16,400
S-L               G                   A                   16,150
K-L               G                   A                   14,507
I-L               G                   A                   14,463
V-L               G                   A                   13,885
A-V               G                   A                   13,752

Table 5-16: Top 10 amino acid pairs occurring in heptad offsets g and a

Amino Acid Pair   Offset Location 1   Offset Location 2   Offset Pair Occurrence
L-L               G                   E                   13,416
A-L               G                   E                   9,608
L-A               G                   E                   9,320
E-K               G                   E                   8,389
S-L               G                   E                   8,291
L-S               G                   E                   8,207
I-L               G                   E                   8,075
L-I               G                   E                   7,841
A-A               G                   E                   7,715
L-G               G                   E                   7,285

Table 5-17: Top 10 amino acid pairs occurring in heptad offsets g and e usually associated with electrostatic attraction i to i’ + 5

Amino Acid Pair   Offset Location 1   Offset Location 2   Offset Pair Occurrence
L-L               E                   G                   15,394
A-L               E                   G                   10,970
L-A               E                   G                   10,514
A-A               E                   G                   9,821
S-L               E                   G                   9,508
E-E               E                   G                   9,420
L-I               E                   G                   9,063
L-S               E                   G                   8,859
I-L               E                   G                   8,854
L-V               E                   G                   8,173

Table 5-18: Top 10 amino acid pairs occurring in heptad offsets e and g usually associated with electrostatic attraction i to i’ + 2

The results in Tables 5-14 to 5-18 show whether an amino acid pair occurs more frequently than others at certain heptad offsets. The heptad offset pairs that the researchers are interested in are d-e, g-a, a-d, g-e (i to i’ + 5 interaction) and e-g (i to i’ + 2 interaction).

The a-d pair of residues accounts for the hydrophobic core of the coiled-coil. The amino acids occurring in these positions are usually hydrophobic. The results from this report help users understand which residues occur frequently in the hydrophobic core, and thus the significance of a particular amino acid to the stability of a coiled-coil through hydrophobic interactions. It comes as no surprise that the Leu-Leu pair occurs most frequently in coiled-coils at heptad positions a and d respectively, as Leu is a hydrophobic amino acid. This is congruent with the findings in [18, 10].

The g-e pair of residues relates to i to i’ + 5 electrostatic interactions, while the e-g pair of residues relates to i to i’ + 2 electrostatic interactions between coiled-coils. The difference between the e-g residues and the residues described above is that they are not retrieved from the same heptad: the residue at offset e is retrieved from one heptad and the residue at offset g is retrieved from the following heptad. For example, consider the coiled-coil sequence ALLDKTTREEKTRE starting at heptad offset g. The first seven residues (ALLDKTT, offsets g through f) form one heptad and the last seven (REEKTRE) form the next. The residue at offset e of the first heptad and the residue at offset g of the second heptad occupy the positions of the e-g (i to i’ + 2) electrostatic interaction.

The electrostatic interactions occur between two different α-helical coiled-coils due to the ionic charge on the amino acids involved in the α-helix. Although a single electrostatic interaction plays a small part in the stability of the coiled-coil compared to the hydrophobic interactions, many electrostatic interactions together can add up to a substantial effect. The results indicate that the most frequently occurring electrostatic attraction in heptad positions g and e is the attraction between Glu and Lys, where Glu is negatively charged and Lys is positively charged. The most commonly occurring repulsion in heptad positions e and g is Glu-Glu.

The d-e pair of residues helps users identify the most frequently occurring amino acid adjacent to the hydrophobic core offset d, while the g-a pair of residues helps users identify the most frequently occurring amino acid adjacent to the hydrophobic core offset a. But there is also a more specific reason for identifying the amino acid residues at these offsets. Residues at positions e and g may form interhelical ion pairs that increase stability, but their contribution to stability is an order of magnitude less than the hydrophobic core. Thus, the results returned by the g-a and d-e searches aid in understanding what kinds of amino acid residues (hydrophobic, hydrophilic or electrically charged) aid in electrostatic interactions, what kinds of residues occur adjoining the hydrophobic core, and what effects these residues have on the overall stability of the coiled-coil. Our results indicate that the majority of residues occurring in offset d for the d-e pair are Leu, while the electrically charged residues Lys, Glu, and Arg appear in the top ten occurrences for residue e. As expected, Leu occurs frequently in heptad offset a in the g-a pair, while the top ten occurrences of offset g contain Lys and Glu.
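The offset pairs discussed above can be illustrated with a short sketch. It is not the project's Perl or SQL code; it is a minimal Python illustration that assumes the gabcdef heptad ordering used in this report and reads both residues of a pair off a single sequence. The names `assign_offsets`, `offset_pairs`, `tally_pairs`, and `PAIR_GAPS` are hypothetical.

```python
from collections import Counter

HEPTAD = "gabcdef"  # heptad ordering assumed from this report (gabcdef)

# Number of residues from the first offset of a pair to the second one.
PAIR_GAPS = {
    ("a", "d"): 3,
    ("d", "e"): 1,
    ("g", "a"): 1,
    ("g", "e"): 5,  # i to i' + 5
    ("e", "g"): 2,  # i to i' + 2, reaches into the following heptad
}


def assign_offsets(sequence, start_offset):
    """Return one heptad offset letter per residue, cycling through gabcdef."""
    first = HEPTAD.index(start_offset)
    return "".join(HEPTAD[(first + i) % 7] for i in range(len(sequence)))


def offset_pairs(sequence, start_offset, first, second):
    """List the residue pairs found at the given pair of heptad offsets."""
    gap = PAIR_GAPS[(first, second)]
    offsets = assign_offsets(sequence, start_offset)
    pairs = []
    for i, offset in enumerate(offsets):
        j = i + gap
        if offset == first and j < len(sequence):
            pairs.append(f"{sequence[i]}-{sequence[j]}")
    return pairs


def tally_pairs(coils, first, second):
    """Count pair occurrences over an iterable of (sequence, start_offset)."""
    counts = Counter()
    for sequence, start_offset in coils:
        counts.update(offset_pairs(sequence, start_offset, first, second))
    return counts
```

For the example sequence ALLDKTTREEKTRE starting at offset g, `offset_pairs(seq, "g", "e", "g")` pairs the e residue of the first heptad with the g residue of the second, while the e residue of the second heptad has no following g and is skipped.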

Amino Acid Pair   Offset Location 1   Offset Location 2   Offset Pair Occurrence
L-L               A                   D                   57,854
I-L               A                   D                   41,768
V-L               A                   D                   38,581
L-L               D                   E                   33,044
L-E               D                   E                   26,198
L-I               A                   D                   26,098
L-L               G                   A                   25,747
L-A               D                   E                   23,592
L-K               D                   E                   21,898
L-S               D                   E                   20,607
I-I               A                   D                   19,842
F-L               A                   D                   19,830
A-L               G                   A                   19,562
L-Q               D                   E                   19,023
L-R               D                   E                   18,939
E-L               G                   A                   17,572
L-I               D                   E                   17,561
L-V               D                   E                   17,339
V-I               A                   D                   17,053
L-V               G                   A                   16,772
L-G               D                   E                   16,560
L-I               G                   A                   16,400
L-V               A                   D                   16,359
S-L               G                   A                   16,150
I-L               D                   E                   15,853
L-A               A                   D                   15,784
L-T               D                   E                   15,461
L-L               E                   G                   15,394
N-L               A                   D                   15,170
K-L               G                   A                   14,507

Table 5-19: Top 30 Amino Acid Pair Occurrences in Coiled-coils

These results, taken together without considering the heptad offsets, are not very useful on their own; however, they still help in understanding which amino acid residues commonly occur in coiled-coils. The top 30 occurrences of amino acid pairs in coiled-coils are listed in Table 5-19 above. Of all the heptad offsets that interest researchers, Leu-Leu is the most commonly occurring pair of amino acid residues. It is followed by the Ile-Leu pair in the a and d offsets, because Ile, like Leu, is a hydrophobic residue, and hydrophobic residues are commonly found in offsets a and d, the hydrophobic core.

Amino Acid Residue   Type of Heptad Offset   Total Count of Amino Acids by Heptad Offset Location
L                    d                       299,843
L                    a                       224,544
I                    a                       149,708
V                    a                       143,650
I                    d                       135,597
L                    g                       125,036
L                    e                       124,545
L                    c                       114,713
L                    b                       114,082
L                    f                       113,558
A                    f                       95,516
A                    b                       95,157
A                    c                       91,935
A                    g                       91,557
A                    e                       86,254
S                    f                       78,521
F                    a                       78,473
S                    b                       78,158
S                    c                       78,117
E                    c                       78,057
E                    b                       78,032
E                    f                       77,524
V                    d                       77,177
E                    g                       76,967
K                    f                       76,122
E                    e                       75,353
K                    b                       73,966
I                    g                       73,366
I                    e                       73,261
S                    e                       72,660

Table 5-20: Top 30 frequently occurring amino acids in the Stable Coil Database

The number of times a residue occurs individually in a coiled-coil is shown in the ‘Frequency of Occurrence of Amino Acids in Coiled-coils’ report. This report gives results similar to those in Tables 5-14 to 5-19. The results are shown in Table 5-20. They show that Leu is the dominant amino acid in coiled-coils in heptad offsets a and d, followed by the other hydrophobic amino acids Ile, Val, and Phe. The biggest difference between the results in [17, 18, 11] and the results seen here is that these references rank Met as the third most frequently occurring amino acid, while our results put Met among the least frequently occurring hydrophobic amino acids in the a and d positions, even though Met has relatively high stability values associated with it (Table 3.1). This indicates that although most amino acids appear in a coiled-coil sequence in accordance with their associated stability values, there are some that act as exceptions to the rule.

Figure 5-30: Coiled-coils Count vs. Coiled-coil Length

The project also illuminates the relationship between a coiled-coil and its length. A coiled-coil’s length is the number of amino acids in the coiled-coil. In this project we are interested in retrieving coiled-coils that contain 42 or more residues. From the results in Figure 5-30 it can be inferred that coiled-coils of length 60 or more are fairly uncommon. There is unfavorable entropy associated with chain length extension, which is not overcome by the increase in hydrophobic interactions that accompanies the increase in chain length, even if the heptad contains the most stabilizing hydrophobic residue (Leu) at position d and stabilizing ionic attractions. Thus, our findings are congruent with recent experiments [27] performed to study the effects of chain length on larger coiled-coils.

Figure 5-31: Location of occurrence of amino acid within the coiled-coil when the amino acid is at heptad offset a.

Figure 5-32: Location of occurrence of amino acid within the coiled-coil when the amino acid is at heptad offset d.

The researchers were also interested in knowing where amino acids occur within the sequence, that is, whether an amino acid occurs at the beginning, in the center, or at the end of the coiled-coil. Figures 5-31 and 5-32 above show the results of analyzing the amino acids by their location. The location of an amino acid within a coiled-coil is divided into three types:

1. At the start of the coiled-coil: An amino acid is said to be at the start of the coiled-coil if it appears anywhere between the start of the coiled-coil sequence and ⅓ length of the coiled-coil.

2. At the center of the coiled-coil: An amino acid is said to be at the center of the coiled-coil if it appears anywhere between the ⅓ length of the coiled-coil and ⅔ length of the coiled-coil.

3. Near the end of the coiled-coil: An amino acid is said to be near the end of the coiled-coil if it appears anywhere between the ⅔ length of the coiled-coil and the end of the coiled-coil sequence.
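The three location types can be sketched as a small function. This is an illustrative Python version, not the project's code; it assumes 1-based residue positions and places the boundaries at the rounded ⅓ and ⅔ points of the coil length, as the location query in Appendix B does. The function name `locate_in_coil` is hypothetical.

```python
def locate_in_coil(position, coil_length):
    """Classify a 1-based residue position as start, center, or end of a coil."""
    if not 1 <= position <= coil_length:
        raise ValueError("position must lie within the coiled-coil")
    # For an integer length, length / 3 and length * 2 / 3 never end in .5,
    # so Python's round() and a round-half-up rule give the same cut points.
    first_cut = round(coil_length / 3)       # first residue of the center third
    second_cut = round(coil_length * 2 / 3)  # first residue of the end third
    if position < first_cut:
        return "At the start of the coiled-coil"
    if position < second_cut:
        return "At the center of the coiled-coil"
    return "Near the end of the coiled-coil"
```

For a coiled-coil of 42 residues, positions 1 to 13 fall at the start, 14 to 27 at the center, and 28 to 42 near the end.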

These reports aid in understanding whether the amino acids involved in the hydrophobic core of the coiled-coil are present closer to the N-terminus or the C-terminus or are buried deep in the coiled-coil. As seen in the results above, Leu is the most common amino acid at every location, which is congruent with our earlier findings. Leu occurs 80,076 times near the end of the coiled-coil, 76,146 times at the center of the coiled-coil, and 68,322 times at the start of the coiled-coil. Also, the top 10 occurrences are all hydrophobic residues, and these amino acids are distributed fairly evenly among the start, center, and end locations. From these results it can be interpreted that hydrophobic residues are necessary for the formation and stability of the coiled-coil. These results also agree with those provided in [20], which state that Leu is most frequently found in the core of coiled-coils.

Figure 5-33: Normalized Value of Destabilizing Clusters in Coiled-Coils of particular length. Results obtained by dividing the total number of coiled-coils with de-clusters by the total number of de-clusters.

Figure 5-34: Normalized Value of Stabilizing Clusters in Coiled-Coils of particular length. Results obtained by dividing the total number of coiled-coils with clusters by the total number of clusters.

Clusters offer further insights into the stability of coiled-coils. For this part of the analysis, a convention of 0’s and 1’s is used to signify the hydrophobic amino acids (Phe, Leu, Ile, Met, Val and Thr) and the non-hydrophobic amino acids, respectively. The stability of the coiled-coil is affected by the strength of the hydrophobic interactions present in the coiled-coil. If a coiled-coil contains long destabilizing clusters, its stability is adversely affected. Similarly, the presence of long stabilizing clusters greatly improves the stability of the coiled-coil. Hence, it is necessary to understand the distribution of de-clusters and clusters of each length within a coiled-coil. The project provides various summary queries based on clusters in coiled-coils. Figures 5-33 and 5-34 display the distribution of destabilizing and stabilizing clusters across coiled-coils of varying lengths. As shown in these charts, the number of destabilizing clusters remains almost constant regardless of the length of the coiled-coil, while the number of stabilizing clusters increases linearly as the coil length increases. As mentioned earlier, there is unfavorable entropy associated with increasing chain length; hence, to keep the coiled-coils stable, we see an increase in the number of stabilizing clusters.
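The 0/1 convention can be sketched as follows. This is a minimal Python illustration rather than the project's implementation. It assumes that the input is the ordered string of hydrophobic core (a and d) residues of one coiled-coil and that a cluster is a run of at least three consecutive core positions of the same class; the threshold of three is taken from the shortest cluster length reported in this chapter. The names `encode_core` and `find_clusters` are hypothetical.

```python
from itertools import groupby

# Hydrophobic residues as listed in the text: Phe, Leu, Ile, Met, Val, Thr.
HYDROPHOBIC = set("FLIMVT")


def encode_core(core_residues):
    """Encode core residues as '0' (hydrophobic) or '1' (non-hydrophobic)."""
    return "".join("0" if r in HYDROPHOBIC else "1" for r in core_residues)


def find_clusters(core_residues, min_length=3):
    """Return (kind, length) for each run of at least min_length equal symbols."""
    clusters = []
    for symbol, run in groupby(encode_core(core_residues)):
        length = len(list(run))
        if length >= min_length:
            kind = "stabilizing" if symbol == "0" else "destabilizing"
            clusters.append((kind, length))
    return clusters
```

For the core string LLIVAKNSLL, the encoding is 0000111100, which contains one stabilizing cluster of length 4 and one destabilizing cluster of length 4; the trailing run of two hydrophobic residues is too short to count.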

Figure 5-35: Distribution of Destabilizing Cluster of Length 3 with respect to the Coiled-Coil length

Figure 5-39: Distribution of Destabilizing Cluster of Length 7+ with respect to the Coiled-Coil length

Figure 5-40: Distribution of Stabilizing Cluster of Length 3 with respect to the Coiled-Coil length


Figure 5-44: Distribution of Stabilizing Cluster of Length 7+ with respect to the Coiled-Coil length

The results shown in Figures 5-35 through 5-44 indicate that as the coil length increases, the number of clusters of any length in the coiled-coil decreases. The count of clusters in coiled-coils of varying length gives an appreciation of the differences found. Also, as we look at clusters of increasing length, the number of clusters found in the coiled-coils decreases. The database is dominated by clusters of length 3 and 4. It can be seen that there are very few destabilizing clusters compared to stabilizing clusters, which indicates that stabilizing clusters are a necessary and vital ingredient in the stability of the coiled-coil. The more stabilizing clusters a coiled-coil has, the more stable it is. On the other hand, an increase in the length of destabilizing clusters has very little or no effect on the stability of the coiled-coil. Destabilizing clusters of length 3 or 4 are the most destabilizing, and the addition of more residues to the cluster does not change the stability of the coiled-coil any further. Coiled-coils thus tend to have fewer destabilizing clusters, as hydrophobic amino acids predominantly occupy the hydrophobic core. The findings from our database with regard to clusters are congruent with the experimental analysis done by Kwok and Hodges [27].

Type of Electrostatic Interaction   Attraction/Repulsion   Salt Bridge   Salt Bridge Start Offset   Total Number of Salt Bridges in Coiled-Coils
Intrachain i to i + 3               Attraction             E..K          b                          8,688
Intrachain i to i + 4               Attraction             E...K         b                          8,663
Intrachain i to i + 4               Attraction             K...E         f                          8,462
Interchain i to i' + 5              Attraction             E....K        g                          8,430
Intrachain i to i + 3               Attraction             E..K          f                          8,338
Intrachain i to i + 3               Attraction             E..K          c                          8,216
Intrachain i to i + 4               Attraction             E...K         e                          8,211
Intrachain i to i + 3               Attraction             K..E          f                          8,128
Intrachain i to i + 3               Attraction             K..E          g                          8,113
Intrachain i to i + 4               Attraction             E...K         c                          8,060

Table 5-21: Top 10 electrostatic attractions present in the coiled-coils

Type of Electrostatic Interaction   Attraction/Repulsion   Salt Bridge   Salt Bridge Start Offset   Total Number of Salt Bridges in Coiled-Coils
Intrachain i to i + 3               Repulsion              E..E          b                          7,926
Intrachain i to i + 3               Repulsion              E..E          c                          7,797
Intrachain i to i + 3               Repulsion              E..E          g                          7,725
Intrachain i to i + 3               Repulsion              E..E          f                          7,648
Intrachain i to i + 4               Repulsion              E...E         b                          7,600
Intrachain i to i + 4               Repulsion              E...E         e                          7,448
Intrachain i to i + 4               Repulsion              E...E         f                          7,030
Intrachain i to i + 4               Repulsion              E...E         c                          6,904
Intrachain i to i + 3               Repulsion              K..K          f                          6,772
Intrachain i to i + 3               Repulsion              K..K          c                          6,632

Table 5-22: Top 10 electrostatic repulsions present in the coiled-coils

Figure 5-45: Relationship of amino acids in offset a to an i to i + 3 salt bridge

Figure 5-46: Relationship of amino acids in offset a to an i to i + 4 salt bridge

Figure 5-47: Relationship of amino acids in offset a to an i to i’ + 5 salt bridge

Figure 5-48: Relationship of amino acids in offset d to an i to i + 3 salt bridge

Figure 5-49: Relationship of amino acids in offset d to an i to i + 4 salt bridge

Figure 5-50: Relationship of amino acids in offset d to an i to i’ + 5 salt bridge

The other part of the project revolves around the determination of salt bridges and the heptads in which these salt bridges occur. Electrostatic interactions in proteins are extremely complex because they can exist on the fully exposed surface of the protein, fully buried in the interior of the protein, or in a partially buried environment with varying degrees of hydrophobicity of the surrounding residues. In addition, buffer conditions can have dramatic effects on the contributions of the ion pairs to stability. Hence it is important to find the salt bridges in the coiled-coils and to determine their relationship to the coiled-coils and coiled-coil heptads. Tables 5-21 and 5-22 show the top ten electrostatic attractions and repulsions. Lys-Glu is the most commonly occurring attraction, while Glu-Glu occurs most frequently in repulsions. The reason is that Glu carries a negative charge while Lys carries a positive charge. A Lys-Glu attraction contributes 0.4 kcal/mol of stability, while a Glu-Glu repulsion destabilizes the coiled-coil by about 0.45 kcal/mol. The balance between the attractions and repulsions is critical for specifying coiled-coil dimerization in terms of parallel vs. anti-parallel orientation. This is also because coiled-coils are found to be more stable at lower pH than at neutral pH [21], and Glu is much less stabilizing for the coiled-coil when it is involved in an electrostatic repulsion. It is also interesting to note that Pro never occurs in salt bridges, as Pro is non-polar and highly destabilizing to the coiled-coil.

The charts in Figures 5-45 to 5-50 suggest the relationships between amino acids in the hydrophobic core of coiled-coil heptads and salt bridges. As can be seen, the occurrence of Leu in and around salt bridges far outweighs the occurrence of any other amino acid. This was expected, as Leu is the most commonly occurring residue in the coiled-coil database. Also, some residues occur more frequently in offset a than in offset d, or vice versa, when they are in the presence of a salt bridge. For example, Ala occurs more frequently in offset d than in offset a when in the presence of a salt bridge. This is because there is a symbiotic relationship between electrostatic interactions and hydrophobic interactions in coiled-coils. Hence it can be said that certain hydrophobic residues form better associations with coiled-coils when they are in one position of the hydrophobic core than in the other. These interactions do not show any reduction in the stability of the coiled-coil even when they come in contact with polar solutions.
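The salt bridge notation used in Tables 5-21 and 5-22 (for example E..K for an i to i + 3 pair) maps naturally onto a regular expression search. The sketch below is a minimal Python illustration, not the project's search code. It assumes that only Glu (E) and Lys (K) are considered, that unlike charges form an attraction and like charges a repulsion, and that the i to i’ + 5 pair is read off a single sequence; it does not restrict matches to particular heptad offsets. The names `SPACING` and `find_salt_bridges` are hypothetical.

```python
import re

# Number of residues lying between the two charged residues of a pair.
SPACING = {
    "i to i + 3": 2,   # e.g. E..K
    "i to i + 4": 3,   # e.g. E...K
    "i to i' + 5": 4,  # e.g. E....K
}


def find_salt_bridges(sequence):
    """Return (position, notation, spacing label, interaction) tuples.

    Positions are 1-based and refer to the first residue of the pair.
    """
    results = []
    for label, between in SPACING.items():
        # A lookahead keeps matches zero-width, so overlapping pairs are found.
        pattern = re.compile(r"(?=([EK]).{%d}([EK]))" % between)
        for match in pattern.finditer(sequence):
            first, second = match.group(1), match.group(2)
            interaction = "Attraction" if first != second else "Repulsion"
            notation = first + "." * between + second
            results.append((match.start() + 1, notation, label, interaction))
    return results
```

For the sequence EAAKAAAE, the search reports an E..K attraction starting at position 1 and a K...E attraction starting at position 4.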

CHAPTER 6

CONCLUSION

Proteomics research has increased over the last couple of years due to the appearance of new viruses like SARS and due to an increase in the understanding of older viruses like HIV, the virus that causes AIDS. Twenty-five years after it was first identified, there is still no cure for Acquired Immune Deficiency Syndrome. There is also much more to understand about older proteins like Tropomyosin, Myosin, and Kinesin with regard to their functions and structures. Hence, it is necessary to create better prediction algorithms to save time and resources. This project is a humble, small step in this regard. It was written to help researchers at UCHSC and elsewhere study the coiled-coil domain and understand the reasons behind the stability of coiled-coils. The program currently provides information on hydrophobic and electrostatic interactions and provides outputs in three different formats so that the data is easily accessed and exported. In addition, the cluster theory proposed in [24] is explored to determine which clusters occur more frequently in coiled-coils.

The Stable Coil Algorithm could be improved to find coiled-coils of smaller lengths and might also use a dynamic window size. The researchers are also interested in finding other electrostatic interactions, such as i to i’ + 2 attractions and repulsions. On the software side, the timing of the searches could be improved by moving the database from MySQL to Oracle. Oracle has highly developed stored procedure routines and analytic functions like LAG and LEAD, which would be expected to reduce the time of the searches by almost half. Oracle, however, is an expensive option in comparison to MySQL, which is free.

This project would not have been possible without the direct support and guidance of Dr. Robert Hodges and Paul Kirwan at the University of Colorado Health Sciences Center and Dr. Jugal Kalita at the University of Colorado at Colorado Springs. Their help in testing and analyzing the results helped make this project what it is today.

CHAPTER 7

REFERENCES

[1] “Amino Acid,” Wikipedia. (http://en.wikipedia.org/wiki/Amino_acid)

[2] “Beta Sheets,” Wikipedia. (http://en.wikipedia.org/wiki/Beta_sheets)

[3] Brinkmann, D., Nandoor, S., Kalita, J., Tripet, B., and Hodges, R.S., “CoCoLysis: A Web-Accessible Coiled Coil Protein Database with Analysis Tools,” Symposium on Bioinformatics and Biotechnology (BIOT-04), pp. 73-76, September 2004.

[4] Nandoor, S., Kalita, J., Tripet, B., and Hodges, R.S., “Cocolysis: Coiled-coil Database,” Symposium on Bioinformatics and Biotechnology (BIOT-05), pp. 25-28, October 2005.

[5] Tripet, B., “Coiled-coil presentation, University of Colorado Health Sciences Center,” 2003.

[6] Bornberg-Bauer, E., Rivals, E., Vingron, M., “Computational approaches to identify leucine zippers,” Nucleic Acids Research, vol. 26, no. 11, pp. 2740-2746, 1998.

[7] Hodges, R. S., “De novo design of α-helical proteins: basic research to medical applications,” Biochem. Cell Bio., vol. 74, pp. 133-154, 1995.

[8] Sander, C., Kabsch, W., “Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features,” Biopolymers, vol. 22, pp. 2577-2637, 1983.

[9] Tripet, B., Wagschal, K., Lavigne, P., Mant, C., Hodges, R., “Effects of Side Chain Characteristics on Stability and Oligomerization State of a de Novo designed Model Coiled-coil: 20 Amino Acid Substitutions in Position d,” Journal of Molecular Biology, vol. 300, pp. 377-402, 2000.

[10] “ExPASy (Expert Protein Analysis System) proteomics server,” Swiss Institute of Bioinformatics (SIB).

[11] Hinz, H.-J., Steif, C., Vogl, T., Meyer, R., Renner, M., Ledermüller, R., “Fundamentals of protein stability,” Pure & Applied Chem., vol. 65, no. 5, pp. 947-952, 1993.

[12] Burkhard P., Ivaninskii, C. and Lustig, “Improving Coiled-coil Stability by Optimizing Ionic Interactions,” Journal of Molecular Biology, vol. 318, pp. 901-910, 2002.

[13] Ontario Centre for Genomic Computing.

[14] Crick, F. H. C., “The packing of α-helices - simple coiled-coils,” Acta Crystallogr., vol. 6, pp. 689-697, 1953.

[15] Berger, B., Wilson, D.B., Wolf, E., Tonchev, T., Milla, M., and Kim, P.S., “Predicting Coiled-coils by Use of Pairwise Residue Correlations,” Proceedings of the National Academy of Science USA, vol. 92, pp. 8259-8263, 1995.

[16] “Proteomics,” Wikipedia. (http://en.wikipedia.org/wiki/Proteomics)

[17] Wagschal, K., Lavigne, P., Mant, C., Hodges, R., “The role of position a in determining the stability and oligomerization state of alpha-helical coiled-coils: 20 amino acid stability coefficients in the hydrophobic core of proteins,” Protein Science, vol. 8, pp. 2312-2329, 1999.

[18] Kohn, W. D., Cyril, M.C., Hodges, R.S., “Salt effects on protein stability: Two stranded α-Helical Coiled-coils Containing Inter- or Intrahelical Ion Pairs,” Journal of Molecular Biology, vol. 267, pp. 1039-1052, 1997.

[19] Walshaw, J., Woolfson, D.N., “SOCKET: a program for identifying and analysing coiled-coil motifs within protein structures,” Journal of Molecular Biology, vol. 307, no. 5, pp. 1427-50, 2001.

[20] Lupas, A., Van Dyke, M., and Stock, J., “Predicting Coiled-coils from Protein Sequences,” Science, vol. 252, pp. 1162-1164, 1991.

[21] Lupas, A. “Prediction and Analysis of Coiled-Coil Structures”, Meth. Enzymology vol. 266: pp. 513-525,

[22] Wolf, E., Kim, P.S., and Berger, B., “MultiCoil: A program for predicting two- and three-stranded coiled-coils,” Protein Science, vol. 6, pp. 1179-1189, 1997.

[23] “DNA,” Wikipedia. (http://en.wikipedia.org/wiki/DNA)

[24] Kwok, S.C., and Hodges, R.S., “Stabilizing and Destabilizing Clusters in the Hydrophobic Core of Long Two-Stranded α-Helical Coiled-coils,” Journal of Biological Chemistry, vol. 279, no. 20, pp. 21576-21588, 2004.

[25] “Intermolecular force,” Wikipedia. (http://en.wikipedia.org/wiki/Intermolecular_force)

[26] “Protein Structure,” Wikipedia. (http://en.wikipedia.org/wiki/Image:Protein-structure.png)

[27] Kwok, S.C., and Hodges, R.S., “Effect of chain length on coiled-coil stability: Decreasing stability with increasing chain length,” Peptide Science, vol. 76, no. 5, pp. 378-390, 2004.

APPENDIX A: CREATING MATERIALIZED VIEWS IN MYSQL

The script below emulates materialized views in MySQL. MySQL does not support materialized views natively, but they can be approximated by scheduling CREATE TABLE ... AS statements built from dynamic SQL queries, as the code below does. The code also supports the creation of primary keys, and thus uniqueness constraints, on the tables on the fly.

DROP PROCEDURE IF EXISTS SP_CREATEMATERIALIZEDVIEW;

DELIMITER //

CREATE PROCEDURE SP_CREATEMATERIALIZEDVIEW(
    IN i_sourcetable VARCHAR(150),
    IN i_primarykey  VARCHAR(150),
    IN i_targettable VARCHAR(150),
    IN i_sqlconvert  TEXT
)
LANGUAGE SQL
NOT DETERMINISTIC
MODIFIES SQL DATA
SQL SECURITY DEFINER
-- --------------------------------------------------------------------------
-- Procedure Name : SP_CREATEMATERIALIZEDVIEW
-- Inputs         : The source table from which we retrieve the data, the
--                  primary key on the destination table, the target table
--                  to load data into, and the SQL to extract data from the
--                  source and store it into the target
-- Output         : The procedure loads data from the source table into the
--                  destination table. The procedure can be executed either
--                  on a trigger or using the crontab
-- --------------------------------------------------------------------------
BEGIN

    -- Creating the SQL queries to drop the materialized view, create the
    -- materialized view and add a primary key to the materialized view.
    -- The statement texts are held in user variables (@...), because
    -- PREPARE ... FROM accepts only a string literal or a user variable.
    SET @v_dropmview_sql := CONCAT('DROP TABLE IF EXISTS ', i_targettable);
    SET @v_createmview_sql := CONCAT('CREATE TABLE ', i_targettable, ' ', i_sqlconvert);
    SET @v_altermview_sql := CONCAT('ALTER TABLE ', i_targettable, ' ',
                                    'CONVERT TO CHARACTER SET latin1 COLLATE latin1_bin');

    IF i_primarykey IS NOT NULL THEN
        SET @v_addprimarykey_sql := CONCAT('ALTER TABLE ', i_targettable,
                                           ' ADD PRIMARY KEY (', i_primarykey, ')');
    END IF;

    -- Dropping the materialized view if it exists
    PREPARE stmt1 FROM @v_dropmview_sql;
    EXECUTE stmt1;

    -- Creating the table which will act as the materialized view
    PREPARE stmt2 FROM @v_createmview_sql;
    EXECUTE stmt2;

    -- Converting the character set of the materialized view
    PREPARE stmt3 FROM @v_altermview_sql;
    EXECUTE stmt3;

    -- Adding the primary key to the table if specified
    IF i_primarykey IS NOT NULL THEN
        PREPARE stmt4 FROM @v_addprimarykey_sql;
        EXECUTE stmt4;
    END IF;

    -- Deallocate the prepared statements and exit
    DEALLOCATE PREPARE stmt1;
    DEALLOCATE PREPARE stmt2;
    DEALLOCATE PREPARE stmt3;

    IF i_primarykey IS NOT NULL THEN
        DEALLOCATE PREPARE stmt4;
    END IF;

END //

DELIMITER ;

APPENDIX B: SQL QUERIES FOR CREATING MATERIALIZED VIEW

Here we take a look at some of the complex queries that are used in creating the materialized views:

FREQUENCY OF OCCURRENCE OF AMINO ACIDS IN AND AROUND i to i + 3 SALTBRIDGE

SELECT COALESCE(cnt_out.offseta, cnt_in.offseta) AS "Offset A",
       'Intrachain i to i + 3' AS "Type OF SaltBridge",
       'Attraction' AS "Interaction Type",
       COALESCE(cnt_out.cnt, 0) AS "Heptads Without Salt Bridges",
       COALESCE(cnt_in.cnt, 0) AS "Heptads With Salt Bridges",
       (COALESCE(cnt_out.cnt, 0) / (COALESCE(cnt_out.cnt, 0) + COALESCE(cnt_in.cnt, 0))) AS "% of Heptads Without SaltBridges",
       (COALESCE(cnt_in.cnt, 0) / (COALESCE(cnt_out.cnt, 0) + COALESCE(cnt_in.cnt, 0))) AS "% of Heptads With SaltBridges"
  FROM (-- Heptads that do not contain an i to i + 3 attraction, by residue at offset a
        SELECT tshc.offseta, COUNT(1) cnt
          FROM tblSplitHeptadCoils tshc
         WHERE tshc.heptadoffsetid NOT IN
               (SELECT ths.heptadoffsetid
                  FROM tblHeptadSalt ths, tblSaltBridge tsb, tblSaltResiduesLookup tsrl
                 WHERE ths.saltbridgeid = tsb.saltbridgeid
                   AND tsb.saltresidueid = tsrl.saltresidueid
                   AND tsrl.interactiontype = 'A'
                   AND tsrl.saltbridgestype = 'Intrachain i to i + 3')
         GROUP BY tshc.offseta
       ) cnt_out
  LEFT OUTER JOIN
       (-- Heptads that contain an i to i + 3 attraction, by residue at offset a
        SELECT tshc.offseta, COUNT(1) cnt
          FROM tblSplitHeptadCoils tshc
         WHERE tshc.heptadoffsetid IN
               (SELECT ths.heptadoffsetid
                  FROM tblHeptadSalt ths, tblSaltBridge tsb, tblSaltResiduesLookup tsrl
                 WHERE ths.saltbridgeid = tsb.saltbridgeid
                   AND tsb.saltresidueid = tsrl.saltresidueid
                   AND tsrl.interactiontype = 'A'
                   AND tsrl.saltbridgestype = 'Intrachain i to i + 3')
         GROUP BY tshc.offseta
       ) cnt_in
    ON cnt_out.offseta = cnt_in.offseta
 WHERE COALESCE(cnt_out.offseta, cnt_in.offseta) <> '';

When we replace the i to i + 3 interaction with i to i + 4 or i to i’ + 5, we can generate materialized views which provide us with information on what amino acid residues occur in ‘a’ position and have salt bridges in the heptad (gabcdef).

COUNT OF TYPES OF CLUSTERS IN COILED-COILS

The SQL query below creates materialized views that are used to identify any relationship between the length of a coiled-coil and stabilizing clusters. If we replace the cluster columns with the de-cluster columns, we can retrieve information on the relationship between destabilizing clusters and coiled-coil length.

SELECT CASE
           WHEN coillength < 50 THEN 'Coiled-coils with coil length less than 50'
           WHEN coillength >= 50 AND coillength < 60 THEN 'Coiled-coils with coil length between 50 and 59'
           WHEN coillength >= 60 AND coillength < 70 THEN 'Coiled-coils with coil length between 60 and 69'
           WHEN coillength >= 70 AND coillength < 80 THEN 'Coiled-coils with coil length between 70 and 79'
           WHEN coillength >= 80 AND coillength < 90 THEN 'Coiled-coils with coil length between 80 and 89'
           WHEN coillength >= 90 AND coillength < 100 THEN 'Coiled-coils with coil length between 90 and 99'
           WHEN coillength >= 100 THEN 'Coiled-coils with coil length greater than 100'
       END AS coiled_coil_by_length,
       SUM(cluster3) AS count_3_cluster,
       SUM(cluster4) AS count_4_cluster,
       SUM(cluster5) AS count_5_cluster,
       SUM(cluster6) AS count_6_cluster,
       SUM(cluster6p) AS count_6p_cluster,
       (SUM(cluster3) / (SUM(cluster3) + SUM(cluster4) + SUM(cluster5) + SUM(cluster6) + SUM(cluster6p))) AS percent_3_cluster,
       (SUM(cluster4) / (SUM(cluster3) + SUM(cluster4) + SUM(cluster5) + SUM(cluster6) + SUM(cluster6p))) AS percent_4_cluster,
       (SUM(cluster5) / (SUM(cluster3) + SUM(cluster4) + SUM(cluster5) + SUM(cluster6) + SUM(cluster6p))) AS percent_5_cluster,
       (SUM(cluster6) / (SUM(cluster3) + SUM(cluster4) + SUM(cluster5) + SUM(cluster6) + SUM(cluster6p))) AS percent_6_cluster,
       (SUM(cluster6p) / (SUM(cluster3) + SUM(cluster4) + SUM(cluster5) + SUM(cluster6) + SUM(cluster6p))) AS percent_6p_cluster,
       COUNT(1) AS total_clusters
  FROM tblCoiledCoil
 GROUP BY coiled_coil_by_length

LOCATION OF AMINO ACIDS WITHIN THE COILED-COIL

The query below estimates how many times a particular amino acid is located at a given position in the coiled-coil. The positions are divided into three groups:

1. At the start of the coiled-coil – from the start to (coil length / 3) - 1
2. At the center of the coiled-coil – from coil length / 3 to (coil length * 2/3) - 1
3. Near the end of the coiled-coil – from coil length * 2/3 to the coil end

-- String literals are shown with single quotes; each quote must be doubled
-- when the query is passed as the i_sqlconvert string argument of
-- SP_CREATEMATERIALIZEDVIEW.
SELECT heptad.offsetd,
       CASE
           WHEN amino_acid_location BETWEEN 1 AND ROUND(coillength / 3) - 1
               THEN 'At the start of the coiled-coil'
           WHEN amino_acid_location BETWEEN ROUND(coillength / 3) AND ROUND(coillength * (2 / 3)) - 1
               THEN 'At the center of the coiled-coil'
           WHEN amino_acid_location BETWEEN ROUND(coillength * (2 / 3)) AND coillength
               THEN 'Near the end of the coiled-coil'
       END located_where,
       COUNT(*) total_count
  FROM (SELECT coilid,
               offsetd,
               -- Index of offset d within the heptad, shifted by the heptad start location
               FIND_IN_SET(CONCAT(offsetd, 'd'),
                           CONCAT_WS(',',
                                     CONCAT(tshc.offsetg, 'g'),
                                     CONCAT(tshc.offseta, 'a'),
                                     CONCAT(tshc.offsetb, 'b'),
                                     CONCAT(tshc.offsetc, 'c'),
                                     CONCAT(tshc.offsetd, 'd'),
                                     CONCAT(tshc.offsete, 'e'),
                                     CONCAT(tshc.offsetf, 'f'))
                          ) + tshc.heptadstartloc - 1 amino_acid_location
          FROM tblSplitHeptadCoils tshc
         WHERE TRIM(offsetd) <> ''
       ) heptad,
       tblCoiledCoil tcc
 WHERE tcc.coilid = heptad.coilid
 GROUP BY heptad.offsetd, located_where, short_name
 ORDER BY heptad.offsetd ASC, total_count DESC;

APPENDIX C: INSTALLATION OF SOFTWARE FOR THE PROJECT

The project uses Perl 5.8.8 and MySQL 5.0 as its back end, while the front end is designed using PHP 5.2.5 server-side pages which run on the Apache server. The testing of the project was done on writers Windows XP desktop, using Microsoft® IIS server. The desktop has 256 MB of RAM and hence the server was slow to respond on the test box. It should be noted that IIS servers cache the results even when PHP scripts explicitly state no-cache. Hence the dynamic charts on the website do not refresh even when the searches are performed. Listed below are some of modules crated and utilized during the course of this project.

For the back-end operations, the following six home-grown Perl modules facilitate the scraping and loading of the data:

1. DbiUtilities.pm: The module has four subroutines that carry out the database connect and disconnect work.

Subroutine’s Name    Subroutine’s Function
dbConnect            Connects to the specified database using the specified username
dbDisconnect         Disconnects from the specified database
generateDBError      Provides error messages on any database-related exceptions

2. MySQLInstances.pm: The module holds a single global variable that stores the passwords for the various MySQL accounts. The passwords are encrypted using the CryptDatabase.pm module.

3. CryptDatabase.pm: The module has two subroutines that carry out the encryption and decryption of the database passwords.

Subroutine’s Name    Subroutine’s Function
encrypt              Encrypts the database passwords
decrypt              Decrypts the database passwords

4. Alert.pm: The module has three subroutines that send out alert emails based on the type of message.

Subroutine’s Name    Subroutine’s Function
failureAlert         Sends failure notices
completionAlert      Sends completion notices
customAlert          Sends custom notices

5. Log.pm: The module has four subroutines that log error or success messages with the appropriate severity.

Subroutine’s Name    Subroutine’s Function
initializeLog        Creates a unique log file name using the file name provided
writeToLog           Writes messages (error, warning, success) to the log file
logTabularDisplay    Returns a string in tabular format (can be used for logging files and their error messages in a table)
finalizeLog          Closes the file handle that was opened to log the messages
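The idea behind logTabularDisplay, returning rows of text as one aligned table string, can be sketched as follows. This is a Python illustration under stated assumptions and not the Perl module itself: the function name log_tabular_display, its arguments and the column separators are all hypothetical.

```python
from typing import List, Sequence


def log_tabular_display(headers: Sequence[str], rows: Sequence[Sequence[str]]) -> str:
    """Return headers and rows as a fixed-width text table (illustrative only)."""
    # Every row is assumed to have exactly one cell per header.
    table: List[List[str]] = [list(headers)] + [[str(cell) for cell in row] for row in rows]
    widths = [max(len(line[i]) for line in table) for i in range(len(headers))]

    lines: List[str] = []
    for index, line in enumerate(table):
        padded = " | ".join(cell.ljust(width) for cell, width in zip(line, widths))
        lines.append(padded.rstrip())
        if index == 0:
            # Separator row under the header.
            lines.append("-+-".join("-" * width for width in widths))
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical file names and messages, for demonstration only.
    print(log_tabular_display(["File", "Error"],
                              [["a.txt", "missing"], ["bb.txt", "ok"]]))
```

The returned string can then be written to the log in a single call, so that file names and their messages stay aligned.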

6. Time.pm: The module has four subroutines that provide date and time data in various formats.

Subroutine’s Name    Subroutine’s Function
now                  Gets all the date and time information for the current day
getDateStamp         Provides the date stamp for the current day
getHourStamp         Provides the hour stamp for the current day
getDate              Provides the current, past and historical dates based on arguments
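The kind of helpers that Time.pm provides can be sketched as follows. This is a Python illustration rather than the Perl module: the stamp formats (YYYYMMDD and YYYYMMDDHH) and the day-offset argument are assumptions, since the exact formats and arguments of the original subroutines are not given here.

```python
from datetime import date, datetime, timedelta


def get_date_stamp(moment: datetime) -> str:
    """Date stamp for the given moment; the YYYYMMDD format is an assumption."""
    return moment.strftime("%Y%m%d")


def get_hour_stamp(moment: datetime) -> str:
    """Hour stamp for the given moment; the YYYYMMDDHH format is an assumption."""
    return moment.strftime("%Y%m%d%H")


def get_date(reference: date, days_offset: int = 0) -> date:
    """Return the reference date shifted by days_offset (negative for past dates)."""
    return reference + timedelta(days=days_offset)


if __name__ == "__main__":
    current = datetime.now()
    print(get_date_stamp(current), get_hour_stamp(current))
    print(get_date(current.date(), -7))  # the date one week earlier
```

Such stamps are commonly used to build unique log file names, although how the project combines them is not specified.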

The front end interfaces with the back end through PHP server-side pages, using the mod_php module on the Apache server. The PHP pages use the underlying tables, views and materialized views to search and display their results. These pages make frequent use of regular expressions, particularly for the coiled-coil motif search. Each page includes a help page explaining the search results and the fields on which the user can search.
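A regular-expression motif search of the kind mentioned above can be sketched as follows. This is a Python illustration and not the project's PHP code: the function name find_motif, the example sequence and the example pattern are hypothetical, and the lookahead is one possible way to report overlapping matches.

```python
import re
from typing import List, Tuple


def find_motif(sequence: str, pattern: str) -> List[Tuple[int, str]]:
    """Return (1-based start, matched text) for every occurrence, overlaps included."""
    # A zero-width lookahead lets matches that overlap each other be reported.
    finder = re.compile(f"(?=({pattern}))")
    return [(match.start() + 1, match.group(1)) for match in finder.finditer(sequence)]


if __name__ == "__main__":
    # Hypothetical sequence and pattern: L, any two residues, then L, I or V.
    for start, text in find_motif("LKELEDKLEELLSK", "L..[LIV]"):
        print(start, text)
```

Reporting 1-based start positions keeps the results comparable with residue numbering in a sequence.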
