
Intelligent Systems Research Centre

Use of Neural Networks to Predict and Analyze Membrane Proteins in the Proteome

Subrata Kumer Bose


London Metropolitan University, UK


Abstract

• Transmembrane (TM) proteins are among the most understudied groups of proteins in biochemical research, because structural information about transmembrane regions is technically difficult to obtain. X-ray crystallography has yielded 3D structures for about 15,000 proteins, but only about 30 of these are transmembrane proteins, even though TM proteins may account for about 30% of the proteome. This project seeks to contribute to knowledge and understanding in the field of neural networks by developing a particular area of theory and applying a novel methodology. It aims to develop software that analyses protein sequences for the presence of membrane-spanning regions using artificial neural network approaches. The expected benefits include a better understanding of how to create and train optimal neural networks for membrane protein datasets, which should be useful in both academia and industry.


Introduction

• Bioinformatics is the application of computer technology to the management of biological information. Computers are used to gather, store, analyze and integrate biological and genetic information which can then be applied to gene-based drug discovery and development.

Bioinformatics


Introduction (contd.)

• In recent years, many bioinformaticians have researched the prediction of globular proteins, which make up roughly 75% of the whole proteome.

• However, membrane proteins, which make up 20-30% of the proteome and offer more novel targets for new drug development, are largely ignored.

Membrane Proteins


Introduction (contd.)

• Data mining (or more precisely, knowledge extraction) can be described as the process of discovering previously unknown dependencies and relationships in data sets.

• A (learning) system may discover salient features in the input data whose importance was not previously recognized.

• It is now established that algorithms can be designed to extract understandable representations from trained neural networks, enabling the networks to be used for data mining (Browne et al., 2004).

Data Mining
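As a rough illustration of what extracting an understandable representation from a trained network can mean, the sketch below treats a network as a black box, labels random samples with it, and searches for the single IF-THEN threshold rule that best reproduces its outputs. This is not the extraction algorithm of Browne et al.; `black_box_network` is a hypothetical stand-in for a trained network, and the one-rule "tree" is an assumption chosen for brevity.

```python
import random

def black_box_network(features):
    """Hypothetical stand-in for a trained network: class 1 when a weighted sum exceeds 0.5."""
    return 1 if 0.9 * features[0] + 0.1 * features[1] > 0.5 else 0

def extract_stump(model, samples):
    """Find the rule 'feature > threshold => class 1' that best mimics the model."""
    labels = [model(s) for s in samples]  # the model's own answers are the targets
    best = None  # (agreement, feature index, threshold)
    for feature in range(len(samples[0])):
        # Candidate thresholds are the values observed for this feature.
        for threshold in sorted({s[feature] for s in samples}):
            predictions = [1 if s[feature] > threshold else 0 for s in samples]
            agreement = sum(p == y for p, y in zip(predictions, labels)) / len(samples)
            if best is None or agreement > best[0]:
                best = (agreement, feature, threshold)
    return best

rng = random.Random(0)
samples = [[rng.random(), rng.random()] for _ in range(200)]
agreement, feature, threshold = extract_stump(black_box_network, samples)
print(f"IF x[{feature}] > {threshold:.2f} THEN class 1 "
      f"(agrees with the network on {agreement:.0%} of samples)")
```

The extracted rule is readable where the network's weights are not; a full decision tree would apply the same search recursively to each side of the split.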


Introduction (contd.)

• In the past, most data mining has been performed using symbolic artificial intelligence algorithms such as C4.5, C5 or CART.

• Neural Networks (NNs) have in the past been treated as ‘black boxes’: systems unable to explain the process by which a decision or output has been reached.

Knowledge Extraction


Objectives of the investigation

• The project seeks to develop software for analysing protein sequences for the presence of membrane-spanning regions using artificial neural network approaches. Beyond simply identifying membrane-spanning regions, the approach would be used to analyse biologically useful subsets of proteins with membrane-spanning regions, which would include:

• (i) The large family of G-protein coupled receptors (GPCRs). These form an important group of drug targets of interest to the pharmaceutical industry, and are the site of interaction of many hormones, neurotransmitters and other chemical stimuli around the body. Attempts have been made to develop methods for predicting the coupling specificity of GPCRs using hidden Markov models (Möller et al. 2001b), and the project would extend work in this area.

• (ii) Membrane proteins with distinct cellular locations. Prediction of the localization of membrane proteins to the Golgi apparatus has been attempted (Yuan and Teasdale 2002), and it would be useful to attempt analysis of proteins localized to other membrane compartments, such as the plasma membrane, endoplasmic reticulum, lysosomes and peroxisomes, to look for discriminating motifs in membrane-spanning regions in addition to known localizing signals.

• The methodology could also be applied to membrane proteins unique to bacteria and other micro-organisms, and could potentially identify new targets for antibiotics.


The relationship of this investigation to previous work in the area

• A large number of researchers are investigating globular proteins because the data are easily available.

• The prediction of membrane protein structures is a key area that remains unsolved (Baldi et al. 2002).

• There have been several attempts over the last 20 years to develop tools for predicting membrane-spanning regions, reviewed by Möller et al. (2001a).

Background


The relationship of this investigation to previous work in the area

• The problem of prediction is made topologically more complex by the presence of several transmembrane domains in many proteins, and the same authors (Möller et al., 2001b) conclude that current tools are far from achieving 95% reliability in prediction. The same group notes that the software developed so far follows one of two principles: a local approach or a global approach.

Current Tools


Neural Networks

• An artificial neuron is an information processing element that operates in a manner resembling, in simplified form, some of the operations of a biological neuron. A network of artificial neurons is a collection of such elements, connected together, that can process information in parallel.

Neuron
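A single artificial neuron can be sketched in a few lines. The example assumes the common formulation of a weighted sum of inputs plus a bias, passed through a sigmoid; the input values, weights and bias are hypothetical and chosen only for illustration.

```python
import math

def sigmoid(z):
    """Smooth squashing function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through the sigmoid."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(activation)

# Hypothetical inputs and weights, for illustration only.
print(neuron_output([1.0, 0.0, 1.0], [0.5, -0.3, 0.8], bias=-1.0))
```

Several such elements evaluated side by side on the same inputs form one layer of a network.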


Neural Networks

• According to Haykin (1994), a neural network is a massively parallel distributed processor that has a natural propensity for storing experiential knowledge and making it available for use. It resembles the brain in two respects:

1. Knowledge is acquired by the network through a learning process.

2. Interneuron connection strengths known as synaptic weights are used to store the knowledge.

Definition
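The two points of the definition can be made concrete with a toy example: a single sigmoid neuron learns the logical OR function, and everything it has learned ends up in its weights and bias. The delta-rule update, the learning rate, the number of epochs and the OR task are all assumptions made for illustration; they are not taken from the project.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, inputs):
    """Output of one sigmoid neuron for the given inputs."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def train_neuron(samples, epochs=2000, learning_rate=0.5, seed=0):
    """Learning process: repeatedly nudge the weights to reduce the output error."""
    rng = random.Random(seed)
    weights = [rng.uniform(-0.5, 0.5) for _ in range(len(samples[0][0]))]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            output = predict(weights, bias, inputs)
            # Error signal: negative gradient of 0.5 * (target - output)**2
            # with respect to the neuron's weighted sum.
            delta = (target - output) * output * (1.0 - output)
            weights = [w + learning_rate * delta * x for w, x in zip(weights, inputs)]
            bias += learning_rate * delta
    return weights, bias

OR_SAMPLES = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
weights, bias = train_neuron(OR_SAMPLES)
print("learned weights:", weights, "bias:", bias)
```

After training, the stored knowledge is just the numbers in `weights` and `bias`, which is why a separate extraction step is needed to make it readable.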


Architecture of the Neural Network

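[Figure: feed-forward network. An amino acid sequence supplies the input signals x1, x2, …, xm to a layer of input neurons, followed by a layer of hidden neurons and a layer of output neurons producing the outputs y1 and y2. Output classes: Membrane Protein, Nonmembrane Protein.]

Network Used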



• Data Collection: ensuring that the correct data are gathered.

• Data Preparation: cleaning the data and ensuring that they are in the appropriate format for Neural Connection.

• Design: choosing the best neural approach (here, the MLP).

• Training and Testing: building the application.

• Experimentation: tailoring the application to improve the results.

• Implementation: producing the results.

Steps Involved
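The data preparation and training/testing steps can be sketched as follows. The slides do not describe the actual cleaning rules or the split used, so everything here is an assumption: sequences are upper-cased, sequences containing non-standard residue symbols are dropped, duplicates are removed, and a quarter of the records is held out for testing. The example records are invented.

```python
import random

STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")  # 20 standard one-letter codes

def clean_sequence(raw):
    """Strip whitespace and upper-case; return None for empty or non-standard sequences."""
    sequence = "".join(raw.split()).upper()
    if not sequence or not set(sequence) <= STANDARD_RESIDUES:
        return None
    return sequence

def prepare_dataset(records, test_fraction=0.25, seed=0):
    """Clean (sequence, label) records, drop duplicates, and split into train and test."""
    seen = set()
    cleaned = []
    for raw, label in records:
        sequence = clean_sequence(raw)
        if sequence is not None and sequence not in seen:
            seen.add(sequence)
            cleaned.append((sequence, label))
    random.Random(seed).shuffle(cleaned)  # seeded so the split is reproducible
    n_test = int(len(cleaned) * test_fraction)
    return cleaned[n_test:], cleaned[:n_test]

# Invented records, for illustration only.
records = [
    ("mkt ll\nav", "membrane"),
    ("MKTLLAV", "membrane"),      # duplicate of the first record after cleaning
    ("ACDEXG", "nonmembrane"),    # contains 'X', so it is dropped
    ("ACDEFG", "nonmembrane"),
    ("WYVTLL", "membrane"),
    ("GGSGGS", "nonmembrane"),
]
train, test = prepare_dataset(records)
print(len(train), "training records,", len(test), "test records")
```

Keeping the test records out of training is what allows the later testing step to measure generalization instead of memorization.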


Data Collection

Architecture



Data Preparation



Design

• The first consideration in designing the application was the neural technique to be adopted. A problem of this type, where a set of inputs (a sequence of amino acids) must be linked to an output (membrane, nonmembrane, or a sub-classification of membrane), should be solved using a supervised neural technique. Neural Connection provides three supervised neural techniques: the Radial Basis Function, the Bayesian Network and the Multi-Layer Perceptron.
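Before a supervised network can link an amino acid sequence to a class, the sequence has to be turned into numbers. The slides do not say which encoding was used; the sketch below assumes a one-hot encoding over the 20 standard residues and fixed-width sliding windows, both common choices and both hypothetical here.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, one-letter codes

def one_hot_encode(sequence):
    """Return a flat list with 20 values per residue, a single 1.0 marking the residue."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    encoded = []
    for residue in sequence.upper():
        row = [0.0] * len(AMINO_ACIDS)
        if residue in index:  # unknown symbols such as 'X' stay all-zero
            row[index[residue]] = 1.0
        encoded.extend(row)
    return encoded

def sliding_windows(sequence, width):
    """All contiguous sub-sequences of the given width, left to right."""
    return [sequence[i:i + width] for i in range(len(sequence) - width + 1)]

# Hypothetical fragment and window width, for illustration only.
for window in sliding_windows("MKTLLAV", 5):
    print(window, len(one_hot_encode(window)))
```

With a window of width w, the network needs 20 × w input neurons, so the window width directly fixes the size of the input layer.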



Design

• MLPs are the most commonly used neural computing technique.

• The MLP differs from the Simple Perceptron in two major ways.

• Firstly, it has an additional layer of neurons between the input and output layers, known as the hidden layer. This layer vastly increases the learning power of the MLP.

• Secondly, it uses a transfer, or activation, function to modify the input to a neuron.

• The activation of hidden and output layer neurons is the same as in the case of simple Perceptrons, while the transfer function is a smooth non-linear function, usually the sigmoid function.
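The forward pass of an MLP with one hidden layer can be sketched as below. The layer sizes, the random initial weights and the input vector are assumptions for illustration; the project itself used the MLP implementation in Neural Connection, whose internals are not shown in these slides.

```python
import math
import random

def sigmoid(z):
    """Smooth non-linear transfer function."""
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(inputs, weights, biases):
    """One fully connected layer: weighted sum per neuron, then the sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]

def init_layer(n_inputs, n_neurons, rng):
    """Small random weights and zero biases for one layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)] for _ in range(n_neurons)]
    return weights, [0.0] * n_neurons

def mlp_forward(inputs, hidden, output):
    """Input -> hidden layer -> output layer."""
    hidden_activations = layer_forward(inputs, *hidden)
    return layer_forward(hidden_activations, *output)

rng = random.Random(0)
hidden_layer = init_layer(n_inputs=4, n_neurons=3, rng=rng)  # hypothetical sizes
output_layer = init_layer(n_inputs=3, n_neurons=2, rng=rng)  # e.g. two output classes
outputs = mlp_forward([1.0, 0.0, 0.0, 1.0], hidden_layer, output_layer)
print(outputs)
```

Removing the hidden layer from `mlp_forward` leaves a single-layer network that can only separate classes with a linear boundary, which is the limitation the hidden layer is there to overcome.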


Training and Testing

Training Cycles


Results



Results


Conclusions

• This technique demonstrates that it is possible to combine the generalization accuracy of NNs with the comprehensibility provided by the knowledge extraction method.

• Preliminary results will be analysed and further improvements will be designed.


Conclusions (contd.)

• Modern data gathering techniques are producing vast amounts of data. However, data can be useless in the absence of understanding.

• The extraction of decision trees from trained NNs is an important addition to the data mining toolkit of knowledge extraction techniques (Browne & Sun, 1999, 2001).

• The combination of NNs with an algorithm to extract knowledge from the trained networks potentially offers the ‘best of both worlds’ to those attempting to make predictions on their data and simultaneously understand it.


References

• Bose, S., Browne, A., Hassan, K. and White, K. (2003) Knowledge Discovery in Bioinformatics using Neural Networks. Proceedings 6th International Conference on Computer and Information Technology, Dhaka, Bangladesh.

• Baldi, P. and Pollastri, G. (2002) Machine Learning Structural and Functional Proteomics. IEEE Intelligent Systems (Intelligent Systems in Biology II), March/April 2002.

• Browne, A., Hudson, B. D., Whitley, D. C., Ford, M. G. and Picton, P. (2003) Biological Data Mining with Neural Networks: Implementation & Application of a Flexible Decision Tree Extraction Algorithm to Genomic Problem Domains. Neurocomputing: Special Issue on Neural Networks in Bioinformatics (In Press). ISSN: 0925-2312.

• Browne, A., Hudson, B. D., Whitley, D. C., Ford, M. G. and Picton, P. (2004) Biological data mining with neural networks: Implementation and application of a flexible decision tree extraction algorithm to genomic problem domains. Neurocomputing (In Press). ISSN: 0925-2312.

• Browne, A. (2002) Representation and extrapolation in multi-layer perceptrons. Neural Computation, 14(7), 1739-1754. ISSN: 0899-7667.

• Browne, A. and Sun, R. (2001) Connectionist inference models. Neural Networks, 14(10), 1331-1355. ISSN: 0893-6080.

• Browne, A. and Sun, R. (1999) Connectionist variable binding. Expert Systems: The International Journal of Knowledge Engineering and Neural Networks, 16(3), 189-207. ISSN: 0266-4720.

• Browne, A. and Picton, P. (1999) Two analysis techniques for feed-forward networks. Behaviormetrika: Special Issue on Analysis of Knowledge Representations in Neural Network Models, 26(1), 75-87. ISSN: 0385-7417.

• Möller, S., Croning, M. D. R. and Apweiler, R. (2001a) Evaluation of methods for the prediction of membrane spanning regions. Bioinformatics, 17: 646-653.

• Möller, S., Vilo, J. and Croning, M. D. R. (2001b) Prediction of the coupling specificity of G protein coupled receptors to their G proteins. Bioinformatics, 17: 174S-181S.

• Möller, S., Kriventseva, E. V. and Apweiler, R. (2000) A collection of well characterized integral membrane proteins. Bioinformatics, 16: 1159-1160.

• Yang, S. and Browne, A. (2002a) Multistage neural networks: Adaptive combination of ensemble results. Proceedings of the Fourth International Conference on Recent Advances in Soft Computing (RASC2002), Nottingham, UK.

• Yang, S. and Browne, A. (2002b) Multistage Neural Network Ensembles. Proceedings of the Third International Workshop on Multiple Classifier Systems, Cagliari, Italy, published as Lecture Notes in Computer Science 2364, Springer Verlag, Berlin, Heidelberg.
