[IEEE 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012) - Istanbul (2012.08.26-2012.08.29)] 2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining - Improving Tumor Identification by Using Tumor Markers Classification Strategy

  • Improving Tumor Identification by Using Tumor Markers Classification Strategy

    Florije Ismaili Faculty of Contemporary Sciences and Technologies

    SEEU Tetovo, Macedonia

    f.ismaili@seeu.edu.mk

    Luzana Bekiri Biochemical Laboratory Albimedika

    Tetovo, Macedonia lu.zana.b@hotmail.com

    Abstract: Tumor markers are substances, usually proteins, that are produced by the body in response to cancer growth and can be found in the blood, urine, stool, or tumor tissue; more recently, DNA changes have also been used as markers. Thus far, more than 20 different tumor markers have been identified; some are specific to a particular type of cancer, while others are associated with several cancer types. The problem of tumor profiling has been extensively studied by the bioinformatics community. Although tumor classification has improved, there is still no general approach for identifying new cancer classes or for assigning tumors to known classes. In this paper we describe a novel strategy for tumor classification using the Growing Hierarchical Self-Organizing Map (GHSOM), since it is able to weigh the contribution of each marker according to its relatedness to other tumor markers and handles highly skewed tumor marker expressions well. Finally, experiments are conducted to demonstrate the feasibility and efficiency of the tumor classification approach, which provides a valuable contribution to the field of oncology and can serve as a guide for the identification of cancer diseases.

    Keywords: cancer classification; tumor markers; tumor prediction.

    I. INTRODUCTION

    Cancer research is one of the major research areas in the medical field. The classification of different tumor types has great value in cancer diagnosis and drug discovery. Most previous cancer classification studies are clinical-based and have limited diagnostic ability [1, 2, 3]. Recently, researchers have started to explore the possibilities of retrieving information from microarray gene expression data, which is known to contain the keys for addressing the fundamental problems relating to cancer diagnosis and drug discovery. The advent of the DNA microarray technique has made it possible to monitor the expression of thousands of genes.

    Different classification methods have been applied to tumor classification. Some researchers focus on the molecular classification of various clinical samples, such as acute leukemia, human cancer cell lines and brain tumors [1, 2, 3]. Others focus on the analytical approaches applied to this task, such as support vector machines [4], k-nearest neighbors, weighted voting [1], artificial neural networks [2], and supervised clustering [5].

    Although a large number of methods have been proposed in recent years with promising results, many issues still need to be addressed and understood. For that reason, tumor classification remains a challenging task [6].

    In this paper, we propose two novel classification models: Growing Hierarchical Self-Organizing Map (GHSOM) clustering of the gene expression data [7] and Gene Ontology (GO) based clustering of the gene expression data [8].

    First, gene expression profiles are arranged in a hierarchy according to their semantic similarity by using the GHSOM clustering technique, followed by prediction of the marker gene set. Second, the Generalized Cosine Similarity is used to measure the similarity between two genes in the Gene Ontology, which provides a controlled vocabulary to describe biological knowledge about genes and gene products and the relationships between them.

    The rest of this paper is organized as follows. Section two provides biological background information about tumors and tumor markers. Section three introduces the method of gene expression data based clustering of the tumors and marker gene selection. The method of knowledge based clustering of the tumors is presented in section four. The empirical evaluation and results are given in section five, and section six concludes the paper.

    II. BIOLOGICAL BACKGROUND INFORMATION AND PROBLEM STATEMENT

    In order to better understand the proposed approach for tumor classification, it is worth reviewing some fundamental knowledge of molecular biology.

    The main working units of every living system are cells, where all the instructions needed to direct their activities are contained within the chemical deoxyribonucleic acid or DNA [12].

    The entire DNA sequence that codes for a living thing is called its genome. The genome does not function as one long sequence, but is divided into a set of genes, each of which has a specific and unique purpose. The process of transcribing a gene's DNA sequence into RNA is called gene expression. A gene's expression level provides a measure of the activity of the gene under certain biochemical conditions; specific patterns of gene expression occur during different biological states such as embryogenesis and cell development, and during normal physiological responses in tissues and cells [12]. A change in the expression values of certain genes can indicate certain diseases, such as cancer [13]. Thus, to identify different gene functions and to support cancer diagnosis, research has focused on DNA microarrays and the analysis of gene expression.

    2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining

    978-0-7695-4799-2/12 $26.00 © 2012 IEEE. DOI 10.1109/ASONAM.2012.141

    SAGE (Serial Analysis of Gene Expression) was designed to obtain a quantitative profile of cellular gene expression. It quantifies a tag that represents the transcription product of a gene. The data product of the SAGE technique is a list of tags with their corresponding count values, which is a digital representation of cellular gene expression.
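    As a minimal sketch of this data product, the snippet below builds a tag/count list with Python's `collections.Counter`. The tag strings and the function name `sage_profile` are hypothetical and only illustrate the representation; they are not taken from a real SAGE library.

```python
from collections import Counter


def sage_profile(tags):
    """Count how often each tag occurs; the counts form the digital expression profile."""
    return Counter(tags)


# Hypothetical tag strings, used only to illustrate the tag/count representation.
observed_tags = ["AAACCC", "TTTGGG", "AAACCC", "CCCAAA", "AAACCC"]
profile = sage_profile(observed_tags)

# List the tags from the most to the least frequent one.
for tag, count in profile.most_common():
    print(tag, count)
```

    Each distinct tag appears once in the profile, and its count plays the role of the expression level of the corresponding transcript.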

    A. Statistical vs. Biological Significance

    An important area of bioinformatics study is to provide biologists with biologically meaningful information about the genes, their classification and the interactions between attributes.

    Information about gene interaction is of great biological relevance for cancer classification. It provides the biologists a clearer understanding of the roles a certain set of genes play in cancer development and related issues. One important issue is to find marker genes.

    In order to better understand the role of marker genes, it is necessary to understand the role of tumor markers in biology.

    Tumor markers are substances that are produced by cancer cells or by other cells of the body in response to cancer or certain noncancerous conditions [14]. Most tumor markers are proteins, but recently, patterns of gene expression and changes to DNA have also begun to be used as tumor markers. Tumor markers are used to help detect, diagnose, and manage some types of cancer. In bioinformatics, the role of tumor markers is played by marker genes.

    Marker genes are genes whose expression values are biologically useful for determining the class of the samples. In other words, marker genes are genes that characterize the tumor classes.

    III. GENES EXPRESSION DATA BASED CLUSTERING OF THE TUMORS AND MARKER GENES SELECTION

    The challenges of cancer classification are class discovery and class prediction. Class discovery refers to defining previously unrecognized tumor subtypes. Class prediction refers to the assignment of particular tumor samples to already-defined classes, which could reflect current states or future outcomes. Clustering techniques have proven helpful for understanding microarray gene expression data [4, 6]. Co-expressed genes can be grouped into clusters based on their expression patterns. We employ the unsupervised clustering method GHSOM [7] to cluster the tumor genes.

    The process of hierarchical clustering of the gene expression data consists of the following phases:

    a) GHSOM clustering of genes.

    b) Selecting the optimal marker genes.

    A. GHSOM clustering of genes

    Tumor classification using gene expression data is challenging because of the characteristics of microarray data sets, which have a small number of samples and a large number of genes. In order to best classify the gene expression data and to split a large data set into smaller groups, the GHSOM clustering method is used.

    GHSOM can build a hierarchy of multiple layers where each layer consists of several independent growing SOMs. The GHSOM architecture is similar to a tree structure in which the SOMs at the upper layers contain global information about the organization of the clusters in the data, while the lowest layers of the hierarchy hold information about the details.

    In our approach, GHSOM clusters the genes, arranged as nodes in a hierarchy, where each hierarchy is represented by a SOM. Each such SOM represents a group of genes that are related according to their semantic similarity.

    Figure 1. Simplified workflow to build the tumor hierarchies

    Each cluster has a center which represents the prototype gene. Here, the prototype gene is meant to represent the genes in a cluster. From a biological point of view, the prototype is characterized by the expression profile that is most similar to those of all genes in the cluster. In other words, the co-expressed genes may belong to the same pathways or have similar functions. In our approach, we select as the prototype gene the gene which has the minimum total distance to the other genes. The set of prototype genes of the clusters represents the marker genes of each hierarchy.

    The vector representation of the marker genes is used to calculate the expression distance between the marker genes and a newly presented sample, in order to predict the most similar cluster. For this purpose, the Euclidean distance [7] between the two vectors is used.
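    The prototype selection and the nearest-cluster prediction described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names `prototype_index` and `nearest_cluster` and the two-dimensional toy expression vectors are hypothetical, and the clusters are given directly instead of being produced by a trained GHSOM.

```python
import math


def prototype_index(cluster):
    """Index of the gene with the minimum total Euclidean distance to the other genes."""
    totals = [sum(math.dist(gene, other) for other in cluster) for gene in cluster]
    return totals.index(min(totals))


def nearest_cluster(sample, prototypes):
    """Index of the prototype (marker gene) closest to the sample in Euclidean distance."""
    distances = [math.dist(sample, proto) for proto in prototypes]
    return distances.index(min(distances))


# Hypothetical expression vectors for two clusters (one vector per gene).
cluster_a = [[1.0, 1.0], [1.2, 0.9], [3.0, 3.0]]
cluster_b = [[8.0, 8.0], [8.5, 7.5], [9.0, 8.2]]

# One prototype (marker gene) per cluster.
prototypes = [c[prototype_index(c)] for c in (cluster_a, cluster_b)]

# A newly presented sample is assigned to the cluster of the closest prototype.
new_sample = [7.9, 8.1]
predicted = nearest_cluster(new_sample, prototypes)
print(prototypes, predicted)
```

    Selecting the gene with the smallest total distance yields a medoid, so the prototype is always an actual gene of the cluster and not an averaged profile.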

    B. Knowledge Based Clustering of the Tumors

    Expression based clustering does not always result in clusters which are biologically similar. To help select the clustering algorithm best suited to produce meaningful clusters, a cluster assessment that incorporates biological information is needed.

    Recently, clustering techniques have begun to incorporate the valuable biological information present in the Gene Ontology. Most of the work in this direction focuses on using the Gene Ontology for cluster validation [9]. The Gene Ontology (GO) [8] provides a controlled vocabulary to describe biological knowledge about genes and gene products and the relationships between them. Our method is based on the definitions and examples provided in [9], which is a repository for all information in GO. The three components of GO are molecular function, biological process, and cellular component. A gene product can have one or more molecular functions, participate in one or more biological processes, and be a part of one or more cellular components.

    GO terms are organized in a Directed Acyclic Graph (DAG), such that child terms are more specialized than parent terms. Each GO term is annotated with a list of gene products, while each edge of the DAG, which represents the relationship between the GO terms it connects, has a type property that can be either an "is a" or a "part of" relationship.

    By providing a standard vocabulary across any biological resources, the GO enables researchers to use this information for automatic data analysis done by computers and not by humans.

    To determine the closeness between GO nodes, the GO process ontology is employed. Our algorithm takes into consideration the semantic distance between concepts in the ontology tree representation. For two given terms or collections of elements C1 and C2, the semantic distance is defined as the similarity of the concepts under the subClassOf relation.

    To this end, we take into consideration the depth of each node and the lowest common ancestor (LCA), which is the node of greatest depth that is an ancestor of both C1 and C2. The semantic similarity can be defined as follows:

    $$\mathrm{Sim}(C_1, C_2) = \frac{2 \cdot \mathrm{depth}(\mathrm{LCA}(C_1, C_2))}{\mathrm{depth}(C_1) + \mathrm{depth}(C_2)} \qquad (1)$$
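    Equation (1), twice the depth of the lowest common ancestor divided by the sum of the depths of the two terms, can be illustrated with a small sketch. It rests on several assumptions that are not stated in the text: the ontology is simplified to a tree in which every term has a single parent (the real GO is a DAG), the root has depth 1, and the term names and the `parent` mapping are hypothetical.

```python
def path_to_root(term, parent):
    """Return the terms from `term` up to the root, starting with `term` itself."""
    path = [term]
    while term in parent:
        term = parent[term]
        path.append(term)
    return path


def depth(term, parent):
    """Depth of a term, counting the root as depth 1 (an assumed convention)."""
    return len(path_to_root(term, parent))


def lowest_common_ancestor(c1, c2, parent):
    """Deepest term that is an ancestor of (or equal to) both c1 and c2."""
    ancestors_of_c1 = set(path_to_root(c1, parent))
    for term in path_to_root(c2, parent):
        if term in ancestors_of_c1:
            return term
    raise ValueError("terms do not share a root")


def semantic_similarity(c1, c2, parent):
    """Equation (1): 2 * depth(LCA) / (depth(c1) + depth(c2))."""
    lca = lowest_common_ancestor(c1, c2, parent)
    return 2 * depth(lca, parent) / (depth(c1, parent) + depth(c2, parent))


# Hypothetical child -> parent mapping; "process" is the root term.
parent = {
    "metabolism": "process",
    "signaling": "process",
    "lipid_metabolism": "metabolism",
    "protein_metabolism": "metabolism",
}

print(semantic_similarity("lipid_metabolism", "protein_metabolism", parent))
print(semantic_similarity("lipid_metabolism", "signaling", parent))
```

    Two sibling terms under a deep common parent score higher than two terms that only share the root, which is the behavior the depth-based measure is meant to capture.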

    To determine the similarity between the genes in the ontology (GO) and the presented sample (GP), the Generalized Cosine Similarity is used [10]. If the cosine value is equal to 1, the gene in the ontology matches the sample exactly. The smaller the cosine value, the further the gene in the ontology deviates from the sample. When the cosine value is equal to 0, the gene in the ontology fails to match the sample. The matching degrees derived from the cosine value are defined as follows:

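    $$\mathrm{Degree}(GO, GP) = \begin{cases} \text{Exact} & \text{if } \cos = 1.0 \\ \text{Plug-In} & \text{if } \cos \ldots \\ \text{Subsume} & \text{if } \cos \ldots \\ \text{Fail} & \text{if } \cos \ldots \end{cases} \qquad (2)$$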

    The three threshold values that separate the Plug-In, Subsume and Fail degrees are assigned to represent the strength of the match and are based solely on a heuristic decrease of the match values.

    The defined distance measure allows us to calculate the distance between any two genes and to construct the GO distance matrix. An (n × m) table T is created, where n is the total number of genes and m is the number of nodes in the GO DAG.
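    A minimal sketch of such a degree assignment is given below. The concrete threshold values are not stated here, so the cut points `plug_in_min` and `subsume_min` are hypothetical placeholders, as is the function name `match_degree`; only the ordering Exact, Plug-In, Subsume, Fail and the exact match at a cosine value of 1.0 follow the text.

```python
import math


def match_degree(cos_value, plug_in_min=0.7, subsume_min=0.3):
    """Map a cosine similarity in [0, 1] to a matching degree.

    plug_in_min and subsume_min are illustrative placeholder thresholds,
    not values taken from the paper.
    """
    if not 0.0 <= cos_value <= 1.0:
        raise ValueError("cosine value must lie in [0, 1]")
    if math.isclose(cos_value, 1.0):
        return "Exact"
    if cos_value >= plug_in_min:
        return "Plug-In"
    if cos_value >= subsume_min:
        return "Subsume"
    return "Fail"


for value in (1.0, 0.85, 0.5, 0.0):
    print(value, match_degree(value))
```

    Exposing the cut points as parameters keeps the heuristic choice of thresholds visible and easy to adjust.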

    C. Syntactic Similarity and Semantic Similarity Aggregation

    Let Sim_margen represent the calculated similarity between the marker genes (G1) and the newly presented sample (G2), and let Sim_GO represent the similarity between the genes in the ontology and the presented sample, as defined above.

    The similarity metric in our tumor class prediction and class discovery approach is defined as:

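    $$\mathrm{Sim}(G_1, G_2) = \frac{\alpha \cdot \mathrm{Sim}_{margen}(G_1, G_2) + \beta \cdot \mathrm{Sim}_{GO}(G_1, G_2)}{2} \qquad (3)$$

    $$\alpha + \beta = 1 \qquad (4)$$

    where $\alpha$ and $\beta$ are two parameters for adjusting search performance.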
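    The aggregation can be sketched in a few lines: the two similarities are weighted by two parameters that sum to 1, and the weighted sum is divided by 2. The function name `combined_similarity`, the argument names and the candidate scores below are hypothetical and serve only to illustrate how the weighting shifts the prediction.

```python
def combined_similarity(sim_marker, sim_go, w_marker=0.5):
    """Weighted aggregation of the marker-gene and GO similarities.

    w_marker is the weight of the marker-gene similarity; the GO weight is
    1 - w_marker, so the two weights always sum to 1.
    """
    if not 0.0 <= w_marker <= 1.0:
        raise ValueError("w_marker must lie in [0, 1]")
    w_go = 1.0 - w_marker
    return (w_marker * sim_marker + w_go * sim_go) / 2


# Hypothetical (marker-gene similarity, GO similarity) pairs for two candidate classes.
candidates = {"class_a": (0.9, 0.4), "class_b": (0.6, 0.8)}


def best_class(w_marker):
    """Return the candidate class with the highest combined similarity."""
    return max(candidates, key=lambda name: combined_similarity(*candidates[name], w_marker))


print(best_class(0.5))  # equal weights
print(best_class(0.9))  # expression similarity dominates
```

    With equal weights the GO evidence tips the decision toward the second candidate, while a weight close to 1 lets the expression-based similarity dominate.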

    IV. EMPIRICAL EVALUATION AND RESULTS

    In this section, we present the experimental results obtained with our proposed approach. Experiments were conducted in two different ways in order to evaluate the efficiency of the proposed model:

    a) Tumor prediction using genes expression data in the suggested model.

    b) Tumor prediction using tumor markers in clinical experiments.

    A. Tumor Prediction using Genes Expression Data in the Suggested Model

    In the following sections, we demonstrate the performance of the suggested model using publicly available microarray data sets of colon cancer [11].

    Using Affymetrix oligonucleotide arrays, expression levels of 40 tumor and 22 normal colon tissues were measured for 6500 human genes. A dataset containing intensities of 2000 genes in 22 normal and 40 tumor colon tissues was available from [11].

    The GHSOM training algorithm started with a 2 x 2 SOM at layer 1, based on the artificial unit which represents the mean of all data points at layer 0.
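    The layer 0 unit can be illustrated with a short sketch that computes the mean vector of all data points. The mean quantization error computed alongside it follows the general GHSOM formulation in [7], where it serves as the reference for the growth parameters; it is not spelled out in this text and is included here as an assumption. The function names and the toy data are hypothetical.

```python
import math


def layer0_unit(data):
    """Weight vector of the single layer 0 unit: the mean of all data points."""
    n = len(data)
    dims = len(data[0])
    return [sum(row[d] for row in data) / n for d in range(dims)]


def mean_quantization_error(unit, data):
    """Average Euclidean distance between the unit and the data points mapped to it."""
    return sum(math.dist(unit, row) for row in data) / len(data)


# Hypothetical two-dimensional data points.
data = [[0.0, 2.0], [2.0, 0.0], [2.0, 2.0], [0.0, 0.0]]

unit0 = layer0_unit(data)
mqe0 = mean_quantization_error(unit0, data)
print(unit0, mqe0)
```

    A large error at layer 0 indicates that a single unit represents the data poorly, which is what motivates growing the first 2 x 2 map.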

    Training the GHSOM with parameters m = 0.07 and u = 0.0035 results in a rather deep hierarchical structure of up to 4 layers. The layer 1 map grows to a size of 4 x 3 units, where each unit is expanded at subsequent layers.

    Based on this division of the dominant topical clusters, 12 individual maps were created on layer 2, which represent the various topics of layer 1 in more detail. Each map in layer 2 represents the data of the corr...