
[IEEE 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012) - Istanbul (2012.08.26-2012.08.29)] 2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining - Improving Tumor Identification by Using Tumor Markers Classification Strategy

Post on 30-Mar-2017






  • Improving Tumor Identification by Using Tumor Markers Classification Strategy

    Florije Ismaili Faculty of Contemporary Sciences and Technologies

    SEEU Tetovo, Macedonia


    Luzana Bekiri Biochemical Laboratory Albimedika

    Tetovo, Macedonia lu.zana.b@hotmail.com

    Abstract-Tumor markers are substances, usually proteins, that are produced by the body in response to cancer growth and can be found in the blood, urine, stool, and tumor tissue; more recently, DNA changes have also been used as markers. Thus far, more than 20 different tumor markers have been identified; some are specific to a particular type of cancer, while others are associated with several cancer types. The problem of tumor profiling has been extensively studied by the bioinformatics community. Although tumor classification has improved, there is still no general approach for identifying new cancer classes or for assigning tumors to known classes. In this paper we describe a novel strategy for tumor classification that uses the Growing Hierarchical Self-Organizing Map (GHSOM), since it is able to weigh the contribution of each marker according to its relatedness to other tumor markers and handles highly skewed tumor marker expressions well. Finally, experiments are conducted to demonstrate the feasibility and efficiency of the tumor classification approach, which provides a valuable contribution to the field of oncology and can serve as a guide for the identification of cancer diseases.

    Keywords-cancer classification; tumor markers; tumor prediction.

    I. INTRODUCTION

    Cancer research is one of the major research areas in the medical field. The classification of different tumor types has great value in cancer diagnosis and drug discovery. Most previous cancer classification studies are clinically based and have limited diagnostic ability [1, 2, 3]. Recently, researchers have started to explore the possibilities of retrieving information from microarray gene expression data, which is known to contain the keys for addressing the fundamental problems relating to cancer diagnosis and drug discovery. The advent of the DNA microarray technique has made it possible to monitor the expression of thousands of genes.

    Different classification methods have been applied to tumor classification. Some researchers have focused on the molecular classification of various clinical samples, such as acute leukemia, human cancer cell lines, and brain tumors [1, 2, 3]. Others have focused on the analytical approaches applied to this task, such as support vector machines [4], k-nearest neighbors, weighted voting [1], artificial neural networks [2], and supervised clustering [5].

    Although a large number of methods have been proposed in recent years with promising results, there are still many issues that need to be addressed and understood. For that reason, tumor classification remains a challenging task [6].

    In this paper, we propose two novel classification models: Growing Hierarchical Self-Organizing Map (GHSOM) clustering of gene expression data [7] and Gene Ontology (GO) based clustering of gene expression data [8].

    First, gene expression profiles are arranged in a hierarchy according to their semantic similarity by using the GHSOM clustering technique, followed by prediction of the marker gene set. Second, the Generalized Cosine Similarity is used to measure the similarity between two genes in the Gene Ontology, which provides a controlled vocabulary to describe biological knowledge about genes and gene products and the relationships between them.

    The rest of this paper is organized as follows. Section two provides biological background on tumors and tumor markers. Section three introduces the method for clustering tumors based on gene expression data and for selecting marker genes. Section four presents the method for knowledge-based clustering of tumors. The empirical evaluation and results are given in section five, and section six concludes the paper.


    To better understand the proposed approach for tumor classification, it is worth reviewing some fundamental concepts of molecular biology.

    The main working units of every living system are cells, and all the instructions needed to direct their activities are contained within the chemical deoxyribonucleic acid, or DNA [12].

    The entire DNA sequence that codes for a living thing is called its genome. The genome does not function as one long sequence, but is divided into a set of genes, each of which has a specific and unique purpose. The process of transcribing a gene's DNA sequence into RNA is called gene expression. A gene's expression level provides a measure of the activity of the gene under certain biochemical conditions; specific patterns of gene expression occur during different biological states, such as embryogenesis and cell development, and during normal physiological responses in tissues and cells [12]. A change in the expression values of certain genes indicates certain diseases, such as cancer [13]. Thus, to identify different gene functions and to support cancer diagnosis, research has focused on DNA microarrays and the analysis of gene expression.

    2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining

    978-0-7695-4799-2/12 $26.00 © 2012 IEEE. DOI 10.1109/ASONAM.2012.141

    SAGE (Serial Analysis of Gene Expression) was designed to obtain a quantitative profile of cellular gene expression. It quantifies a tag that represents the transcription product of a gene. The data product of the SAGE technique is a list of tags with their corresponding count values, which is a digital representation of cellular gene expression.
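As an illustration of this "list of tags with count values", the following minimal Python sketch turns a sequence of observed tags into a count table. The tag strings and the function name `sage_profile` are hypothetical and chosen only for the example; they are not taken from the paper or from any SAGE software.

```python
from collections import Counter


def sage_profile(tags):
    """Map each observed tag to the number of times it was seen.

    The resulting table is the 'digital' expression profile: one count per tag.
    """
    return Counter(tags)


# Hypothetical tag reads (made-up sequences, for illustration only).
observed_tags = [
    "AAACCTGGTA",
    "GGTTAACCAT",
    "AAACCTGGTA",
    "TTTGCAGGCA",
    "AAACCTGGTA",
]

profile = sage_profile(observed_tags)

# List tags from most to least abundant.
for tag, count in profile.most_common():
    print(tag, count)
```

A tag that appears more often is read as evidence of a more strongly expressed transcript, which is what makes the count table usable as an expression profile.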

    A. Statistical vs. Biological Significance

    An important area of bioinformatics study is providing biologists with biologically meaningful information about genes, their classification, and the interactions between attributes.

    Information about gene interaction is of great biological relevance for cancer classification. It gives biologists a clearer understanding of the roles that a certain set of genes plays in cancer development and related issues. One important issue is finding marker genes.

    To better understand the role of marker genes, it is necessary to understand the role of tumor markers in biology.

    Tumor markers are substances that are produced by cancer cells or by other cells of the body in response to cancer or certain noncancerous conditions [14]. Most tumor markers are proteins, but recently, patterns of gene expression and changes to DNA have also begun to be used as tumor markers. Tumor markers are used to help detect, diagnose, and manage some types of cancer. In bioinformatics, the role of tumor markers is played by marker genes.

    Marker genes are genes whose expression values are biologically useful for determining the class of the samples. In other words, marker genes are genes that characterize the tumor classes.


    Cancer classification poses two challenges: class discovery and class prediction. Class discovery refers to defining previously unrecognized tumor subtypes. Class prediction refers to the assignment of particular tumor samples to already-defined classes, which could reflect current states or future outcomes. Clustering techniques have proven helpful for understanding microarray gene expression data [4, 6]. Co-expressed genes can be grouped into clusters based on their expression patterns. We employ the unsupervised clustering method GHSOM [7] to cluster the tumor genes.

    The process of hierarchical clustering of the gene expression data consists of the following phases:

    1) GHSOM clustering of genes.
    2) Selecting the optimal marker genes.

    A. GHSOM clustering of genes

    Tumor classification using gene expression data faces major challenges because of the characteristics of the microarray data set, which has a small number of samples and a large number of genes. To best classify the gene expression data and to split a large data set into smaller groups, the GHSOM clustering method is used.

    GHSOM builds a hierarchy of multiple layers, where each layer consists of several independent growing SOMs. The GHSOM architecture is similar to a tree structure in which the SOMs at the upper layers contain global information about the organization of the clusters in the data, while the lowest layers of the hierarchy hold information about the details.
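The tree-like layout described above can be sketched as a small data structure. The Python below is only an illustrative sketch, not the authors' implementation: the class names `Unit` and `GrowingMap`, the helper functions, and the threshold names `tau1` and `tau2` are assumptions. The two growth tests follow the criteria commonly described for GHSOM in the general literature (add units to a map while its mean error stays above a fraction of the parent unit's error; give a unit a child map when its own error exceeds a fraction of the root error). The paper does not state which error measure or threshold values it uses, so the mean Euclidean quantization error and the numbers here are placeholders.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Unit:
    """One SOM unit: a weight vector plus an optional child map one layer below."""
    weight: List[float]
    child: Optional["GrowingMap"] = None


@dataclass
class GrowingMap:
    """One growing SOM inside a layer of the hierarchy."""
    units: List[Unit] = field(default_factory=list)


def depth(som: GrowingMap) -> int:
    """Number of layers in the hierarchy rooted at this map."""
    child_depths = [depth(u.child) for u in som.units if u.child is not None]
    return 1 + max(child_depths, default=0)


def mean_quantization_error(weight, inputs):
    """Average Euclidean distance between a unit's weight and its mapped inputs."""
    if not inputs:
        return 0.0
    return sum(math.dist(weight, x) for x in inputs) / len(inputs)


def needs_more_units(unit_errors, parent_error, tau1):
    """Horizontal growth test (assumed criterion): map-level mean error still too large."""
    return sum(unit_errors) / len(unit_errors) >= tau1 * parent_error


def units_to_expand(unit_errors, root_error, tau2):
    """Hierarchical growth test (assumed criterion): units that should get a child map."""
    return [i for i, e in enumerate(unit_errors) if e > tau2 * root_error]


# Tiny hand-built hierarchy: a top map whose second unit owns a more detailed child map.
detail_map = GrowingMap([Unit([0.9, 0.1]), Unit([0.8, 0.3])])
top_map = GrowingMap([Unit([0.1, 0.1]), Unit([0.85, 0.2], child=detail_map)])

print(depth(top_map))
print(needs_more_units([0.2, 0.8, 0.1], parent_error=1.0, tau1=0.5))
print(units_to_expand([0.2, 0.8, 0.1], root_error=2.0, tau2=0.3))
```

In this sketch the upper map holds the coarse prototypes, and only units whose error remains high are refined by a child map, which mirrors the global-to-detail organization of the layers.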

    In our approach, GHSOM clusters the genes, which are arranged as nodes in hierarchies, where each hierarchy is represented by a SOM. Each such SOM represents a group of genes that are related according to their semantic similarity.

    Figure 1. Simplified workflow to build the tumor hierarchies

    Each cluster has a center, which represents the prototype gene. Here, the prototype gene is meant to represent the genes in a cluster. From a biological point of view, the prototype is characterized by the expression profile that is most similar to those of all the genes in the cluster. Co-expressed genes may belong to the same pathways or have similar functions. In our approach, we select as the prototype the gene that has the minimum total distance to the other genes in the cluster. The set of prototype genes of the clusters represents the marker genes of each hierarchy.
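The selection rule stated above (the gene with the minimum total distance to the other genes in its cluster) can be sketched as follows. This is a minimal illustration under assumptions: the paper does not name its distance measure, so Euclidean distance is used here as a stand-in, and the function name `prototype_gene`, the gene names, and the expression values are all hypothetical.

```python
import math


def prototype_gene(cluster):
    """Return the name of the gene with the smallest total distance to the others.

    cluster: dict mapping a gene name to its expression vector (list of floats).
    Distance is Euclidean here, which is an assumption made for this sketch.
    """
    def total_distance(name):
        return sum(
            math.dist(cluster[name], profile)
            for other, profile in cluster.items()
            if other != name
        )

    return min(cluster, key=total_distance)


# Made-up expression profiles for one cluster (illustration only).
cluster = {
    "gene_a": [1.0, 2.0, 3.0],
    "gene_b": [1.1, 2.1, 2.9],
    "gene_c": [0.9, 1.8, 3.2],
    "gene_d": [4.0, 0.5, 1.0],
}

marker = prototype_gene(cluster)
print(marker)
```

Choosing an actual member of the cluster, rather than an averaged profile, keeps the prototype tied to a real gene, which is what allows it to be reported as a marker gene.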

    The vector representation of marker genes is

