[ieee 2012 international conference on advances in social networks analysis and mining (asonam 2012)...

Improving Tumor Identification by Using Tumor Markers Classification Strategy

Florije Ismaili Faculty of Contemporary Sciences and Technologies

SEEU Tetovo, Macedonia

[email protected]

Luzana Bekiri Biochemical Laboratory “Albimedika”

Tetovo, Macedonia

[email protected]

Abstract—Tumor markers are substances, usually proteins, that are produced by the body in response to cancer growth and can be found in the blood, urine, stool, or tumor tissue; more recently, DNA changes have also been used as markers. Thus far, more than 20 different tumor markers have been identified; some are specific to a particular type of cancer, while others are associated with several cancer types. The problem of tumor profiling has been studied extensively by the bioinformatics community. Although tumor classification has improved, there is still no general approach for identifying new cancer classes or for assigning tumors to known classes. In this paper we describe a novel strategy for tumor classification based on the Growing Hierarchical Self-Organizing Map (GHSOM), which is able to weigh the contribution of each marker according to its relatedness to other tumor markers and handles highly skewed tumor marker expression well. Finally, experiments are conducted to demonstrate the feasibility and efficiency of the approach, which may contribute to oncology and serve as a guide for the identification of cancer diseases.

Keywords-cancer classification; tumor markers; tumor prediction.

I. INTRODUCTION

Cancer research is one of the major research areas in the medical field. The classification of different tumor types is of great value in cancer diagnosis and drug discovery. Most previous cancer classification studies are clinical-based and have limited diagnostic ability [1, 2, 3]. Recently, researchers have started to explore the possibilities of retrieving information from microarray gene expression data, which is known to contain the keys for addressing fundamental problems relating to cancer diagnosis and drug discovery. The advent of the DNA microarray technique has made it possible to monitor the expression of thousands of genes.

Different classification methods have been applied to tumor classification. Some researchers have focused on the molecular classification of various clinical samples, such as acute leukemia, human cancer cell lines, and brain tumors [1, 2, 3]. Others have focused on the analytical approaches applied to this task, such as support vector machines [4], k-nearest neighbors, weighted voting [1], artificial neural networks [2], and supervised clustering [5].

Although a large number of methods have been proposed in recent years with promising results, many issues still need to be addressed and understood. For that reason, tumor classification remains a challenging task [6].

In this paper, we propose two novel classification models: Growing Hierarchical Self-Organizing Map (GHSOM) clustering of gene expression data [7] and Gene Ontology (GO) based clustering of gene expression data [8].

First, gene expression profiles are arranged in a hierarchy according to their semantic similarity by using the GHSOM clustering technique, followed by prediction of the marker gene set. Second, the Generalized Cosine Similarity is used to measure the similarity between two genes in the Gene Ontology, which provides a controlled vocabulary to describe biological knowledge about genes and gene products and the relationships between them.

The rest of this paper is organized as follows. Section II provides biological background on tumors and tumor markers. Section III introduces the clustering of tumors based on gene expression data and the selection of marker genes, and also presents the knowledge-based clustering of tumors. The empirical evaluation and results are given in Section IV, and Section V concludes the paper.

II. BIOLOGICAL BACKGROUND INFORMATION AND PROBLEM STATEMENT

In order to better understand the proposed approach for tumor classification, it is worth reviewing some fundamentals of molecular biology.

The main working units of every living system are cells; all the instructions needed to direct their activities are contained within the chemical deoxyribonucleic acid, or DNA [12].

The entire DNA sequence that codes for a living thing is called its genome. The genome does not function as one long sequence but is divided into a set of genes, each of which has a specific and unique purpose. The process of transcribing a gene’s DNA sequence into RNA is called gene expression. A gene’s expression level provides a measure of the activity of the gene under certain biochemical conditions; specific patterns of gene expression occur during different biological states such as embryogenesis and cell development, and during normal physiological responses in tissues and cells [12]. Changes in the expression values of certain genes indicate certain diseases, such as cancer [13]. Thus, to identify different gene functions and to support cancer diagnosis, research has focused on DNA microarrays and the analysis of gene expression.

2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining

978-0-7695-4799-2/12 $26.00 © 2012 IEEE

DOI 10.1109/ASONAM.2012.141

To obtain a quantitative profile of cellular gene expression, SAGE (Serial Analysis of Gene Expression) was designed; it quantifies a “tag” that represents the transcription product of a gene. The data product of the SAGE technique is a list of tags with their corresponding count values, which is a digital representation of cellular gene expression.
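As a minimal sketch of such a data product, the following Python fragment turns a list of observed tags into a table of counts. The tag strings are made up for illustration, and the steps that precede counting in a real SAGE experiment (tag extraction, mapping of tags to genes) are not shown.

```python
from collections import Counter


def tag_counts(tags):
    """Return a mapping from each tag to the number of times it was observed."""
    return Counter(tags)


# Hypothetical tags; a real SAGE library contains many thousands of tags.
observed = ["CATGAAACCT", "CATGTTTGGA", "CATGAAACCT", "CATGCCCGTA", "CATGAAACCT"]
profile = tag_counts(observed)

# List the tags from most to least frequent.
for tag, count in profile.most_common():
    print(tag, count)
```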

A. Statistical vs. Biological Significance

An important area of bioinformatics study is to provide biologists with biologically meaningful information about genes, their classification, and the interactions between attributes.

Information about gene interaction is of great biological relevance for cancer classification. It provides the biologists a clearer understanding of the roles a certain set of genes play in cancer development and related issues. One important issue is to find marker genes.

In order to better understand the role of marker genes, it is necessary to understand the role of tumor markers in biology.

Tumor markers are substances that are produced by cancer cells or by other cells of the body in response to cancer or certain noncancerous conditions [14]. Most tumor markers are proteins, but recently, patterns of gene expression and changes to DNA have also begun to be used as tumor markers. Tumor markers are used to help detect, diagnose, and manage some types of cancer. In bioinformatics, the role of tumor markers is played by marker genes.

Marker genes are genes whose expression values are biologically useful for determining the class of the samples. In other words, marker genes are genes that characterize the tumor classes.

III. GENE’S EXPRESSION DATA BASED CLUSTERING OF THE TUMORS AND MARKER GENES SELECTION

Cancer classification poses two challenges: class discovery and class prediction. Class discovery refers to defining previously unrecognized tumor subtypes. Class prediction refers to the assignment of particular tumor samples to already-defined classes, which could reflect current states or future outcomes. Clustering techniques have proven helpful for understanding microarray gene expression data [4, 6]. Co-expressed genes can be grouped in clusters based on their expression patterns. We have employed the unsupervised clustering method GHSOM [7] in order to cluster the tumor genes.

The process of hierarchical clustering of the gene expression data consists of the following phases:

• GHSOM clustering of genes.
• Selecting the optimal marker genes.

A. GHSOM clustering of genes

Tumor classification using gene expression data is challenging because of the characteristics of microarray data sets, which have a small number of samples and a large number of genes. In order to best classify the gene expression data and to split a large data set into smaller groups, the GHSOM clustering method is used.

GHSOM can build a hierarchy of multiple layers where each layer consists of several independent growing SOMs. The GHSOM architecture is similar to a tree structure in which the SOMs at the upper layers contain global information about the organization of the clusters in the data, while the lowest layers of the hierarchy hold information about the details.

In our approach, GHSOM clusters the genes, which are arranged as nodes in hierarchies, where each hierarchy is represented by a SOM. Each such SOM represents a group of genes that are related according to their semantic similarity.

Figure 1. Simplified workflow to build the tumor hierarchies

Each cluster has a center, which represents the prototype gene. The prototype gene is meant to represent the genes in a cluster. From a biological point of view, the prototype is characterized by the expression profile that is most similar to those of all genes in the cluster; co-expressed genes may belong to the same pathways or have a similar function. In our approach, we select as the prototype the gene that has the minimum total distance to the other genes in the cluster. The set of prototype genes of the clusters represents the marker genes of each hierarchy.
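The prototype selection rule can be illustrated with a short Python sketch that picks, within one cluster, the gene with the smallest summed distance to all other genes. The gene names and expression values are hypothetical, and Euclidean distance is assumed as the distance measure.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length expression vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def prototype_gene(cluster):
    """Return the name of the gene with the minimum total distance to the others.

    `cluster` maps a gene name to its expression vector.
    """
    def total_distance(name):
        return sum(euclidean(cluster[name], other) for other in cluster.values())

    return min(cluster, key=total_distance)


# Hypothetical cluster: three genes measured in three samples.
cluster = {
    "geneA": [1.0, 2.0, 3.0],
    "geneB": [1.1, 2.1, 2.9],
    "geneC": [3.0, 0.5, 4.0],
}
print(prototype_gene(cluster))
```

This brute-force search compares every pair of genes, which is adequate for small clusters; the paper does not state how the search is implemented.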

The vector representation of the marker genes is used to calculate the expression distance between the marker genes and a newly presented sample, in order to predict the most similar cluster. For this purpose, the Euclidean distance [7] between the two vectors is used.
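A minimal Python sketch of this prediction step is given below: the sample is assigned to the cluster whose marker gene vector lies closest in Euclidean distance. The cluster identifiers and all numeric values are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length expression vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_cluster(sample, markers):
    """Return (cluster_id, distance) for the marker vector closest to `sample`.

    `markers` maps a cluster identifier to the expression vector of its marker gene.
    """
    best = min(markers, key=lambda cid: euclidean(sample, markers[cid]))
    return best, euclidean(sample, markers[best])


# Hypothetical marker vectors and a hypothetical new sample.
markers = {"cluster_1": [1.0, 2.0, 3.0], "cluster_2": [4.0, 0.5, 1.0]}
sample = [3.5, 1.0, 1.5]
cluster_id, dist = nearest_cluster(sample, markers)
print(cluster_id, round(dist, 3))
```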

B. Knowledge Based Clustering of the Tumors

Expression-based clustering does not always result in clusters that are biologically similar. To help select the clustering algorithm best suited to producing “meaningful” clusters, a cluster assessment that incorporates biological information is needed.

Recently, clustering techniques have begun to incorporate the valuable biological information present in the Gene Ontology. Most of the work in this direction focuses on using the Gene Ontology for cluster validation [9]. Gene Ontology (GO) [8] provides a controlled vocabulary to describe biological knowledge about genes and gene products and the relationships between them. Our method is based on the definitions and examples provided in [9], which is a repository for all information in GO. The three components of GO are molecular function, biological process, and cellular component. A gene product can have one or more molecular functions, participate in one or more biological processes, and be a part of one or more cellular components.

GO terms are organized in a Directed Acyclic Graph (DAG), such that child terms are more specialized than parent terms. Each GO term is annotated with a list of gene products, while each edge of the DAG, which represents the relationship between the GO terms it connects, has a type property that is either “is a” or “part of”.

By providing a standard vocabulary across biological resources, the GO enables researchers to use this information for automated data analysis.
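The structure described above can be sketched in Python as a list of typed edges plus a table of annotations. The term and gene product names are placeholders rather than real GO identifiers, and the representation is only one of several reasonable choices.

```python
# Each edge is (child, parent, relation); the relation is "is_a" or "part_of".
EDGES = [
    ("term_b", "term_a", "is_a"),
    ("term_c", "term_a", "part_of"),
    ("term_d", "term_b", "is_a"),
    ("term_d", "term_c", "is_a"),  # a DAG node may have more than one parent
]

# Gene products annotated to each term (hypothetical names).
ANNOTATIONS = {
    "term_a": ["gp1"],
    "term_b": ["gp1", "gp2"],
    "term_c": ["gp3"],
    "term_d": ["gp2", "gp3"],
}


def parents(term):
    """Direct parents of a term together with the relation type of each edge."""
    return [(parent, rel) for child, parent, rel in EDGES if child == term]


def ancestors(term):
    """All terms reachable by following child-to-parent edges upward."""
    seen = set()
    stack = [term]
    while stack:
        for parent, _ in parents(stack.pop()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(ancestors("term_d")))
print(parents("term_c"))
```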

To determine the closeness between GO nodes, the GO process ontology is employed. Our algorithm takes into consideration the semantic distance between concepts in the tree representation of the ontology. Given two terms or collections of elements C1 and C2, the semantic distance is defined as the similarity of the concepts under the subClassOf relation.

Here we take into consideration the depth of each node and the lowest common ancestor (LCA), which is the node of greatest depth that is an ancestor of both C1 and C2. The semantic similarity is defined as follows:

$$\mathrm{Sim}(C_1, C_2) = \frac{2 \cdot \mathrm{depth}(\mathrm{LCA}(C_1, C_2))}{\mathrm{depth}(C_1) + \mathrm{depth}(C_2)} \qquad (1)$$
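As an illustration of Eq. (1), the following Python sketch computes the similarity on a small tree. The ontology is hypothetical, it is stored as a child-to-parent table (a tree rather than a full DAG), and depth is counted in nodes so that the root has depth 1; the paper does not state its depth convention, so this is an assumption.

```python
# Hypothetical ontology tree given as child -> parent; the root has parent None.
PARENT = {
    "root": None,
    "process": "root",
    "metabolism": "process",
    "signaling": "process",
    "lipid_metabolism": "metabolism",
}


def path_to_root(term):
    """Terms from `term` up to the root, inclusive."""
    path = []
    while term is not None:
        path.append(term)
        term = PARENT[term]
    return path


def depth(term):
    """Depth counted in nodes, so the root has depth 1 (assumed convention)."""
    return len(path_to_root(term))


def lca(c1, c2):
    """Lowest common ancestor: the deepest term that is an ancestor of both."""
    ancestors_c1 = set(path_to_root(c1))
    for term in path_to_root(c2):  # walks upward, so the first hit is the deepest
        if term in ancestors_c1:
            return term
    raise ValueError("terms share no ancestor")


def sim(c1, c2):
    """Semantic similarity: 2 * depth(LCA) / (depth(C1) + depth(C2))."""
    return 2 * depth(lca(c1, c2)) / (depth(c1) + depth(c2))


print(sim("lipid_metabolism", "signaling"))
```

With this convention a term compared with itself has similarity 1, and the value decreases as the lowest common ancestor moves toward the root.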

To determine the similarity between genes in the ontology (GO) and the presented sample (GP), the Generalized Cosine Similarity is used [10]. If cos θ is equal to 1, the gene in the ontology matches the sample exactly. The smaller the value of cos θ, the farther the gene in the ontology deviates from the sample. When cos θ is equal to 0, the gene in the ontology fails to match the sample. The matching degrees derived from the cosine value are defined as follows:

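$$\mathrm{Degree}(GO, GP) = \begin{cases} \text{Exact}, & \text{if } \cos\theta = 1.0 \\ \text{Plug-In}, & \text{if } \cos\theta \geq \alpha \\ \text{Subsume}, & \text{if } \cos\theta \leq \beta \\ \text{Fail}, & \text{if } \cos\theta \leq \lambda \end{cases} \qquad (2)$$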

The values of α, β, and λ are assigned to represent the strength of the match and are based simply on a heuristic decrease of the match values.

The defined distance measure allows us to calculate the distance between any two genes and to construct the GO distance matrix. An n × m table T is created, where n is the total number of genes and m is the number of nodes in the GO DAG.
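The degree assignment of Eq. (2) can be sketched in Python as follows, reading the cases as Exact for cos θ = 1, Plug-In for cos θ ≥ α, Subsume for cos θ ≤ β, and Fail for cos θ ≤ λ. Several points are assumptions made only for this illustration: the threshold values, the ordering λ ≤ β < α, the decision to test Fail before Subsume so that Fail remains reachable, and the label returned for values that fall between β and α, which the four cases do not cover.

```python
def match_degree(cos_theta, alpha=0.8, beta=0.5, lam=0.2):
    """Map a cosine value in [0, 1] to a matching degree.

    The default thresholds are hypothetical; the paper does not list its values.
    """
    if not 0.0 <= cos_theta <= 1.0:
        raise ValueError("cos_theta is expected to lie in [0, 1]")
    if cos_theta == 1.0:
        return "Exact"
    if cos_theta >= alpha:
        return "Plug-In"
    if cos_theta <= lam:   # tested before Subsume so that Fail stays reachable
        return "Fail"
    if cos_theta <= beta:
        return "Subsume"
    return "Unclassified"  # gap between beta and alpha (assumed handling)


for value in (1.0, 0.9, 0.6, 0.4, 0.1):
    print(value, match_degree(value))
```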

C. Syntactic Similarity and Semantic Similarity Aggregation

Let Sim_margen represent the calculated similarity between the marker genes (G1) and the newly presented sample (G2), and let Sim_GO represent the similarity between the genes in the ontology and the presented sample, as defined above.

The similarity metric in our tumor class prediction and class discovery approach is defined as:

$$\mathrm{Sim}(G_1, G_2) = \frac{\alpha \cdot \mathrm{Sim}_{margen}(G_1, G_2) + \beta \cdot \mathrm{Sim}_{GO}(G_1, G_2)}{2} \qquad (3)$$

$$\alpha + \beta = 1 \qquad (4)$$

where α and β are two parameters for adjusting search performance.
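A minimal Python sketch of the aggregation in Eqs. (3) and (4) follows. The function name and the example values are hypothetical; the second weight is derived from the first so that the two always sum to 1.

```python
def combined_similarity(sim_margen, sim_go, alpha=0.5):
    """Weighted aggregation of Eq. (3), with beta = 1 - alpha as required by Eq. (4)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    beta = 1.0 - alpha
    return (alpha * sim_margen + beta * sim_go) / 2


# Hypothetical similarity values for one sample.
print(combined_similarity(0.8, 0.6, alpha=0.7))
```

Because of the division by 2, the aggregated value cannot exceed 0.5 when both input similarities lie in [0, 1].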

IV. EMPIRICAL EVALUATION AND RESULTS

In this section, we present the experimental results obtained with our proposed approach. Experiments were conducted in two different ways in order to evaluate the efficiency of the proposed model:

a) Tumor prediction using genes expression data in the suggested model.

b) Tumor prediction using tumor markers in clinical experiments.

A. Tumor Prediction using Genes Expression Data in the Suggested Model

In this section, we demonstrate the performance of the suggested model using a publicly available microarray data set of colon cancer [11].

Using Affymetrix oligonucleotide arrays, expression levels of 40 tumor and 22 normal colon tissues were measured for 6500 human genes. A dataset containing intensities of 2000 genes in 22 normal and 40 tumor colon tissues was available from [11].

The GHSOM training algorithm started with a 2 x 2 SOM at layer 1, based on the artificial unit at layer 0, which represents the mean of all data points.

Training the GHSOM with parameters m = 0.07 and u = 0.0035 results in a rather deep hierarchical structure of up to 4 layers. The layer 1 map grows to a size of 4 x 3 units, where each unit is expanded at subsequent layers.

Based on this division of the dominant topical clusters, 12 individual maps were created on layer 2, each representing the data of the corresponding layer 1 unit in more detail. Some of the maps in layer 2 were further expanded as distinct SOMs in layer 3.
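The role of the two parameters can be sketched in Python under the assumption that m and u correspond to the breadth and depth thresholds of the GHSOM algorithm described in [7]; the text does not define them, so this reading is an interpretation. The data points and unit errors are hypothetical, and the SOM training itself is not shown.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_vector(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def mean_quantization_error(vectors, weight):
    """Mean distance between a unit's weight vector and the data mapped to it."""
    return sum(euclidean(v, weight) for v in vectors) / len(vectors)


def map_should_grow(unit_mqes, parent_mqe, m):
    """Breadth criterion: keep adding units while the map's mean error is too high."""
    return sum(unit_mqes) / len(unit_mqes) >= m * parent_mqe


def unit_should_expand(unit_mqe, mqe0, u):
    """Depth criterion: expand a unit into a child map while its error is too high."""
    return unit_mqe > u * mqe0


# Layer 0: a single artificial unit whose weight is the mean of all data.
data = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
mqe0 = mean_quantization_error(data, mean_vector(data))

print(map_should_grow([0.5, 0.2, 0.1, 0.05], mqe0, m=0.07))
print(unit_should_expand(0.001, mqe0, u=0.0035))
```

Under this reading, a smaller m forces each map to grow larger before it is accepted, which leaves less error for the lower layers to resolve.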

In order to test the effect of different parameter settings, we trained a second GHSOM with m set to half of the previous value (m = 0.035), while keeping the same u. This leads to a shallower hierarchical structure in which the layer 1 map grew to a size of 5 x 4 units.

These units were further expanded only up to 2 layers. However, due to the large size of the resulting first-layer map, a good part of the data is already represented at this layer. As a result, some larger clusters are represented by two neighboring units at the first layer rather than being split up in a lower layer of the hierarchy.

In order to use the results in further development, we chose the deep hierarchy which was more appropriate for constructing the map matrices.

Several experiments were performed in order to evaluate the performance of the proposed method in comparison with the well-known SOM neural network for cancer prediction.

Figure 2. Accuracy of Classification for Different Gene Samples

Figure 2 shows the accuracy obtained for classification. The figure shows that the proposed technique achieved better accuracy for all the samples used for classification, which supports the relevance of this work.

B. Clinical Analysis of Tumor Prediction using Tumor Markers

Clinical experiments on tumor prediction using tumor markers were conducted in order to verify the results of our proposed approach against real-life results. The experiments were conducted at “Albimedika”, a biochemical laboratory in Tetovo.

Patient analysis is performed on venous blood. The serum for further analysis is obtained through blood centrifugation. During the analysis, the levels of the Ca 19-9, Ca 125 II, and Ca 15-3 tumor markers in blood serum are determined using the VIDAS apparatus. This is a method for detecting IgG antibodies in human serum or plasma (EDTA) using the ELFA technique (Enzyme-Linked Fluorescent Assay).

Results are calculated automatically by the apparatus using the calibration curve stored in the device (4-parameter logistic model); concentrations are expressed in U/ml.
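To illustrate how a 4-parameter logistic calibration curve converts a measured signal into a concentration, the following Python sketch uses the common parameterization y = d + (a - d) / (1 + (x / c)^b) and its inverse. The parameter values are hypothetical and are not those stored in the VIDAS device, whose exact parameterization is not given in the text.

```python
def four_pl(x, a, b, c, d):
    """Signal predicted for concentration x by a 4-parameter logistic curve.

    a: response at zero concentration, d: response at infinite concentration,
    c: inflection point, b: slope factor.
    """
    return d + (a - d) / (1.0 + (x / c) ** b)


def concentration(y, a, b, c, d):
    """Invert the curve: the concentration that produces signal y."""
    return c * ((a - d) / (y - d) - 1.0) ** (1.0 / b)


# Hypothetical calibration parameters, chosen only for illustration.
params = dict(a=50.0, b=1.2, c=80.0, d=4000.0)

signal = four_pl(35.0, **params)  # simulate a measurement at 35 U/ml
print(round(concentration(signal, **params), 6))
```

The inverse is only defined for signals strictly between a and d, so a measurement outside the calibrated range cannot be converted this way.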

Samples with concentrations higher than 500 U/ml should be tested again after dilution in (R1). If the dilution factor was not entered when the work list was written, the result obtained must be multiplied by the dilution factor in order to obtain the concentration of the sample.

The patient's history and the results of other tests should be taken into consideration when interpreting the results.

Figure 3. The level of Ca 19-9 in different human age groups.

Based on the clinical analysis conducted on patients of different age groups, from 0 to 80 years, of whom around 85% were women and about 15% were men, we obtained the following results:

The presented graphs show the average values of the Ca 19-9, Ca 15-3, and Ca 125 II tumor markers according to sex and age group. The Ca 19-9, Ca 125 II, and Ca 15-3 tumor markers are key identifiers of tumors of various organs such as the pancreas, liver, lungs, digestive tract, and ovaries. The graphs show that in all cases female patients have higher levels of tumor markers. The genome of the patients diagnosed with cancer (3%) is preprocessed and introduced to our algorithm. Our algorithm shows 93% accuracy with respect to the real results.

Our results may serve as potential indicators for the early identification of tumors; based on these results, measures for the prevention and treatment of cancer can be taken.

Figure 4. The level of CA 15-3 in different female age groups.


Figure 5. The level of Ca 125 II in different female age groups.

V. CONCLUSION

Classifying different types of tumors has become one of the most important research topics in medicine, with the aim of simplifying cancer diagnosis and drug discovery, since in the past cancer categorization was generally based on morphological and clinical analysis.

Although different classification techniques have been developed for cancer classification, there are still many drawbacks in their classification capability.

To help improve cancer classification, in this paper we proposed a novel approach based on gene expression data, in which gene expression profiles are arranged in a hierarchy according to their semantic similarity by using the GHSOM clustering technique.

According to the results, expression-based clustering does not always produce clusters that are biologically similar. To overcome this problem, we have incorporated the biological information present in the Gene Ontology. The Generalized Cosine Similarity is used to measure the similarity between two genes in the Gene Ontology.

Finally, experiments were conducted to assess the accuracy and efficiency of the proposed approach in support of the prevention and treatment of cancer.

REFERENCES

[1] Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, and Lander ES, “Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring,” Science 1999, 286:531-537.

[2] Khan J, Wei JS, Ringner M, Saal LH, Ladanyi M, Westermann F, Berthold F, Schwab M, Antonescu CR, Peterson C, and Meltzer PS, “Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks,” Nat Med 2001, 7:673-679.

[3] Pomeroy SL, Tamayo P, Gaasenbeek M, Sturla LM, Angelo M, Mclaughlin ME, Kim JYH, Goumnerova LC, Black PM, Lau C, Allen JC, Zagzag D, Olson JM, Curran T, Wetmore C, Biegel JA, Poggio T, Mukherjee S, Rifkin R, Califano A, Stolovitzky G, Louis DN, Mesirov JP, Lander ES, and Golub TR, “Prediction of central nervous system embryonal tumor outcome based on gene expression,” Nature 2002, 415:436-442.

[4] Yeang CH, Ramaswamy S, Tamayo P, Mukherjee S, Rifkin RM, Angelo M, Reich M, Lander E, Mesirov, and Golub T, “Molecular classification of multiple tumor types,” Bioinformatics 2001, 17:S316-S322.

[5] Dettling M, and Buhlmann P, “Supervised clustering of genes,” Genome Biol 2002, 3:12.

[6] Dudoit S, Fridlyand J, and Speed TP, “Comparison of discrimination methods for the classification of tumors using gene expression data,” J Am Stat Assoc 2002, 97:77-87.

[7] Andreas R., Dieter M., and Michael D., “The growing hierarchical self-organizing map: Exploratory analysis of high dimensional data,” IEEE Transactions on Neural Networks, 13:1331-1341, 2002.

[8] http://www.geneontology.org

[9] Nora Speer, Christian Spieth, and Andreas Zell, “Biological Cluster Validity Indices Based on the Gene Ontology,” LNCS, pp. 429-439, 2005.

[10] Prasanna G., Hector G. M., and Jennifer W., “Exploiting hierarchical domain structure to compute similarity,” ACM Transactions on Information Systems, 21(1):64-93, 2003.

[11] http://microarray.princeton.edu/oncology/affydata/index.html

[12] P. Russel, Fundamentals of Genetics. Addison Wesley Longman Inc., 2000.

[13] A. Ben-Dor, L. Bruhn, N. Friedman, I. Nachman, M. Schummer, and Z. Yakhini, “Tissue classification with gene expression profiles,” in Proc. of the Fourth Annual Int. Conf. on Computational Molecular Biology, 2000.

[14] Tumor Markers, National Cancer Institute, http://www.cancer.gov/cancertopics/factsheet/detection/Fs5_18.pdf, 2011.
