

Computational cell cycle analysis of single cell RNA-seq data

Marmar Moussa
Computer Science and Engineering
University of Connecticut
CT, USA
[email protected]

Abstract—The variation in the gene expression profiles of single cells in different phases of the cell cycle can be a leading source of variance between cells and can interfere with the functional analysis of the transcriptomic data. In this work, we review some of the few methods available to analyze cell cycle stages in scRNA-seq data and present a computational method for ordering single-cell transcriptional profiles according to their cell cycle phases.

Index Terms—Single cell RNA-seq, cell cycle

I. BACKGROUND

For this ongoing work we reviewed three existing methods for cell cycle analysis. The first, reCAT [1], reconstructs the cell cycle time-series by modeling the reconstruction as a traveling salesman problem (TSP). The second, cyclone [2], is a classification algorithm based on selecting pairs of genes whose relative expression has a sign that changes with the cell-cycle phase in the training data; the gene pairs are used to quantify the evidence that a given cell is in G1, S or G2M phase. The third, ccRemover [3], attempts to identify and remove the cell-cycle effect from the transcriptional profiles by identifying the loadings of the principal components that contribute to the cell cycle and subtracting their projection on all genes from the data. In the next section we examine a PCA and t-SNE based method for ordering cells according to their cell cycle stage.
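The pair-based idea behind cyclone can be illustrated with a minimal sketch. This is not the implementation from [2]: the gene names and the pair lists below are hypothetical placeholders (real pairs are selected from labeled training data), and the score is simply the fraction of pairs whose expression difference has the sign expected for a phase.

```python
# Hypothetical marker pairs: for each phase, (a, b) means that gene a is
# expected to be expressed higher than gene b in that phase.
PHASE_PAIRS = {
    "G1": [("geneA", "geneB"), ("geneC", "geneD")],
    "S": [("geneB", "geneC"), ("geneD", "geneA")],
    "G2M": [("geneB", "geneA"), ("geneD", "geneC")],
}


def phase_scores(expr, phase_pairs=PHASE_PAIRS):
    """Return, per phase, the fraction of informative pairs with the expected sign."""
    scores = {}
    for phase, pairs in phase_pairs.items():
        hits = 0
        total = 0
        for a, b in pairs:
            if expr[a] == expr[b]:
                continue  # a tie carries no sign information
            total += 1
            hits += expr[a] > expr[b]
        scores[phase] = hits / total if total else 0.0
    return scores


def assign_phase(expr, phase_pairs=PHASE_PAIRS):
    """Pick the phase with the highest pair-based score."""
    scores = phase_scores(expr, phase_pairs)
    return max(scores, key=scores.get)


# One toy cell: expression values keyed by (hypothetical) gene name.
cell = {"geneA": 9.0, "geneB": 2.0, "geneC": 7.0, "geneD": 1.0}
print(phase_scores(cell))
print(assign_phase(cell))
```

Because only the sign of each within-cell comparison is used, a score of this kind does not depend on how the expression values are scaled per cell.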

II. METHODS

Data sets: There are few scRNA-seq data sets where the cell-cycle status of each cell is known a priori. For this work, we used data from the scRNA-seq experiment in [4] on mouse embryonic stem cells (mESC) that were stained with Hoechst 33342 and sorted by flow cytometry into the G1, S and G2M stages of the cell cycle. Single cell RNA-seq was performed using Fluidigm C1, and libraries were generated using the Nextera XT (Illumina) kit. We also used labeled undifferentiated human embryonic stem cells (hESCs) from [5].

PCA and t-SNE based cell cycle order inference: This method is based on the assumption that the first few principal components (PCs) of a set of annotated cell cycle marker genes are sufficient for constructing a cell-to-cell covariance matrix that reflects the variation due to the cell cycle stages [2], [3], [4]. First, the PCs are calculated using the RNA-seq counts sub-matrix of the cells and the cell cycle marker genes only. Marker genes annotated to the cell cycle are taken from the Gene Ontology DB [6] and CycleBase [7]. Then a 3-component t-SNE transformation is performed on the first few PCs. Finally, the cells are ordered based on their similarity, with the t-SNE components as features, using a hierarchical clustering algorithm. As part of ongoing investigations we are testing how to decide on the number of PCs to use based on their explained variance scores; we have also tested average, single and Ward linkage algorithms for obtaining the cell order, as well as several similarity metrics.
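The three steps described above (PCA on the marker sub-matrix, 3-component t-SNE, hierarchical clustering) can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's code: scikit-learn and SciPy are used as stand-in implementations, the function name `order_cells`, the `log1p` transform, the perplexity value and the synthetic counts are all assumptions, and the leaf order of the clustering tree is taken as the cell order.

```python
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE


def order_cells(counts, marker_idx, n_pcs=5, method="average",
                metric="cosine", perplexity=10, seed=0):
    """Order cells by PCA -> 3-component t-SNE -> hierarchical clustering.

    counts: (n_cells, n_genes) array of counts.
    marker_idx: column indices of the cell cycle marker genes.
    Returns the cell indices in dendrogram leaf order.
    """
    # Sub-matrix of marker genes only; the log transform is an assumption.
    sub = np.log1p(counts[:, marker_idx].astype(float))
    # First few principal components of the marker sub-matrix.
    pcs = PCA(n_components=n_pcs, random_state=seed).fit_transform(sub)
    # 3-component t-SNE on the PCs (perplexity must be below n_cells).
    emb = TSNE(n_components=3, perplexity=min(perplexity, len(sub) - 1),
               init="pca", random_state=seed).fit_transform(pcs)
    # Hierarchical clustering with the t-SNE components as features.
    tree = linkage(emb, method=method, metric=metric)
    return leaves_list(tree)


# Synthetic data: each gene peaks at a random position of a hidden cycle.
rng = np.random.default_rng(0)
n_cells, n_genes = 60, 40
cell_angle = rng.uniform(0, 2 * np.pi, n_cells)
gene_peak = rng.uniform(0, 2 * np.pi, n_genes)
rate = 20 * (1 + np.cos(cell_angle[:, None] - gene_peak[None, :])) + 1
counts = rng.poisson(rate)

marker_idx = np.arange(20)  # pretend the first 20 genes are annotated markers
order = order_cells(counts, marker_idx)
print(order[:10])
```

The linkage method and metric are left as parameters because the text reports trying several of them. Note that in SciPy, Ward linkage on raw observations requires the Euclidean metric, so `method="ward"` would need `metric="euclidean"`.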

This work was partially supported by NSF Award 1564936.

Fig. 1. Left: hESC cells' order based on the reCAT method. Right: hESC cells' order using the PCA and t-SNE based analysis.

In Fig. 1 we show the hESC set with known cell cycle phases, ordered according to reCAT (left) and according to the PCA then t-SNE based analysis (right), using the first 15 PCs, the cosine similarity, 3-component t-SNE and the average linkage algorithm. Colors in the top bar of the heat map indicate the true cell cycle phases (G1 black, S green and G2 red). The log expression profile of the cells is plotted in the heat map, and genes are ordered based on their cosine similarity (row hierarchy). The PCA and t-SNE based method does indeed order the cells from the same phase together, whereas with reCAT only the G1 cells (black) form a contiguous group that is not interrupted by cells from other phases.
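When the true phases are known, as in Fig. 1, the statement that cells of a phase "group together uninterrupted" can be checked mechanically. The sketch below is only an illustration with made-up label sequences; it is not a performance measure proposed in this work, and the helper names are hypothetical.

```python
from itertools import groupby


def phase_runs(ordered_labels):
    """Collapse an ordered label sequence into (phase, run_length) pairs."""
    return [(phase, sum(1 for _ in group))
            for phase, group in groupby(ordered_labels)]


def is_uninterrupted(ordered_labels, phase):
    """True if all cells of `phase` form exactly one contiguous run."""
    runs = phase_runs(ordered_labels)
    return sum(1 for p, _ in runs if p == phase) == 1


# Made-up orderings: true phase labels listed in the inferred cell order.
grouped = ["G1"] * 3 + ["S"] * 2 + ["G2"] * 3
mixed = ["G1", "G1", "G1", "S", "G2", "S", "G2", "G2"]

print(phase_runs(grouped))
print(phase_runs(mixed))
print(is_uninterrupted(mixed, "G1"), is_uninterrupted(mixed, "S"))
```

Since the cell cycle is circular, a phase that wraps around the two ends of a linear ordering would be counted as two runs here; a check that accounts for this would need to treat the sequence as cyclic.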

III. FUTURE WORK

The analysis of dividing cells based on the first few PCs of their cell cycle related genes seems to help reconstruct the order of cells. Still, a performance measure is needed for cases where the true order of cells is not known. Assigning the ordered cells to specific phases of the cell cycle is also part of the future work for this method.

REFERENCES

[1] Z. Liu, H. Lou, K. Xie, H. Wang, N. Chen, O. M. Aparicio, M. Q. Zhang, R. Jiang, and T. Chen, “Reconstructing cell cycle pseudo time-series via single-cell transcriptome data,” Nature Communications, vol. 8, no. 1, p. 22, 2017.

[2] A. Scialdone, K. N. Natarajan, L. R. Saraiva, V. Proserpio, S. A. Teichmann, O. Stegle, J. C. Marioni, and F. Buettner, “Computational assignment of cell-cycle stage from single-cell transcriptome data,” Methods, vol. 85, pp. 54–61, 2015.

[3] M. Barron and J. Li, “Identifying and removing the cell-cycle effect from single-cell RNA-sequencing data,” Scientific Reports, vol. 6, 2016.

[4] F. Buettner, K. N. Natarajan, F. P. Casale, V. Proserpio, A. Scialdone, F. J. Theis, S. A. Teichmann, J. C. Marioni, and O. Stegle, “Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells,” Nature Biotechnology, vol. 33, no. 2, pp. 155–160, 2015.

[5] N. Leng, L.-F. Chu, C. Barry, Y. Li, J. Choi, X. Li, P. Jiang, R. M. Stewart, J. A. Thomson, and C. Kendziorski, “Oscope identifies oscillatory genes in unsynchronized single-cell RNA-seq experiments,” Nature Methods, vol. 12, no. 10, p. 947, 2015.

[6] G. O. Consortium, “The Gene Ontology (GO) database and informatics resource,” Nucleic Acids Research, vol. 32, no. suppl 1, pp. D258–D261, 2004.

[7] A. Santos, R. Wernersson, and L. J. Jensen, “Cyclebase 3.0: a multi-organism database on cell-cycle regulation and phenotypes,” Nucleic Acids Research, p. gku1092, 2014.

978-1-5386-8520-4/18/$31.00 ©2018 IEEE