making sense out of flow cytometry data overload

Post on 06-Feb-2016

© 2010 by University of Pennsylvania School of Medicine

Making Sense out of Flow Cytometry Data Overload

A crash course in R/Bioconductor and flow cytometry fingerprinting

Outline

• Background
  • R
  • Bioconductor
• Motivating examples
• Starting R, entering commands
• How to get help
• R fundamentals
  • Sequences and Repeats
  • Characters and Numbers
  • Vectors and Matrices
  • Data Frames and Lists
  • Importing data from spreadsheets
• flowCore
  • Loading flow cytometry (FCS) data
  • gating
  • compensation
  • transformation
  • visualization
• flowFP
  • Binning
  • Fingerprinting
  • Comparing multivariate distributions
• Writing your own functions
• Installing and running R on your computer
• Suggestions for further reading and reference

Background

• R
  • Is an integrated suite of software facilities for data manipulation, simulation, calculation and graphical display.
  • It handles and analyzes data very effectively and it contains a suite of operators for calculations on arrays and matrices.
  • In addition, it has the graphical capabilities for very sophisticated graphs and data displays.
  • It is an elegant, object-oriented programming language.
  • Started by Robert Gentleman and Ross Ihaka (hence “R”) in 1995 as a free, independent, open-source implementation of the S programming language (whose commercial version, S-PLUS, is now part of Spotfire).
  • Currently maintained by the R Core development team, an international group of hard-working volunteer developers.
  • http://www.r-project.org
  • http://cran.r-project.org/doc/contrib/Owen-TheRGuide.pdf

Background

• Bioconductor
  • “Is an open source and open development software project to provide tools for the analysis and comprehension of genomic data.”
  • Goals
    • To provide widespread access to a broad range of powerful statistical and graphical methods for the analysis of genomic data.
    • To provide a common software platform that enables the rapid development and deployment of extensible, scalable, and interoperable software.
    • To further scientific understanding by producing high-quality documentation and reproducible research.
    • To train researchers on computational and statistical methods for the analysis of genomic data.
  • http://bioconductor.org/overview

A motivating example

I’ve just collected data from a T cell stimulation experiment in a 96-well plate format. I need to gate the data on CD3/CD4. How consistent are the distributions, so that I can establish one set of gates for the whole plate?

Another motivating example

I’m concerned that drawing gates to analyze my data introduces unintended bias. Additionally, since I have multiple data files, drawing multiple gates is time consuming. Can I use R to compute gates and then apply these same objective gating criteria to multiple data files?

• Autogate lymphocytes and monocytes
• Automatically analyze FMO tubes

Back to the basics

• R is a command-line driven program
  • the prompt is: >
  • you type a command (shown in blue), and R executes the command and gives the answer (shown in black)

Simple example: enter a set of measurements

• Use the function c() to combine terms together
• Create a variable named mfi
• Put the result of c() into mfi using the assignment operator <- (you can also use =)
• The [1] at the start of the output is the index of the first element printed on that line; the result is a vector
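The steps above can be sketched as follows. The measurement values are made up for illustration; only c(), the variable name mfi and the assignment operator come from the slide.

```r
# Combine four hypothetical mean fluorescence intensities into one vector
mfi <- c(512, 1480, 233, 2075)

# Typing the variable name prints it; [1] marks the index of the first element shown
mfi
#> [1]  512 1480  233 2075

# Once stored, the vector can be passed to other functions
mean(mfi)
#> [1] 1075
```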

Help, functions, polymorphism

> help (log)

> ?log

> apropos("log")

Vignettes – really good help!
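A small sketch of the help tools and of what polymorphism means in practice. The two example vectors are invented; grid is used for the vignette listing only because it ships with R, so treat that choice as an assumption about your installation.

```r
# Search for every object whose name contains "log"
hits <- apropos("log")
head(hits)

# Polymorphism: the same function behaves differently depending on the class of its argument
x_num <- c(2.5, 10, 400)
x_chr <- c("CD3", "CD4", "CD8")
summary(x_num)   # five-number summary plus the mean
summary(x_chr)   # length, class and mode

# List the class-specific versions ("methods") of summary that are currently available
methods(summary)

# List the vignettes that come with a package (here: grid)
v <- vignette(package = "grid")
v$results[, "Item"]
```

To open a particular vignette, pass its name and package, for example vignette("grid", package = "grid").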

Sequences and Repeats
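A minimal sketch of the usual ways to build sequences and repeats. The 96-well plate naming at the end is an illustrative assumption that ties back to the motivating example, not something shown on the slide.

```r
# The colon operator makes integer sequences
1:10

# seq() gives control over the step size or the number of points
seq(0, 1, by = 0.25)           # 0.00 0.25 0.50 0.75 1.00
seq(1, 100, length.out = 4)    # 1 34 67 100

# rep() repeats a whole vector (times) or each element in turn (each)
rep(c("A", "B"), times = 3)    # "A" "B" "A" "B" "A" "B"
rep(c("A", "B"), each = 3)     # "A" "A" "A" "B" "B" "B"

# Combining the two: well names for a 96-well plate, A1 ... H12
wells <- paste0(rep(LETTERS[1:8], each = 12), rep(1:12, times = 8))
length(wells)                  # 96
```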

Characters and Numbers

• Characters and character strings are enclosed in "" or ''
• Special numbers
  • NA – “Not Available”
  • Inf – “Infinity”
  • NaN – “Not a Number”
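A short sketch of how these values arise and how to handle them. The variable names and numbers are illustrative only.

```r
# Character strings: double or single quotes both work
marker <- "CD4"
class(marker)            # "character"

# Special numbers
1 / 0                    # Inf
0 / 0                    # NaN

# NA marks a missing measurement; most summaries return NA unless told to skip it
x <- c(1.5, NA, 3)
is.na(x)                 # FALSE TRUE FALSE
mean(x)                  # NA
mean(x, na.rm = TRUE)    # 2.25
```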

Vectors and Matrices

• The subset operator for vectors and matrices is [ ]
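A minimal sketch of [ ] on a vector and on a matrix, with made-up values.

```r
v <- c(10, 20, 30, 40, 50)
v[2]            # one element: 20
v[c(1, 5)]      # several elements: 10 50
v[-1]           # everything except the first: 20 30 40 50
v[v > 25]       # logical subsetting: 30 40 50

# Matrices take [row, column]; leaving one side empty selects all of it
m <- matrix(1:6, nrow = 2)   # filled column by column
m[2, 3]         # 6
m[1, ]          # first row: 1 3 5
m[, 2]          # second column: 3 4
```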

Vectors and Matrices

• You can extend the length of a vector via subsetting

… but not a matrix
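A sketch of both behaviours. try() is used here only so that the failed matrix assignment does not stop the script.

```r
# Assigning past the end of a vector extends it; the gap is filled with NA
v <- 1:3
v[5] <- 10L
v                                   # 1 2 3 NA 10

# The same trick on a matrix is an error (subscript out of bounds)
m <- matrix(1:4, nrow = 2)
result <- try(m[3, 1] <- 99L, silent = TRUE)
inherits(result, "try-error")       # TRUE
dim(m)                              # still 2 2
```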

Vectors and Matrices

• However, all’s not lost if you want to extend either the columns …

… or rows
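One way to do this is with cbind() and rbind(), sketched below with made-up values. Both return a new matrix rather than changing the original.

```r
m <- matrix(1:4, nrow = 2)

# Add a column
wider <- cbind(m, c(5, 6))
dim(wider)      # 2 3

# Add a row
taller <- rbind(m, c(7, 8))
dim(taller)     # 3 2
```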

Data Frames

• A Data Frame is like a matrix, except that the data type in each column need not be the same
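A minimal sketch of a Data Frame built by hand. The column names and values are hypothetical.

```r
# Three columns with three different data types
plate <- data.frame(
  well       = c("A1", "A2", "A3"),
  stimulated = c(TRUE, FALSE, TRUE),
  mfi        = c(512, 233, 1480),
  stringsAsFactors = FALSE
)

str(plate)                   # shows the type of each column
plate$mfi                    # pull out one column by name
plate[plate$stimulated, ]    # keep only the stimulated wells
```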

Often, a Data Frame is created from an Excel spreadsheet using the function read.table(): in Excel, use Save As… to write a tab-delimited text file, then read that file into R.

Data Frames from spreadsheets
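A sketch of the round trip. To keep the example self-contained, a small tab-delimited file is written to a temporary location first; in practice that file would be the one saved from Excel, and its column names here are invented.

```r
# Stand-in for a spreadsheet saved as tab-delimited text
tab_file <- tempfile(fileext = ".txt")
writeLines(c("well\tstimulus\tmfi",
             "A1\tnone\t233",
             "A2\tanti-CD3\t1480"), tab_file)

# header = TRUE uses the first row as column names; sep = "\t" splits on tabs
plate <- read.table(tab_file, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
plate
str(plate)
```

read.delim() is a shortcut with the same tab-delimited defaults.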

Lists
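A minimal sketch of a list, which can hold elements of different types and sizes. The names and contents are hypothetical.

```r
sample_info <- list(
  id      = "A1",
  markers = c("CD3", "CD4"),
  counts  = matrix(1:4, nrow = 2)
)

names(sample_info)              # "id" "markers" "counts"
sample_info$id                  # access by name
sample_info[["markers"]][2]     # [[ ]] returns the element itself: "CD4"
class(sample_info["id"])        # [ ] returns a smaller list: "list"
```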

Handling Flow Cytometry Data: flowCore

• flowCore is a base package that supports reading and manipulation of FCS data files

• The fundamental object that encapsulates the data in an FCS file is a flowFrame

• A container object that holds a collection of flowFrames is called a flowSet

• In the next slides we will go over
  • reading an FCS file
  • gating
  • compensation
  • transformation
  • visualization

Check out the example data

Read an FCS file, summarize the flowFrame
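A sketch of this step, assuming the flowCore package is installed and using the example FCS file that ships with it (the file name is taken from the flowCore documentation, not from the slide). The block does nothing if flowCore is missing.

```r
if (requireNamespace("flowCore", quietly = TRUE)) {
  library(flowCore)

  # Locate an example FCS file bundled with flowCore
  fcs_file <- system.file("extdata", "0877408774.B08", package = "flowCore")

  # read.FCS() returns a flowFrame
  ff <- read.FCS(fcs_file)
  print(ff)

  # Per-parameter summary statistics
  print(summary(ff))

  # The raw event matrix: one row per event, one column per parameter
  print(dim(exprs(ff)))
}
```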

Apply the lymphocyte gate with Subset
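A sketch of gating with Subset, assuming flowCore is installed. It uses the GvHD example flowSet from flowCore, and the gate limits are arbitrary placeholders rather than the gate drawn on the slide.

```r
if (requireNamespace("flowCore", quietly = TRUE)) {
  library(flowCore)
  data(GvHD, package = "flowCore")
  ff <- GvHD[[1]]                       # first flowFrame in the flowSet

  # Rectangular scatter gate; the limits are hypothetical
  lymph_gate <- rectangleGate(filterId = "lymphs",
                              "FSC-H" = c(200, 600),
                              "SSC-H" = c(50, 400))

  # filter() reports how many events fall inside the gate
  res <- filter(ff, lymph_gate)
  print(summary(res))

  # Subset() returns a new flowFrame containing only the gated events
  lymphs <- Subset(ff, lymph_gate)
  print(nrow(lymphs))
}
```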

needs to be transformed because it is rendering the linear data in the FCS file

hasn’t been compensated!
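The arithmetic behind these two steps can be sketched in base R. This is not the flowCore implementation (flowCore provides compensate() and transform() for real data); the spillover matrix, the simulated signals and the asinh cofactor of 150 are all assumptions chosen for illustration.

```r
# Hypothetical spillover matrix: rows = fluorochromes, columns = detectors
S <- matrix(c(1.00, 0.15,
              0.05, 1.00), nrow = 2, byrow = TRUE,
            dimnames = list(c("FITC", "PE"), c("FL1", "FL2")))

# Simulated true signal for five events
set.seed(1)
true_signal <- cbind(FITC = rexp(5, rate = 1 / 1000),
                     PE   = rexp(5, rate = 1 / 200))

# Each detector records its own dye plus spillover from the other
observed <- true_signal %*% S

# Compensation: multiply by the inverse of the spillover matrix
compensated <- observed %*% solve(S)

# Transformation: compress the wide linear range for display (asinh is one common choice)
displayed <- asinh(compensated / 150)
displayed
```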

• Lines require library(fields)

• Percentages are in summary(fres)$p[1:4]

• Percentages are drawn in the graph with text()

Fingerprinting Flow Cytometry Data: flowFP

• flowFP aims to transform flow cytometric data into a form amenable to algorithmic analysis tools
  • Acts as an intermediate step between acquisition of high-throughput FCM data and empirical modeling, machine learning and knowledge discovery
  • Implements ideas from
    • Roederer M, Moore W, Treister A, Hardy RR & Herzenberg LA. Probability binning comparison: a metric for quantitating multivariate distribution differences. Cytometry 45:47-55, 2001.
    • Rogers WT, Moser AR, Holyst HA, Bantly A, Mohler ER III, Scangas G, and Moore JS. Cytometric Fingerprinting: Quantitative Characterization of Multivariate Distributions. Cytometry 73A:430-441, 2008.

The basic idea

• Subdivide multivariate space into bins
  • Call this a “model” of the space
• For each flowFrame in a flowSet, count the number of events in each bin in the model
• Flatten the collection of counts for a flowFrame into a 1D feature vector
• Combine all of the feature vectors together into an n x m matrix
  • n = number of flowFrames (instances)
  • m = number of bins in the model (features)
• Also, tag each event with its bin membership
  • facilitates visualization, interpretation
  • can be used for gating
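The idea can be sketched in base R on simulated data. This is a simplified illustration, not the flowFP code: it assumes each bin is split at the median of whichever parameter has the larger variance, and all function names (fit_model, apply_model, fingerprint) are hypothetical.

```r
# Send each event to the lower (2b - 1) or upper (2b) child of its current bin b
assign_level <- function(x, bins, rule) {
  value <- x[cbind(seq_len(nrow(x)), rule$param[bins])]
  2L * bins - (value <= rule$cut[bins])
}

# Build the model: at each level, split every bin at the median of its widest parameter
fit_model <- function(x, n_recursions) {
  bins <- rep(1L, nrow(x))
  model <- vector("list", n_recursions)
  for (level in seq_len(n_recursions)) {
    n_bins <- 2^(level - 1)
    param <- integer(n_bins)
    cut <- numeric(n_bins)
    for (b in seq_len(n_bins)) {
      sub <- x[bins == b, , drop = FALSE]
      param[b] <- which.max(apply(sub, 2, var))
      cut[b] <- median(sub[, param[b]])
    }
    model[[level]] <- list(param = param, cut = cut)
    bins <- assign_level(x, bins, model[[level]])
  }
  model
}

# Tag each event with its bin membership
apply_model <- function(x, model) {
  bins <- rep(1L, nrow(x))
  for (rule in model) bins <- assign_level(x, bins, rule)
  bins
}

# Flatten the per-bin event counts into a 1D feature vector
fingerprint <- function(x, model) {
  tabulate(apply_model(x, model), nbins = 2^length(model))
}

# Three simulated two-parameter samples; the third is shifted
set.seed(42)
samples <- list(matrix(rnorm(800), ncol = 2),
                matrix(rnorm(800), ncol = 2),
                matrix(rnorm(800, mean = 1), ncol = 2))

model <- fit_model(do.call(rbind, samples), n_recursions = 3)   # 8 bins
features <- t(sapply(samples, fingerprint, model = model))      # n x m matrix
features
```

Each row of features is one sample and each column is one bin. Because the model was fit on the pooled events, every bin holds the same number of pooled events, so a sample whose row departs from a flat profile differs from the pooled distribution.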

Probability Binning

[Figures not reproduced; axis label: Bin Number]

> plot (mod, fs)

Class Constructors

• flowFPModel (base class)
  • Consumes a flowFrame or flowSet
  • Produces a model, which is a recipe for subdividing multivariate space
• flowFP
  • Consumes a flowFrame or flowSet, and a flowFPModel
  • Produces a flowFP, which represents the multivariate probability density function as a fingerprint
  • Also tags each event with its bin membership
• flowFPPlex
  • Consumes a collection of flowFPs
  • The flowFPPlex is a container object to facilitate handling large and complex collections of flowFPs
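A sketch of the first two constructors, assuming both flowCore and flowFP are installed. The argument names (name, parameters, nRecursions) and the counts() accessor are written from memory of the flowFP documentation and should be checked against ?flowFPModel and ?flowFP; the choice of the GvHD data and of the scatter parameters is arbitrary. The block does nothing if either package is missing.

```r
if (requireNamespace("flowCore", quietly = TRUE) &&
    requireNamespace("flowFP", quietly = TRUE)) {
  library(flowCore)
  library(flowFP)

  # Example flowSet from flowCore; keep the first four flowFrames
  data(GvHD, package = "flowCore")
  fs <- GvHD[1:4]

  # Model: a recipe for subdividing the FSC/SSC space into 2^5 = 32 bins
  mod <- flowFPModel(fs, name = "scatter model",
                     parameters = c("FSC-H", "SSC-H"), nRecursions = 5)

  # Fingerprint: apply the model to every flowFrame in the flowSet
  fp <- flowFP(fs, mod)

  # Feature matrix: one row per flowFrame, one column per bin
  m <- counts(fp)
  print(dim(m))
}
```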

Writing Your Own Functions

#
# It's a good idea to comment your code        (comments)
#
myfunc <- function (arg1=10, arg2, ...)        # declaration
{                                              # start of code block
    # your code goes here
    answer <- log (arg1, base=arg2)            # assignment

    return (answer)                            # return
}                                              # end of code block
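A short usage sketch for the function above, repeated here so the example runs on its own. The argument values are arbitrary.

```r
myfunc <- function(arg1 = 10, arg2, ...) {
  answer <- log(arg1, base = arg2)
  return(answer)
}

# Positional arguments: log of 8 to base 2
myfunc(8, 2)

# Named argument only: arg1 falls back to its default of 10
myfunc(arg2 = 10)

# arg2 has no default, so leaving it out is an error
res <- try(myfunc(8), silent = TRUE)
inherits(res, "try-error")    # TRUE
```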

Obtaining R and Bioconductor

• R http://cran.r-project.org/

• Bioconductor http://bioconductor.org/GettingStarted

General Reference Material

• A good beginner’s guide to R
  • http://cran.r-project.org/doc/contrib/Owen-TheRGuide.pdf
• A nice one-page reference card
  • http://cran.r-project.org/doc/contrib/Short-refcard.pdf
• Outstanding summary of R/Bioconductor, with many examples
  • http://manuals.bioinformatics.ucr.edu/home/R_BioCondManual#R_favorite
• The definitive reference for writing R extensions (advanced!)
  • http://cran.r-project.org/doc/manuals/R-exts.pdf
• Books
  • William N. Venables and Brian D. Ripley. Modern Applied Statistics with S. Fourth Edition. Springer, New York, 2002. ISBN 0-387-95457-0.
  • John M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4 (aka “the Green Book”)

Flow-Specific References

• Vignettes
  • http://bioconductor.org/packages/2.6/bioc/vignettes/flowCore/inst/doc/HowTo-flowCore.pdf
  • http://bioconductor.org/packages/2.6/bioc/vignettes/flowViz/inst/doc/filters.pdf
  • http://bioconductor.org/packages/2.6/bioc/vignettes/flowStats/inst/doc/GettingStartedWithFlowStats.pdf
  • http://bioconductor.org/packages/2.6/bioc/vignettes/flowQ/inst/doc/DataQualityAssessment.pdf
  • http://bioconductor.org/packages/2.6/bioc/vignettes/flowFP/inst/doc/flowFP_HowTo.pdf
• Original Articles
  • flowCore
    • Hahne, F., N. LeMeur, et al. (2009). "flowCore: a Bioconductor package for high throughput flow cytometry." BMC Bioinformatics 10: 106.
  • Fingerprinting
    • Rogers, W. T., A. R. Moser, et al. (2008). "Cytometric fingerprinting: quantitative characterization of multivariate distributions." Cytometry A 73(5): 430-41.
    • Rogers, W. T. and H. A. Holyst (2009). "flowFP: A Bioconductor Package for Fingerprinting Flow Cytometric Data." Advances in Bioinformatics 2009(Article ID 193947): 11.

Contact Me!

Wade Rogers
rogersw@mail.med.upenn.edu
267-350-9680 (o)
610-368-5821 (m)