
22/01/2013  


Research Advanced Course. Liverpool, January 2013

Principles of Statistical Data Processing

Marta Milo
University of Sheffield
Department of Biomedical Science

Outline

Morning – Marta Milo / Nicolò Fusi

09:00-10:30 (Marta Milo)

Part I – Introduction to data analysis:
• General principles
• Defining pipelines
• Selection of an appropriate statistical model

Part II – Low-level analysis of the data:
• Data normalization and removal of artifacts
• Diagnostics and initial visualization
• Gene expression estimation and data structure
• Differential expression analysis and clustering methods

10:30-11:00 Coffee break

11:00-11:45

Part III – High-level analysis of microarray data:
• Multiple sampling and determination of significance
• FDR rather than p-value as an approach to significance
• Confounding factors and uncertainty on gene expression

Outline (cont.)

11:45-12:30 (Steve Paterson) Introduction to high-throughput sequencing data analysis
12:30-13:30 Lunch, Madisons, Sherrington Building
13:30-14:30 Hands-on session: analysis of either microarray data or high-throughput sequencing data
14:30-15:00 Coffee break
15:00-16:30 Hands-on session (cont.)

Practicals (Steve Paterson)


1. Define your research questions, keeping in mind the limitations and the effective use of the data.

2. Be consistent in sample preparation, optimise the samples, and run extensive QC of the data. LOOK at the data generated and at the QC before processing.

3. Choose the correct model to analyse your data and define appropriate parameters (RNA-Seq analysis) to get the maximum information out of your data.

4. Use the best tool to visualise your data, to discriminate, cluster and rank your significant targets.

5. Use pathway analysis to define novel hypotheses that can be investigated with "specific tools", mathematical and experimental.

Importance of Sample preparation

It is certainly important to pause here and THINK about the nature of our samples.

Why is it so important for the data analysis?

RNA sample preparation for high-throughput assays

Experimental design: your RNA collection needs to reflect the biological questions you are asking

Optimise protocols:
• minimise technical errors
• minimise batch effects
• clean and pure sample to avoid contamination

Estimate the correct quantities:
• samples for validation
• optimisation of the protocols: avoid saturation or low quantification

Technical and biological replicates:
• ensure you have SOPs in place


RNA sample preparation for microarray assays

• RNA needs to be pure and intact

• Clean from genomic DNA and solvents

• Use RNA extraction protocols that are suitable for microarray assays

• Extensive QC at each step

• Consistent technical execution of the protocols

• If samples are extracted from cells/tissue that have a large component of "redundant RNA", make sure you deplete the sample of it (e.g. whole blood: hemoglobin)

• Make sure that the process of cleaning DOES NOT compromise the quality of the RNA

RNA sample preparation for microarray assays (cont.)

Make sure that the sample you can collect is sufficient to cover your experimental design:
• technical replication: ensure you have enough RNA
• biological replication: ensure random extraction to minimise batch effects

Complex protocol: make sure you have SOPs in place.

Consistency and minimisation of technical variation

RNA sample preparation for RNA-Seq

Be very specific about the conditions and the samples you are collecting

Identify the two main technical preparations:
• poly(A) enrichment
• ribosomal depletion

Extensive QC

Preparation of the library: see John Kenny's presentation


Key questions we need to consider

How much do I know about my system and the species I am studying?

How much do the available technologies know about my system?

What is my reference genome? How do I make it "sensitive" to my system?

Are the signals (reads) specific to my questions? If not, how do I adjust my experimental design so as to increase sensitivity in my predictions?

Is a high-throughput approach the correct approach for my research question?

Questions that we can leave open to DISCUSSION

A pearl of wisdom

It is worth spending some extra time and a few more control experiments before embarking on gene expression studies with microarrays and NGS.

The data is more interpretable and the predictions are possibly more robust, even if the number of significant targets appears to be small.

Predictions are often validated when they are robust.

Pipelines

All this variation, and the need to ensure that the analysis of the data is as reproducible as the experimental collection of samples, generate the need to define pipelines for the analysis of the data.

PIPELINES: Reproducible and robust protocols for numerical experimentation. In the case of biological data they are tailored to the system/organism under study.

HOW DO WE GET THEM?


Example of Pipelines

http://www.liv.ac.uk/genomic-research/bioinformatics/

Microarray pipeline

PART II

Choose the appropriate Statistical Model

Different platforms that generate gene expression data:
• Two- or one-color spotted cDNA arrays
• Affymetrix - new HJAY
• Illumina arrays
• Reads: RNA-Seq and other NGS assays

Interpret and analyse the data by first understanding where the data is coming from

How do I quantify the gene expression?

[Figure: DNA microarray. Source: Wikipedia, DNA microarrays]

• Image processing noise
• Sensitivity

[Figure: GeneChip® Array (1.28 cm) carrying ~1,000,000 different complementary oligonucleotides. Each 20 µm oligonucleotide element holds millions of copies of a specific oligonucleotide sequence, to which the single-stranded, labeled RNA sample hybridises. Inset: image of hybridised array. Slide courtesy of Affymetrix.]

J Biomol Tech. 2004 December; 15(4): 276-284.

• Relative expression
• Important to choose the reference
• Important to choose the experimental design

• Estimation of absolute expression
• No reference
• Not specific to the question you are asking
• Important: the design – data analysis


Probe Set Notation

_at: probe sets are predicted to perfectly match only a single transcript

_s_at: probe sets are predicted to perfectly match multiple transcripts, which may be from different genes

_x_at: probe sets will contain some probes that are identical or highly similar to other sequences. Hybridize uniformly across probe pairs to the intended target

HG_U133 Plus v2 Affymetrix GeneChip

The sequences from which these probe sets were derived were selected from GenBank®, dbEST, and RefSeq.

A single array contains more than 54,000 probe sets representing approximately 38,500 genes (estimated by UniGene coverage).

70 percent of the probe sets represent subcluster assemblies containing one or more non-EST sequences. Of the 16,737 EST-based probe sets, approximately 9,000 probe sets can now be associated with an mRNA or other non-EST sequence.

MM specific binding affinity

Now with new arrays: HJAY...

Affymetrix GeneChip HJAY

The HJAY array platform was designed using content from ExonWalk (C. Sugnet), Ensembl, and RefSeq databases (National Center for Biotechnology Information build 36).

It interrogates ~315,000 human transcripts from ~35,000 genes and contains ~260,000 junction (JUC) and ~315,000 exonic (PSR) probe sets. A fraction of the probe sets had non-unique locations in the human genome and were likely to give cross-hybridization signal.

Lapuk et al. retained only 501,557 probe sets from 23,546 transcript clusters.

STILL LOADS TO DO...


What's in the data?

How do we define a measure that best represents the absolute expression level of each gene on the chip?

[Figure: PM and MM intensities for probe pairs 1-16 of a probe set, two panels]

1. Summarise the probe intensities for each probe set to a single expression level
2. Estimate the variations introduced by:
   • background effect
   • probe affinity effect
3. Some PM/MM pairs are more reliable than others
4. The signal needs to be scaled before comparing data from different arrays

The approaches

Use single point statistics
Make use of the information we have to define values that estimate gene expression
• MAS 5
• RMA, GCRMA
• PLIER

Use a probabilistic approach
Make use of the observed data to estimate the function that has generated that data. Estimates of gene expression will be the most probable value that summarises the probe set
• PUMA

Microarray Suite (MAS5.0)

Signal ~ TukeyBiweight(log2(PM_j - IM_j))

• Signal = smoothed average over the PM/MM pairs representing a gene
• Signal is always positive: Absent - Present call
• Correction for global background, based on 16 sectors on each array
• Ideal mismatch (IM) intensity is calculated from the MM value and subtracted from PM:
  - if MM < PM then IM = MM
  - if MM > PM then IM = PM - correction value
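The signal formula above can be illustrated with a small sketch. This is a simplified illustration, not the Affymetrix implementation: the function names, the fixed `correction` parameter and the tuning constants `c = 5` and `eps = 1e-4` are assumptions made for the example, and the global background correction is left out.

```python
import math
from statistics import median


def ideal_mismatch(pm, mm, correction=1.0):
    """IM = MM when MM < PM, otherwise PM minus a (hypothetical) correction value."""
    return mm if mm < pm else pm - correction


def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight: a weighted mean that down-weights outliers."""
    m = median(values)
    s = median([abs(v - m) for v in values])  # median absolute deviation
    weights = []
    for v in values:
        u = (v - m) / (c * s + eps)
        weights.append((1 - u * u) ** 2 if abs(u) < 1 else 0.0)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def mas5_like_signal(pm, mm, correction=1.0):
    """Log2 signal for one probe set from its PM and MM intensities."""
    logs = [math.log2(p - ideal_mismatch(p, m, correction)) for p, m in zip(pm, mm)]
    return tukey_biweight(logs)
```

Because IM is always smaller than PM, the argument of the logarithm stays positive, which is why the signal is always defined.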


MAS5: p-value and calls

• First calculate the discriminant for each probe pair: R = (PM - MM)/(PM + MM)
• A one-sided Wilcoxon signed-rank test is used to compare R against the tau value and determine the p-value
• Present/Marginal/Absent calls are thresholded from the p-value above:
  - Present: p <= alpha1
  - Marginal: alpha1 < p < alpha2
  - Absent: p >= alpha2
• Default: alpha1 = 0.04, alpha2 = 0.06, tau = 0.015

Not very precise; accurate only when many replicates are available. Strongly dependent on MM. Uses linear scaling normalisation.
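The call procedure above can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are invented for the example, and the exact one-sided signed-rank p-value is obtained by enumerating all sign patterns, which is only practical for the small number of probe pairs in a probe set (16 pairs give 65,536 patterns).

```python
from itertools import product


def discriminant(pm, mm):
    """Discriminant score R = (PM - MM) / (PM + MM) for each probe pair."""
    return [(p - m) / (p + m) for p, m in zip(pm, mm)]


def midranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def signed_rank_p(diffs):
    """Exact one-sided p-value for H1: the differences tend to be positive."""
    d = [x for x in diffs if x != 0]
    if not d:
        return 1.0
    ranks = midranks([abs(x) for x in d])
    w_obs = sum(r for r, x in zip(ranks, d) if x > 0)
    hits = total = 0
    for signs in product((0, 1), repeat=len(d)):
        total += 1
        if sum(r for r, s in zip(ranks, signs) if s) >= w_obs:
            hits += 1
    return hits / total


def detection_call(pm, mm, tau=0.015, alpha1=0.04, alpha2=0.06):
    """Return (p-value, call) using the thresholds listed on the slide."""
    p = signed_rank_p([r - tau for r in discriminant(pm, mm)])
    if p <= alpha1:
        return p, "Present"
    if p < alpha2:
        return p, "Marginal"
    return p, "Absent"
```

With eight probe pairs whose PM is well above MM, only one of the 256 sign patterns is as extreme as the observed one, so the p-value is 1/256 and the call is Present.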

Robust Multi-array Average (RMA)

Signal ~ Tukey(log2(PM_j - bkgd_j))

• Subtract background for each array from PM
• Intensity-dependent normalisation of PM - Bkgd
• Log transform
• Robust multichip analysis of all PM reporters in the set using the Tukey median polish procedure
• Quantile normalisation: fit all the chips to the same distribution. Scale the chips so that they have the same mean.
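The median polish step can be sketched as follows. This is a minimal pure-Python sketch, not the RMA implementation: the function name, the iteration limit and the toy layout (rows are probes, columns are arrays, entries are log2 background-corrected PM values) are assumptions for illustration.

```python
from statistics import median


def median_polish(table, n_iter=10, tol=1e-8):
    """Fit table[i][j] ~ overall + row_eff[i] + col_eff[j] by sweeping out medians."""
    resid = [list(row) for row in table]
    n_rows, n_cols = len(resid), len(resid[0])
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(n_iter):
        # Sweep the row (probe) medians out of the residuals.
        row_med = [median(r) for r in resid]
        for i in range(n_rows):
            resid[i] = [v - row_med[i] for v in resid[i]]
            row_eff[i] += row_med[i]
        shift = median(col_eff)
        col_eff = [c - shift for c in col_eff]
        overall += shift
        # Sweep the column (array) medians out of the residuals.
        col_med = [median([resid[i][j] for i in range(n_rows)]) for j in range(n_cols)]
        for j in range(n_cols):
            for i in range(n_rows):
                resid[i][j] -= col_med[j]
            col_eff[j] += col_med[j]
        shift = median(row_eff)
        row_eff = [r - shift for r in row_eff]
        overall += shift
        if max(abs(v) for v in row_med + col_med) < tol:
            break
    return overall, row_eff, col_eff, resid


def probe_set_expression(table):
    """Per-array expression estimate: overall level plus the array effect."""
    overall, _, col_eff, _ = median_polish(table)
    return [overall + c for c in col_eff]
```

The row effects absorb probe affinity differences, so the per-array estimate is not dominated by a single bright or dim probe.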

RMA Assumptions:

1. Log-transformed, background-corrected expression values follow a linear model
2. The linear model is estimated by using a "median polish" algorithm
3. Needs replicates: used with groups of chips (>3); more chips are better
4. Assumes all chips have the same background and distribution of values
5. Does not use the MM probes, as (PM - MM*) leads to high variance
6. Ignoring MM decreases accuracy, increases precision


Robust Multi-array Average (GCRMA)

MM specific binding affinity: need to model that and include it in the estimation of the signal, using the GC content of the probes.

Background adjustment: based on the sequence-specific brightness in the PM probes.

It is a model-based approach: it defines a model, then tries to fit the experimental data to the model. It DOES need multiple samples!

Assumption: the input is a group of samples that have the same distribution of intensities.

Nature Biotechnology 22, 656-658 (2004) doi:10.1038/nbt0604-656b

Probe Logarithmic Intensity Error (PLIER)

The PLIER algorithm was developed by Affymetrix and released in 2004. It is part of several commercially available

Incorporates experimental observations of feature behavior.

It uses a probe affinity parameter, which represents the strength of a signal produced at a specific concentration for a given probe.

The probe affinities are calculated using data across arrays.

The error model employed by PLIER assumes error is proportional to observed intensity, rather than to background-subtracted intensity.

It assumes that the error of the mismatch probe is the reciprocal of the error of the perfect match probe.

Improved precision over MAS 5

puma: gMOS and multi-mgMOS

Models {PM, MM} distributions for each probe set

[Figure: probability distributions of the PM and MM values as a function of signal]

Gamma functions are used to model the positive probe intensities.

mgMOS: MM and PM are drawn from a joint probability distribution:

$$p(y_{gj}, m_{gj}) = \int db_{gj}\, p(b_{gj})\, p(y_{gj}, m_{gj} \mid a_g, \alpha_g, b_{gj})$$

The $b_{gj}$ are latent variables reflecting the different binding affinity of probes within the probe set.

mmgMOS: specific MM binding and multiple information across chips:

$$y_{gjc} \sim \mathrm{Ga}(a_{gc} + \alpha_{gc},\, b_{gj}), \qquad m_{gjc} \sim \mathrm{Ga}(a_{gc} + \phi\,\alpha_{gc},\, b_{gj})$$

where $y_{gjc}$ and $m_{gjc}$ represent, respectively, the PM and MM intensities of the j-th probe pair in the g-th probe set on the c-th chip (gamma distributions with the same inverse scale parameter $b_{gj}$, which is probe-pair specific).

The shape parameters of the two gamma distributions are the sum of the background term $a_{gc}$ and the true specific hybridization signal term $\alpha_{gc}$, which are probe-set and chip specific.

A computationally efficient method is used for estimating the posterior distribution of the signal: the posterior is unimodal and is approximated with a truncated Gaussian.


Normalisation

Why do we need to normalise the data?
1. We want to compare across chips
2. We need to ensure that all the data is equally compared across baseline within the chip

Most methods have a normalisation step incorporated; others need it to be performed after gene expression estimation:
• Scaling: mean and median
• Quantile
• Loess

Normalisation: scaling

Simply linearly scale the gene expression so that the overall mean / median are the same. The median is more scale-invariant, but for the most part there is little practical difference.

The assumption that mapping using quantiles or scaling is reasonable is based on the assumption that "most genes don't change", and quantiles use this more extensively than scaling. If this underlying assumption is doubtful, then using the above methods is not advisable.
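A minimal sketch of linear scaling is shown below. The function name and the choice of target (the average of the per-chip medians or means) are assumptions for the example; any common target value would serve.

```python
from statistics import mean, median


def scale_normalise(arrays, stat=median):
    """Rescale each chip so that its median (or mean) equals a common target.

    arrays: one list of expression values per chip.
    stat:   statistics.median or statistics.mean.
    """
    centres = [stat(chip) for chip in arrays]
    target = mean(centres)  # assumed common target level
    return [[v * target / c for v in chip] for chip, c in zip(arrays, centres)]
```

For example, `scale_normalise([[1, 2, 3], [2, 4, 6]])` brings both chips to a median of 3. Only one factor per chip is estimated, so intensity-dependent differences between chips are left untouched.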

Normalisation: quantile

Assume that the distributions of probe intensities should be completely the same across chips.
1. Start with n arrays and p probes, and form a [p, n] matrix X.
2. Sort the columns of X, so that the entries in a given row correspond to a fixed quantile.
3. Replace all entries in that row with their mean value.
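The procedure above can be sketched in a few lines. The sketch makes the usual final step explicit (each mean is written back to the position the original value came from) and, as a simplifying assumption, breaks ties by position instead of averaging them. The function name is invented for the example.

```python
def quantile_normalise(X):
    """Quantile-normalise a [p, n] matrix X (p probes as rows, n arrays as columns)."""
    p, n = len(X), len(X[0])
    columns = [[X[i][j] for i in range(p)] for j in range(n)]
    # For each array, the probe indices ordered from lowest to highest intensity.
    orders = [sorted(range(p), key=lambda i, col=col: col[i]) for col in columns]
    # Mean of the r-th smallest value across arrays, i.e. the mean of each sorted row.
    row_means = [sum(columns[j][orders[j][r]] for j in range(n)) / n for r in range(p)]
    out = [[0.0] * n for _ in range(p)]
    for j in range(n):
        for r, i in enumerate(orders[j]):
            out[i][j] = row_means[r]  # restore the original probe order
    return out
```

After this step every array has exactly the same set of values, so any remaining difference between arrays lies only in which probes occupy which quantile.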


Two-color array normalisation

Global normalisation methods assume the two dyes are related by a constant factor

Local normalisation methods assume that the dye factor is dependent on:
• Spot intensity (defined as A = (log R + log G)/2, the log of √(RG))
• Location on the array

Most common methods are:
• print-tip effect correction
• intensity dependent
• Loess

• Visualise the effect: M-A plot

• Correction of the intensity-dependent variations

LOESS normalisation

LOcally (W)Eighted polynomial regreSSion.

Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots. JASA 74, 829-836.

[Figure: M-A plot]

M = Adjusted Log Red - Adjusted Log Green

A = (Adjusted Log Green + Adjusted Log Red) / 2
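The M and A definitions and the intensity-dependent correction can be sketched as below. This is a simplified illustration, not Cleveland's full procedure: it fits a tricube-weighted local line around each spot and omits the robustness iterations. The function names and the `span` default are assumptions for the example.

```python
import math


def ma_values(red, green):
    """M = log2(R) - log2(G) and A = (log2(R) + log2(G)) / 2 for each spot."""
    m = [math.log2(r) - math.log2(g) for r, g in zip(red, green)]
    a = [(math.log2(r) + math.log2(g)) / 2 for r, g in zip(red, green)]
    return m, a


def local_linear_fit(a, m, span=0.5):
    """Fitted M at each A from a tricube-weighted local linear regression."""
    n = len(a)
    k = min(n, max(3, math.ceil(span * n)))  # number of neighbours per fit
    fitted = []
    for a0 in a:
        dist = [abs(ai - a0) for ai in a]
        h = sorted(dist)[k - 1] or 1.0  # bandwidth: distance to the k-th neighbour
        w = [(1 - min(d / h, 1.0) ** 3) ** 3 for d in dist]
        sw = sum(w)
        a_bar = sum(wi * ai for wi, ai in zip(w, a)) / sw
        m_bar = sum(wi * mi for wi, mi in zip(w, m)) / sw
        sxx = sum(wi * (ai - a_bar) ** 2 for wi, ai in zip(w, a))
        sxy = sum(wi * (ai - a_bar) * (mi - m_bar) for wi, ai, mi in zip(w, a, m))
        slope = sxy / sxx if sxx > 1e-12 else 0.0
        fitted.append(m_bar + slope * (a0 - a_bar))
    return fitted


def loess_normalise(red, green, span=0.5):
    """Return (normalised M, A): M with the fitted intensity trend removed."""
    m, a = ma_values(red, green)
    trend = local_linear_fit(a, m, span)
    return [mi - ti for mi, ti in zip(m, trend)], a
```

After normalisation the M values are centred on zero along the whole range of A, which is what a corrected M-A plot should show when most genes do not change.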


Benchmarks and comparisons

http://affycomp.biostat.jhsph.edu/
Affycomp III: A Benchmark for Affymetrix GeneChip Expression Measures

Alternatives to microarrays:
• NGS gene expression quantification (e.g. RNA-seq)
• Affymetrix GeneChip Junction Arrays

There are no established sites and platforms for this yet. Benchmark datasets are available.

SEQanswers, the NGS community: http://seqanswers.com/forums/showthread.php?t=10797

The MicroArray Quality Control (MAQC)-II study, Nat. Biotech (2010)

For NGS gene expression quantification...

A lot changes ... but

QUANTIFICATION
• Estimation of gene expression
• Alternative splicing identification
• Alternative isoform detection
• Transcript abundance

NORMALISATION
• Normalise read counts
• Normalise reads between lanes
• Normalise reads against transcript abundance and gene length
• Varying sequencing depth
• Other technical effects

HIGH LEVEL ANALYSIS
• Visualise and interpret
• Differential expression
• Clustering
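As one illustration of normalising read counts for gene length and sequencing depth, the sketch below computes RPKM (reads per kilobase of transcript per million mapped reads). RPKM is a common measure chosen here for the example; the slides do not name a specific method, and the function name is an assumption.

```python
def rpkm(counts, lengths_bp):
    """RPKM for each gene: count * 1e9 / (gene length in bp * total mapped reads)."""
    total = sum(counts)
    return [c * 1e9 / (length * total) for c, length in zip(counts, lengths_bp)]
```

A gene that is twice as long and collects twice as many reads ends up with the same RPKM, which is the point of the length correction.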

Data Visualisation

[Figure, three panels. Scatter plot: Cy3 versus Cy5 for Slide 1 and Slide 2. Box plot: minimum, Q1 = 25th percentile, median, Q3 = 75th percentile, maximum. MA plot: log fold change versus log abundance.]


Principal Component Analysis

• It is one of the most commonly used techniques to visualise and interpret high-dimensional data
• It identifies the maximum spread of the data, maximising the variance by rotating the space where the data lives
• It uses a set of variables that are hidden to the user and are implicitly explained by the data (latent variables)
• Every direction found that extracts informative features from the "noisy" cloud of data points is called a principal component
• Dimensionality reduction
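The idea of finding the direction of maximum variance can be sketched without any library. This is a minimal sketch that finds only the first principal component by power iteration on the covariance matrix; the function name and iteration count are assumptions, and a real analysis would normally use an SVD from a numerical library.

```python
import math


def first_principal_component(data, n_iter=200):
    """Return (direction, variance along it) for rows = samples, columns = variables."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centred = [[row[j] - means[j] for j in range(d)] for row in data]
    cov = [[sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0 / math.sqrt(d)] * d  # starting direction
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    variance = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    return v, variance
```

Projecting each centred sample onto the returned direction gives its score on PC1, which is the coordinate plotted on the first axis of a PCA plot.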

Principal Component Analysis (cont...)

[Figure: PCA plot, PC1 versus PC2. Source: Y Haile-Selassie et al. Nature 483, 565-569 (2012) doi:10.1038/nature10922]

• usually reasonable, but it assumes that the uncertainty associated with each gene is constant
• non-linear transformation of gene expression (Huber et al. 2002), PUMA PCA (Sanguinetti et al., 2005)

Clustering

• basic idea: group together genes that have a similar pattern of expression across conditions or across time

• what do we mean by similar?

• different measures of similarity: Euclidean distance, angle between vectors, correlation coefficient, ...

• Shared pattern of expression might be associated with similar functions
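The three similarity measures listed above can be written out directly. The function names are assumptions for the example; each takes two expression profiles of equal length.

```python
import math


def euclidean_distance(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def angle_between(x, y):
    """Angle in radians between two profiles seen as vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return math.acos(max(-1.0, min(1.0, dot / (nx * ny))))


def pearson_correlation(x, y):
    """Pearson correlation coefficient between two profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```

The profiles `[1, 2, 3]` and `[10, 20, 30]` have a correlation of 1 and an angle of 0 but a large Euclidean distance, so the choice of measure decides whether genes with the same shape but different magnitude count as similar.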


Hierarchical Clustering

• builds a hierarchy of clusters

• bottom up (merging clusters) or top down (splitting clusters)

• Eisen et al. (1998). The genes that are most correlated are joined together, the expression value for the resulting node is the average expression of the two (or more) genes. The similarity matrix is then updated with the new node.

• Different similarity measures lead to different interpretations
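The bottom-up procedure described above can be sketched as follows. This is a minimal illustration in the spirit of that description, not the published implementation: the function names, the dictionary input and the node labels are assumptions, and the pairwise search is the simple quadratic one.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns 0.0 when a profile has no variation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    denom = math.sqrt(sxx * syy)
    return sxy / denom if denom > 0 else 0.0


def hierarchical_cluster(profiles):
    """Merge the most correlated pair until one node is left.

    profiles: dict mapping gene name -> expression profile (list of numbers).
    Returns the list of merges as (node_a, node_b, correlation).
    """
    nodes = {name: (list(values), 1) for name, values in profiles.items()}
    merges = []
    while len(nodes) > 1:
        names = list(nodes)
        best = None
        for i in range(len(names)):
            for j in range(i + 1, len(names)):
                r = pearson(nodes[names[i]][0], nodes[names[j]][0])
                if best is None or r > best[0]:
                    best = (r, names[i], names[j])
        r, a, b = best
        va, na = nodes.pop(a)
        vb, nb = nodes.pop(b)
        # The new node is the average profile of all the genes it contains.
        merged = [(x * na + y * nb) / (na + nb) for x, y in zip(va, vb)]
        nodes["(" + a + "," + b + ")"] = (merged, na + nb)
        merges.append((a, b, r))
    return merges
```

The order of the merges is the hierarchy: early merges join tightly correlated genes, and the correlation recorded at each merge can be used as the height of the corresponding branch in a dendrogram.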

The Fold Change

Given two gene expression values x and y, the fold change is defined as

FC = x / y

Given two vectors x_j and y_j of gene expression measurements for controls and cases for gene j, the fold change is defined as

FC_j = μ(x_j) / μ(y_j)

where μ(·) is the mean over the samples in each group. It can also appear as a difference.
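A short sketch of both forms of the fold change follows. The function names are assumptions; the "difference" form is taken here to be the difference of mean log2 values, which is the usual reading when data are on a log scale.

```python
import math
from statistics import mean


def fold_change(x, y):
    """Ratio of the group means, FC = mean(x) / mean(y)."""
    return mean(x) / mean(y)


def log2_fold_change(x, y):
    """Difference of the mean log2 expression values of the two groups."""
    return mean([math.log2(v) for v in x]) - mean([math.log2(v) for v in y])
```

For example, `fold_change([2, 2, 20], [2, 2, 2])` is 4 even though two of the three samples are identical to the other group: with few samples a single outlier drives the ratio.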

Differential expression Analysis

GOAL: Identify the most differentially expressed genes across different conditions or between cases and controls.

HOW: identify a threshold that "defines" differential expression
• FC values
• p-values
We need to quantify the False Discovery Rate

What happens if the sample size is small?
• The fold change becomes very sensitive to outliers
• The t-test becomes very sensitive to small variances
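One standard way to quantify the False Discovery Rate from a list of per-gene p-values is the Benjamini-Hochberg adjustment, sketched below. The slides do not name a specific FDR method, so this choice and the function name are assumptions for illustration.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted
```

Calling genes with an adjusted value below 0.05 significant aims to keep the expected proportion of false positives among the called genes at or below 5%, which is a different guarantee from applying a 0.05 threshold to each raw p-value.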


Part III

The problem of Multiple sampling