Algorithmic and Statistical Perspectives on Large-Scale Data Analysis ... and Implications for Scientific UQ/ML. Michael W. Mahoney, ICSI and Dept of Statistics, UC Berkeley. June 2018. (For more info, see: http://www.stat.berkeley.edu/~mmahoney)


Page 1 (hyperion.usc.edu/UQ-SciML-2018/Mahoney.pdf)

Algorithmic and Statistical Perspectives on Large-Scale Data Analysis ...

and Implications for Scientific UQ/ML

Michael W. Mahoney

ICSI and Dept of Statistics, UC Berkeley

June 2018

(For more info, see: http://www.stat.berkeley.edu/~mmahoney)

   

Page 2

BIG data??? MASSIVE data????

NYT, Feb 11, 2012: “The Age of Big Data”

• “What is Big Data? A meme and a marketing term, for sure, but also shorthand for advancing trends in technology that open the door to a new approach to understanding the world and making decisions. …”

Why are big data big?

• Generate data at different places/times and different resolutions

• Factor of 10 more data is not just more data, but different data

 

Page 3

BIG data??? MASSIVE data????

MASSIVE data:

• Internet, Customer Transactions, Astronomy/HEP = “Petascale”

• One Petabyte = watching 20 years of movies (HD) = listening to 20,000 years of MP3 (128 kbits/sec) = way too much to browse or comprehend

massive data:

• 10^5 people typed at 10^6 DNA SNPs; 10^6 or 10^9 node social network; etc.

In either case, main issues:

• Memory management issues, e.g., push computation to the data

• Hard to answer even basic questions about what data “looks like”

 

Page 4

Thinking about large-scale data

Data generation is the modern version of the microscope/telescope*:

• See things we couldn't see before: e.g., fine-scale movement of people; fine-scale clicks and interests; fine-scale tracking of packages; fine-scale measurements of temperature, chemicals, etc.

• Those inventions ushered in new scientific eras, new understanding of the world, and new technologies to do stuff

*But algorithms are also a “microscope/telescope” to probe data.

Easy things become hard and hard things become easy:

• Easier to see the other side of the universe than the bottom of the ocean

• Means, sums, medians, correlations are easy with small data

Our ability to generate data far exceeds our ability to extract insight from data.

Page 5

How do we view BIG data?

Page 6

Algorithmic & Statistical Perspectives ...

Computer Scientists

• Data: are a record of everything that happened.

• Goal: process the data to find interesting patterns and associations.

• Methodology: Develop approximation algorithms under different models of data access since the goal is typically computationally hard.

Statisticians (and Natural Scientists, etc)

• Data: are a particular random instantiation of an underlying process describing unobserved patterns in the world.

• Goal: is to extract information about the world from noisy data.

• Methodology: Make inferences (perhaps about unseen events) by positing a model that describes the random variability of the data around the deterministic model.

Lambert (2000), Mahoney (2010)

Page 7

... are VERY different paradigms

Statistics, natural sciences, scientific computing, etc:

• Problems often involve computation, but the study of computation per se is secondary

• Only makes sense to develop algorithms for well-posed* problems

• First, write down a model, and think about computation later

Computer science:

• Easier to study computation per se in discrete settings, e.g., Turing machines, logic, complexity classes

• Theory of algorithms divorces computation from data

• First, run a fast algorithm, and ask what it means later

*Solution exists, is unique, and varies continuously with input data

Page 8

Anecdote 1: Randomized Matrix Algorithms

How to “bridge the gap”?

• decouple (implicitly or explicitly) randomization from linear algebra

• importance of statistical leverage scores!

Theoretical origins

• theoretical computer science, convex analysis, etc.

• Johnson-Lindenstrauss

• Additive-error algs

• Good worst-case analysis

• No statistical analysis

• No implementations

Practical applications

• NLA, ML, statistics, data analysis, genetics, etc

• Fast JL transform

• Relative-error algs

• Numerically-stable algs

• Good statistical properties

• Beats LAPACK & parallel-distributed implementations on terabytes of data

Mahoney “Algorithmic and Statistical Perspectives on Large-Scale Data Analysis” (2010)

Mahoney “Randomized Algorithms for Matrices and Data” (2011)

Page 9

Anecdote 2: Communities in large informatics graphs

People imagine social networks to look like: [figure]

Real social networks actually look like: [figure]

Size-resolved conductance (degree-weighted expansion) plot looks like: [figure]

Data are expander-like at large size scales !!!

There do not exist good large clusters in these graphs !!!

How do we know this plot is “correct”?

• (since computing conductance is intractable)

• Lower Bound Result; Structural Result; Modeling Result; Etc.

• Algorithmic Result (ensembles of sets returned by different approximation algorithms are very different)

• Statistical Result (Spectral provides more meaningful communities than flow)

Mahoney “Algorithmic and Statistical Perspectives on Large-Scale Data Analysis” (2010)

Leskovec, Lang, Dasgupta, & Mahoney “Community Structure in Large Networks ...” (2009)
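To make the quantity behind the size-resolved conductance plot concrete, here is a minimal sketch that only evaluates the conductance of a given node set, using the usual definition (cut weight divided by the smaller of the two volumes). The function names and the toy two-clique graph are illustrative assumptions, not taken from the slides, and the sketch does not attempt the hard part, which is searching for low-conductance sets at each size.

```python
import numpy as np

def conductance(adj, nodes):
    """Conductance of `nodes` in an undirected graph with symmetric adjacency `adj`.

    phi(S) = cut(S, complement) / min(vol(S), vol(complement)),
    where vol is the sum of degrees (degree-weighted size).
    """
    n = adj.shape[0]
    in_set = np.zeros(n, dtype=bool)
    in_set[list(nodes)] = True
    degrees = adj.sum(axis=1)
    vol_s = degrees[in_set].sum()
    vol_rest = degrees[~in_set].sum()
    cut = adj[in_set][:, ~in_set].sum()  # total weight of edges leaving the set
    return cut / min(vol_s, vol_rest)

def two_cliques(k):
    """Hypothetical toy graph: two k-cliques joined by a single edge."""
    adj = np.zeros((2 * k, 2 * k))
    adj[:k, :k] = 1
    adj[k:, k:] = 1
    np.fill_diagonal(adj, 0)
    adj[k - 1, k] = adj[k, k - 1] = 1
    return adj

adj = two_cliques(5)
whole_clique = conductance(adj, range(5))  # only one edge leaves the set
part_clique = conductance(adj, [0, 1])     # most edges leave the set
print(whole_clique, part_clique)
```

In this toy graph a whole clique is a "good" cluster (low conductance), while a fragment of a clique is not; the slide's claim is that in real networks no such good clusters exist at large size scales.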

Page 10

Anecdote 3: Approx. comp. and implicit regularization

Mahoney “Approximate Computation and Implicit Regularization for Very Large-scale Data Analysis” (2012)

Explicitly-imposed regularization

• Traditionally, regularization uses explicit norm constraint to make sure solution vector is “small” and not-too-complex

• min ||f|| + λ||g(x)||

[Figure: a map f from a metric space (X,d) to (X’,d’), sending points x and y at distance d(x,y) to f(x) and f(y)]

Implicitly-imposed regularization

• Binning, pruning, early stopping, etc.

• Design decisions engineers make

• Approximation algorithms implicitly embed data in a “nice” metric/geometric place and then round the solution.

Big question: Can we formalize the notion that/when approximate computation in and of itself can implicitly lead to “better” or “more regular” solutions than exact computation? (Short answer: yes!)
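As a small illustration of the early-stopping item above, the following sketch runs a few gradient-descent steps on a least-squares problem and compares the result with the exact solution. This is a standard toy example chosen here for illustration, not the formalization in the cited paper; the problem sizes, step size, and iteration count are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
b = rng.standard_normal(50)

# Exact computation: the least-squares solution.
x_exact = np.linalg.lstsq(A, b, rcond=None)[0]

# Approximate computation: gradient descent from zero, stopped early.
step = 1.0 / np.linalg.norm(A, 2) ** 2  # 1 / (largest singular value)^2
x = np.zeros(20)
norms = []
for _ in range(10):
    x = x - step * A.T @ (A @ x - b)
    norms.append(np.linalg.norm(x))
x_early = x

# The early-stopped iterate is "smaller" than the exact solution, much as if
# an explicit norm penalty had been imposed.
print(np.linalg.norm(x_early), np.linalg.norm(x_exact))
```

Started from zero with this step size, each spectral component of the iterate grows monotonically toward its exact value, so stopping early shrinks the solution without any explicit penalty term.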

Page 11

Lessons from the anecdotes

We are being forced to engineer a union between two very different worldviews on what are fruitful ways to view the data

• in spite of our best efforts not to

Often fruitful to consider the statistical properties implicit in worst-case algorithms

• rather than first doing statistical modeling and then applying a computational procedure as a black box

• for both anecdotes, this was essential for leading to “useful theory”

How to extend these ideas to “bridge the gap” b/w the theory and practice of MMDS (Modern Massive Data Set) analysis?

Mahoney “Algorithmic and Statistical Perspectives on Large-Scale Data Analysis” (2010)

Page 12

Outline

• “Algorithmic” and “statistical” perspectives on data problems

• Genetics (and other scientific) application: DNA SNP analysis --> choose columns from a matrix

PMJKPGKD, Genome Research ’07; PZBCRMD, PLOS Genetics ’07; Mahoney and Drineas, PNAS ’09; DMM, SIMAX ’08; BMD, SODA ’09; etc.

• Internet application: Community finding --> partitioning a graph

LLDM (WWW ’08 & TR ’08-IM ’09 & WWW ’10)

Key issue: what was going on “under the hood” in these two applications. Use statistical properties implicit in worst-case algorithms to make domain-specific claims!

Page 13

Statistical leverage, coherence, etc.

Definition: Given a “tall” n x d matrix A, i.e., with n > d, let U be any n x d orthogonal basis for span(A), & let the d-vector $U_{(i)}$ be the ith row of U. Then:

•  the statistical leverage scores are

$\lambda_i = \|U_{(i)}\|_2^2$, for $i \in \{1,\dots,n\}$

•  the coherence is

$\gamma = \max_{i \in \{1,\dots,n\}} \lambda_i$

•  the (i,j)-cross-leverage scores are

$U_{(i)}^T U_{(j)} = \langle U_{(i)}, U_{(j)} \rangle$

Mahoney and Drineas (2009, PNAS); Drineas, Magdon-Ismail, Mahoney, and Woodruff (2012, JMLR)
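The definition above translates almost directly into code. The sketch below is one way to compute these quantities exactly with NumPy, taking U from a reduced QR factorization (any orthonormal basis for span(A) gives the same scores); the function names and the synthetic matrix with one inflated row are assumptions made for illustration.

```python
import numpy as np

def leverage_scores(A):
    """Row leverage scores of a tall, full-column-rank matrix A."""
    Q, _ = np.linalg.qr(A)           # reduced QR: Q is n x d with orthonormal columns
    return np.sum(Q ** 2, axis=1)    # squared Euclidean norm of each row of Q

def cross_leverage_scores(A):
    """All (i, j) cross-leverage scores; the diagonal holds the leverage scores."""
    Q, _ = np.linalg.qr(A)
    return Q @ Q.T

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))
A[0] *= 100.0                        # make row 0 an outlier

scores = leverage_scores(A)
coherence = scores.max()
print(scores.sum(), coherence, int(np.argmax(scores)))
```

The scores sum to d (here 5) and each lies between 0 and 1; the inflated row ends up with a score close to 1, which is why leverage scores are used to flag influential rows.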

Page 14

Single Nucleotide Polymorphisms: the most common type of genetic variation in the genome across different individuals.

They are known locations in the human genome where two alternate nucleotide bases (alleles) are observed (out of A, C, G, T).

[Genotype matrix: rows are individuals, columns are SNPs]

… AG CT GT GG CT CC CC CC CC AG AG AG AG AG AA CT AA GG GG CC GG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG …

… GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CT AA GG GG CC GG AA GG AA CC AA CC AA GG TT AA TT GG GG GG TT TT CC GG TT GG GG TT GG AA …

… GG TT TT GG TT CC CC CC CC GG AA AG AG AA AG CT AA GG GG CC AG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG …

… GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CC GG AA CC CC AG GG CC AC CC AA CG AA GG TT AG CT CG CG CG AT CT CT AG CT AG GT GT GA AG …

… GG TT TT GG TT CC CC CC CC GG AA GG GG GG AA CT AA GG GG CT GG AA CC AC CG AA CC AA GG TT GG CC CG CG CG AT CT CT AG CT AG GG TT GG AA …

… GG TT TT GG TT CC CC CG CC AG AG AG AG AG AA CT AA GG GG CT GG AG CC CC CG AA CC AA GT TT AG CT CG CG CG AT CT CT AG CT AG GG TT GG AA …

… GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA TT AA GG GG CC AG AG CG AA CC AA CG AA GG TT AA TT GG GG GG TT TT CC GG TT GG GT TT GG AA …

Matrices including thousands of individuals and hundreds of thousands of SNPs are available.

Applications in: Human Genetics

Page 15

The Human Genome Diversity Panel (HGDP)

HGDP data

•  1,033 samples

•  7 geographic regions

•  52 populations

Cavalli-Sforza (2005) Nat Genet Rev

Rosenberg et al. (2002) Science

Li et al. (2008) Science

HapMap Phase 3

HapMap Phase 3 data

•  1,207 samples

•  11 populations

[Figure labels: ASW, MKK, LWK, & YRI; CEU; TSI; JPT, CHB, & CHD; GIH; MEX]

The International HapMap Consortium (2003, 2005, 2007) Nature

Apply SVD/PCA on the (joint) HGDP and HapMap Phase 3 data.

Matrix dimensions:

2,240 subjects (rows)

447,143 SNPs (columns)

Dense matrix: over one billion entries

Page 16

The Singular Value Decomposition (SVD)

The formal definition:

Given any m x n matrix A, one can decompose it as:

$A = U \Sigma V^T$

ρ: rank of A

U (V): orthogonal matrix containing the left (right) singular vectors of A.

Σ: diagonal matrix containing $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_\rho$, the singular values of A.

Important: Keeping top k singular vectors provides “best” rank-k approximation to A (w.r.t. Frobenius norm, spectral norm, etc.):

$A_k = \arg\min \{ \|A - X\|_{2,F} : \mathrm{rank}(X) \le k \}$.
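A short numerical check of the "best rank-k" statement may help. The sketch below builds $A_k$ from the top k singular triplets with `numpy.linalg.svd` and compares the approximation errors against the discarded singular values; the random test matrix and the helper name `best_rank_k` are illustrative assumptions.

```python
import numpy as np

def best_rank_k(A, k):
    """Truncated SVD: keep the top k singular values and vectors of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_k = (U[:, :k] * s[:k]) @ Vt[:k]   # U_k diag(s_k) V_k^T
    return A_k, s

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 20))
k = 3
A_k, s = best_rank_k(A, k)

fro_err = np.linalg.norm(A - A_k, "fro")   # equals sqrt(sum of discarded sigma_i^2)
spec_err = np.linalg.norm(A - A_k, 2)      # equals sigma_{k+1}
print(fro_err, np.sqrt(np.sum(s[k:] ** 2)), spec_err, s[k])
```

The two printed pairs agree up to rounding, which is the content of the optimality statement: no rank-k matrix can do better in either norm.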

Page 17

[Figure labels: Africa; Middle East; South Central Asia; Europe; Oceania; East Asia; America; Gujarati Indians; Mexicans]

•  Top two Principal Components (PCs or eigenSNPs) (Lin and Altman (2005) Am J Hum Genet)

•  The figure renders visual support to the “out-of-Africa” hypothesis.

•  Mexican population seems out of place: we move to the top three PCs.

Paschou, Lewis, Javed, & Drineas (2010) J Med Genet

Page 18

[Figure labels: Africa; Middle East; S C Asia & Gujarati; Europe; Oceania; East Asia; America]

Not altogether satisfactory: the principal components are linear combinations of all SNPs, and, of course, cannot be assayed!

Can we find actual SNPs that capture the information in the singular vectors?

Formally: spanning the same subspace.

Paschou, Lewis, Javed, & Drineas (2010) J Med Genet

Page 19

Applications in: Astronomy

CMB Surveys (pixels)

•  1990 COBE 1000

•  2000 Boomerang 10,000

•  2002 CBI 50,000

•  2003 WMAP 1 Million

•  2008 Planck 10 Million

Galaxy Redshift Surveys (obj)

•  1986 CfA 3500

•  1996 LCRS 23000

•  2003 2dF 250000

•  2008 SDSS 1000000

•  2012 BOSS 2000000

•  2012 LAMOST 2500000

Angular Galaxy Surveys (obj)

•  1970 Lick 1M

•  1990 APM 2M

•  2005 SDSS 200M

•  2011 PS1 1000M

•  2020 LSST 30000M

Time Domain

•  QUEST

•  SDSS Extension survey

•  Dark Energy Camera

•  Pan-STARRS

•  LSST…

“The Age of Surveys”: generate petabytes/year …

Szalay (2012, MMDS)

Page 20

Galaxy properties from galaxy spectra

Continuum Emissions Spectral Lines

Can we select “informative” frequencies (columns) or images (rows) “objectively”?

4K x 1M SVD Problem: ideal for randomized matrix algorithms

Szalay (2012, MMDS)

Page 21

Galaxy diversity from PCA

•  1st PC: [Average Spectrum]

•  2nd PC: [Stellar Continuum]

•  3rd PC: [Finer Continuum Features + Age]

•  4th PC: [Age] Balmer series hydrogen lines

•  5th PC: [Metallicity] Mg b, Na D, Ca II Triplet

Page 22

Interpreting the SVD - be very careful

Reification

•  assigning a “physical reality” to large singular directions

•  invalid in general

Just because “If the data are ‘nice’ then SVD is appropriate” does NOT imply the converse.

Mahoney and Drineas “CUR Matrix Decompositions for Improved Data Analysis” (PNAS, 2009)

Page 23

Issues with eigen-analysis

•  Computing large SVDs: computational time

   •  On commodity hardware (e.g., a 4GB RAM, dual-core laptop), using MatLab 7.0 (R14), the computation of the SVD of the dense 2,240-by-447,143 matrix A takes about 20 minutes.

   •  Computing this SVD is not a one-liner, since we cannot load the whole matrix in RAM (runs out-of-memory in MatLab).

   •  Instead, compute the SVD of $AA^T$.

   •  In a similar experiment, compute 1,200 SVDs on matrices of dimensions (approx.) 1,200-by-450,000 (roughly, a full leave-one-out cross-validation experiment). (e.g., Drineas, Lewis, & Paschou (2010) PLoS ONE)

•  Selecting actual columns that “capture the structure” of the top PCs

   •  Combinatorial optimization problem; hard even for small matrices.

   •  Often called the Column Subset Selection Problem (CSSP).

   •  Not clear that such “good” columns even exist.

   •  Avoid “reification” problem of “interpreting” singular vectors!
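The "compute the SVD of $AA^T$" step can be sketched as follows. A dense 2,240-by-447,143 matrix has 1,001,600,320 entries, about 8 GB if stored in double precision, whereas $AA^T$ is only 2,240-by-2,240 and can be accumulated one column block at a time. The code is a hedged NumPy illustration of that idea on a small stand-in matrix, not the MatLab procedure used in the cited experiments; the block size and function name are assumptions.

```python
import numpy as np

def svd_via_gram(blocks):
    """Left singular vectors and singular values of a short, wide matrix A.

    `blocks` yields column blocks of A (each m x n_j), so A never has to be
    held in memory at once. Uses A A^T = U diag(sigma^2) U^T.
    """
    gram = None
    for block in blocks:
        g = block @ block.T
        gram = g if gram is None else gram + g
    evals, evecs = np.linalg.eigh(gram)       # eigenvalues in ascending order
    order = np.argsort(evals)[::-1]           # reorder to descending
    evals = np.clip(evals[order], 0.0, None)  # guard against tiny negative values
    return evecs[:, order], np.sqrt(evals)

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 1000))            # small stand-in for the SNP matrix
blocks = [A[:, j:j + 250] for j in range(0, 1000, 250)]

U, sigma = svd_via_gram(blocks)
sigma_ref = np.linalg.svd(A, compute_uv=False)
print(np.max(np.abs(sigma - sigma_ref)))
```

One caveat with this shortcut: forming $AA^T$ squares the condition number, so small singular values are recovered less accurately than with a direct SVD.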

Page 24

Low-rank approximation algorithms

Many randomized algorithms for low-rank matrix approximation use extensions of these basic least-squares ideas:

•  Random sampling CX/CUR algorithms (DMM07)

•  Random projection algorithms (S08)

•  Column subset selection problem (BMD09)

•  Numerical implementations (LWMRT07,WLRT08,MRT11,RST09)
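To give a flavor of the random projection approach in the list above, here is a minimal sketch of a generic Gaussian-sketch low-rank approximation: project A onto a few random directions, orthonormalize, and take a small SVD. It is not claimed to be any of the specific cited algorithms or implementations; the oversampling amount, seeds, and function name are assumptions.

```python
import numpy as np

def randomized_low_rank(A, k, oversample=10, seed=0):
    """Approximate top-k SVD of A using a Gaussian random projection."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    omega = rng.standard_normal((n, k + oversample))  # random test matrix
    Q, _ = np.linalg.qr(A @ omega)                    # orthonormal basis for the sketch
    B = Q.T @ A                                       # small (k + oversample) x n matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]

rng = np.random.default_rng(1)
A = rng.standard_normal((200, 6)) @ rng.standard_normal((6, 300))  # exactly rank 6

U, s, Vt = randomized_low_rank(A, k=6)
rel_err = np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A)
print(rel_err)
```

On an exactly low-rank matrix the sketch captures the range of A and the error is at rounding level; on matrices with slowly decaying spectra the error depends on the oversampling and on the tail of the singular values.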

Page 25

Recall, SVD decomposes a matrix as …

[Figure label: Top k left singular vectors]

The SVD has very strong optimality properties, e.g., the matrix $U_k$ is the “best” in many ways.

•  Note that, given $U_k$, the best $X = U_k^T A = \Sigma_k V_k^T$.

•  SVD can be computed fairly quickly.

•  The columns of $U_k$ are linear combinations of up to all columns of A.
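The note on the best X follows in two lines from the orthonormality of the columns of $U_k$. The derivation below is a sketch in the notation of the SVD slide ($A = U \Sigma V^T$), with $\Sigma_k$ and $V_k$ denoting the top-k singular values and right singular vectors, which is an assumption about the subscripts dropped on the slide.

```latex
\begin{aligned}
\arg\min_X \|A - U_k X\|_F
  &= U_k^{+} A = U_k^T A
  && \text{(least squares; } U_k^T U_k = I_k\text{)} \\
U_k^T A
  &= U_k^T U \Sigma V^T = \Sigma_k V_k^T
  && \text{(} U_k^T U = [\, I_k \;\; 0 \,]\text{)} \\
U_k X
  &= U_k \Sigma_k V_k^T = A_k
  && \text{(the best rank-}k\text{ approximation)}
\end{aligned}
```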

Page 26

CX (and CUR) matrix decompositions

[Figure: A ≈ C X, where C consists of c columns of A and X is carefully chosen]

Goal: choose actual columns C to make (some norm) of A-CX small.

Why?

If A is a subject-SNP matrix, then selecting representative columns is equivalent to selecting representative SNPs to capture the same structure as the top eigenSNPs.

If A is a frequency-image astronomical matrix, then selecting representative columns is equivalent to selecting representative frequencies/wavelengths.

Note: To make C small, we want c as small as possible!

Mahoney and Drineas (2009, PNAS); Drineas, Mahoney, and Muthukrishnan (2008, SIMAX)

Page 27

CX (and CUR) matrix decompositions

[Figure label: c columns of A]

Easy to see optimal $X = C^{+}A$, where $C^{+}$ is the Moore-Penrose pseudoinverse of C.

Hard to find good columns (e.g., SNPs) of A to include in C.

This Column Subset Selection Problem (CSSP), heavily studied in NLA, is a hard combinatorial problem.

Mahoney and Drineas (2009, PNAS); Drineas, Mahoney, and Muthukrishnan (2008, SIMAX)

Two issues are connected

•  There exist (1+ε) “good” columns in any matrix that contain information about the top principal components: $\|A - CX\|_F \le (1+\epsilon)\,\|A - A_k\|_F$

•  We can identify such columns via a simple statistic: the leverage scores.

•  This does not immediately imply faster algorithms for the SVD, but, combined with random projections, it does!

•  Analysis (almost!) boils down to understanding least-squares approximation.
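The two ingredients above, leverage scores to pick columns and $X = C^{+}A$ to fit them, can be combined in a few lines. The sketch below is a simplified illustration: it computes column leverage scores from the top k right singular vectors (exactly, via a full SVD) and samples c distinct columns with probabilities proportional to them. The published algorithms use different sampling schemes and rescaling, so the sampling details, parameter values, and function name here are assumptions.

```python
import numpy as np

def cx_decomposition(A, k, c, seed=0):
    """Pick c actual columns of A by leverage-score sampling and fit X = C^+ A."""
    rng = np.random.default_rng(seed)
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    lev = np.sum(Vt[:k] ** 2, axis=0)   # column leverage scores w.r.t. top-k subspace
    probs = lev / lev.sum()             # the scores sum to k; normalize to probabilities
    cols = rng.choice(A.shape[1], size=c, replace=False, p=probs)
    C = A[:, cols]
    X = np.linalg.pinv(C) @ A           # optimal X for the chosen columns
    return cols, C, X

rng = np.random.default_rng(1)
low_rank = rng.standard_normal((200, 4)) @ rng.standard_normal((4, 50))
A = low_rank + 0.01 * rng.standard_normal((200, 50))   # rank-4 signal plus noise

cols, C, X = cx_decomposition(A, k=4, c=12)
err_cx = np.linalg.norm(A - C @ X, "fro")
s = np.linalg.svd(A, compute_uv=False)
err_best = np.sqrt(np.sum(s[4:] ** 2))                 # ||A - A_k||_F for k = 4
print(sorted(cols.tolist()), err_cx, err_best)
```

Comparing `err_cx` with `err_best` shows how close a dozen actual columns come to the best rank-4 approximation on this synthetic example; unlike singular vectors, the selected columns are real columns of A and can be interpreted (or assayed) directly.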

Page 28

[Figure: PCA-scores of SNPs by chromosomal order, with the top 30 PCA-correlated SNPs marked (*); panels labeled Africa, Europe, Asia, America]

Selecting PCA SNPs for individual assignment to four continents (Africa, Europe, Asia, America)

Paschou et al (2007; 2008) PLoS Genetics

Paschou et al (2010) J Med Genet

Drineas et al (2010) PLoS One

Javed et al (2011) Annals Hum Genet

Page 29

Ranking Astronomical Line Indices

(Worthey et al. 94; Trager et al. 98)

Subspace Analysis of Spectra Cutouts:

- Orthogonality

- Divergence

- Commonality

(Yip et al. 2013 subm.)

Page 30

Identifying new line indices objectively

(Yip et al. 2013 subm.)

Szalay (2012, MMDS); Yip et al (2013)

Page 31

New Spectral Regions (M2; k=5; overselecting 10X; combine if <30A)

Szalay (2012, MMDS); Yip et al (2013)

Old Lick indices are “ad hoc”

New indices are “objective”

•  Recover atomic lines

•  Recover molecular bands

•  Recover Lick indices

•  Informative regions are orthogonal to each other, in contrast to Lick regions

(Yip et al. 2013 subm.)

Page 32

Outline

• “Algorithmic” and “statistical” perspectives on data problems

• Genetics application: DNA SNP analysis --> choose columns from a matrix

PMJKPGKD, Genome Research ’07; PZBCRMD, PLOS Genetics ’07; Mahoney and Drineas, PNAS ’09; DMM, SIMAX ’08; BMD, SODA ’09

• Internet application: Community finding --> partitioning a graph

LLDM (WWW ’08 & TR ’08-IM ’09 & WWW ’10)

Key issue: what was going on “under the hood” in these two applications. Use statistical properties implicit in worst-case algorithms to make domain-specific claims!

Page 33

Outline

• “Algorithmic” and “statistical” perspectives on data problems

• Genetics (and other scientific) application

• Internet application

Different problems

Common approaches

Different perspectives matter

Going forward: opportunities for scientific UQ/ML!