hdda-vi scientific program - fields.utoronto.ca filesession 8 chair: farouk nathoo 16:00 - 16:30...

HDDA-VI SCIENTIFIC PROGRAM

Upload: leque

Post on 16-Aug-2019



Wednesday May 25th

08:15 - 08:45 Registration and Coffee

08:45 - 09:00 Opening Remarks HDDA-VI

Session 1: Plenary Talk
Chair: Ejaz Ahmed

09:00 - 09:50 Kjell Doksum, University of Wisconsin, Madison
Title: Constructing Statistical Methods for High Dimensional Data

9:50 - 10:20 Coffee Break


Session 2
Chair: Yulia Gel

10:20 - 10:50 Ekaterina Smirnova, University of Wyoming
Title: Network Connectivity Based Filtering of Microbiome Data

10:50 - 11:20 Ping Shou Zhong, Michigan State University
Title: Homogeneity test of covariance matrices and change-points identification with high dimensional longitudinal data

11:20 - 11:50 Jabed H. Tomal, University of Toronto
Title: Construction of Ensemble by Exploiting Richness of Feature Variables in High-Dimensional Data with Applications in Protein Homology

11:50 - 12:20 Hossein Zareamoghaddam, Western University
Title: An Efficient Approach for Some Semi-Nonparametric Models Applicable to Mass-Spectrometry Data

12:20 - 13:30 Lunch Break

Session 3
Chair: Yang Feng

13:30 - 14:00 Ivor Cribben, University of Alberta
Title: Graphical Models and Time-Varying Graphical Models for Estimating Brain Networks

14:00 - 14:30 Guoqing Diao, George Mason University
Title: Covariate-Adjusted Semiparametric Transformation Graphical Models with Applications to Time Series Imaging Data

14:30 - 15:00 George Michailidis, University of Michigan
Title: Joint Structural Estimation of Multiple Graphical Models

15:00 - 15:30 Bei Jiang, University of Alberta
Title: A New Graphical Approach for Quick and Exact Simulation of Correlated Discrete Variables


15:30 - 16:00 Coffee Break

Session 4
Chair: Farouk Nathoo

16:00 - 16:30 Rudy Beran, UC Davis
Title: Hypercube Fits to the Multivariate Linear Model

16:30 - 17:00 Rachel Levanger, Rutgers University
Title: Recent Developments in Topological Data Analysis

17:00 - 17:30 Abbas Khalili, McGill University
Title: New Estimation and Model Selection Methods in Heterogeneous Time Series Models

17:30 - 18:00 Linglong Kong, University of Alberta
Title: Quantile regression with varying coefficients for functional responses

Thursday May 26th

08:30 - 09:00 Coffee

Session 5: Plenary Talk
Chair: Shakhawat Hossain

09:00 - 09:50 Jianqing Fan, Princeton University
Title: Robust Low-Rank Matrix Recovery

9:50 - 10:20 Coffee Break


Session 6
Chair: Bahadir Yuzbasi

10:20 - 10:50 Hongyuan Cao, University of Missouri
Title: Large scale multiple testing for clustered signals

10:50 - 11:20 Yuehua Cui, Michigan State University
Title: Integrative genetical genomics analysis incorporating network structures

11:20 - 11:50 Yang Feng, Columbia University
Title: Neyman-Pearson Classification under High-Dimensional Settings

11:50 - 12:20 Shuangge Ma, Yale University
Title: Promote similarity in integrative analysis

12:20 - 13:30 Lunch Break

Session 7
Chair: Yulia Gel

13:30 - 14:00 Gokhan Yildirim, York University
Title: On Lasso trade-off diagram under correlated random design matrix

14:00 - 14:30 Xuewen Lu, University of Calgary
Title: Group Selection in A Semiparametric Accelerated Failure Time Model

14:30 - 15:00 Jiwei Zhao, SUNY-Buffalo
Title: Variable selection in the presence of nonignorable missing data

15:00 - 15:30 Yi Li, University of Michigan
Title: Classification with Ultrahigh-Dimensional Features

15:30 - 16:00 Coffee Break and Poster Session


Session 8
Chair: Farouk Nathoo

16:00 - 16:30 Anand Vidyashankar, George Mason University
Title: Trade-off between Efficiency and Robustness in Post-Model Selection Inference

16:30 - 17:00 Bahadir Yuzbasi, Inonu University
Title: Stein-Type Generalized Ridge Regression in High-Dimensional Sparse Models

17:00 - 17:30 Shakhawat Hossain, University of Winnipeg
Title: Shrinkage and pretest estimators for longitudinal data analysis under partially linear models

17:30 - 18:00 Kun Liang, University of Waterloo
Title: False discovery rate estimation with covariates

Friday May 27th

08:30 - 09:00 Coffee

Session 9
Chair: Shakhawat Hossain

9:00 - 9:30 Folefac D. Atem, University of Texas Health Science Center-Houston
Title: Imputation Methods and Survival Regression

9:30 - 10:00 Ali Shojaie, University of Washington
Title: Network Reconstruction From High Dimensional Ordinary Differential Equations

10:00 - 10:30 L. Leticia Ramirez-Ramirez, ITAM and UTD
Title: Semiparametric Trend Estimation of Multivariate Time Series with Controlled Smoothness


10:30 - 11:00 Coffee Break

11:00 - 11:30 Matt Taddy, University of Chicago
Title: Machine Learning in the Presence of Instrumental Variables

11:30 - 12:00 Luke Bornn, Simon Fraser University
Title: From Pixels to Points: Using Tracking Data to Measure Performance in Professional Sports

12:00 - 13:00 Lunch Break

Session 10
Chair: Ejaz Ahmed

13:00 - 13:30 Slava Lyubchich, UMCES
Title: Fast Nonparametric Bootstrap for the Inference on Degree Distribution in Random Networks

13:30 - 14:00 Vahid Partovi Nia, Ecole Polytechnique de Montreal
Title: Bayesian Clustering of Data and Dimensions

14:00 - 14:30 Mohamed Amezziane, Central Michigan University
Title: Penalty-Free High-Dimensional Regression Model and Variable Selection

14:30 Panel discussions – future directions


Abstracts: Invited Talks

• Speaker: Mohamed Amezziane, Central Michigan University

Title: Penalty-Free High-Dimensional Regression Model and Variable Selection

Coauthor(s): S. Ejaz Ahmed

Abstract: A high-dimensional regression analysis is conducted through simple statistical manipulation of observation-wise estimators of linear regression coefficients, leading to new coefficient estimators along with their standard errors. These estimators are then used to devise simple variable selection techniques. The estimators’ performance is assessed both analytically and through simulation.

• Speaker: Folefac D. Atem, University of Texas Health Science Center-Houston

Title: Imputation Methods and Survival Regression

Coauthor(s): Jing Qian, Rebecca Betensky

Abstract: The association between maternal age of onset of dementia and beta-amyloid deposition (measured by in vivo PET imaging) in cognitively normal older offspring is of interest. In a regression model for beta-amyloid, special methods are required due to the random right censoring of the covariate of maternal age of onset of dementia. Prior literature has proposed methods to address the problem of censoring due to assay limit of detection, but not random censoring. We propose imputation methods and a survival regression method that do not require parametric assumptions about the distribution of the censored covariate. Existing imputation methods address missing covariates, but not right censored covariates. In simulation studies, we compare these methods to the simple, but inefficient complete case analysis, and to thresholding approaches. We apply the methods to the Alzheimer’s study.

• Speaker: Rudolf Beran, UC Davis

Title: Hypercube Fits to the Multivariate Linear Model

Abstract: Hypercube fits to the multivariate linear model complete the class of Penalized Least Squares (PLS) fits with quadratic penalties. They include submodel Least Squares fits that are limits of PLS fits as penalty weights tend to infinity. Through control of condition number, they improve the numerical stability of PLS fits. Adaptive hypercube fits that minimize estimated risk can behave asymptotically, as the number of regressors increases, like their oracle counterparts. In particular, suitable adaptive hypercube fits extend to general regression designs the asymptotic risk reduction achieved by multiple Efron-Morris affine shrinkage in balanced orthogonal designs. Reduced risk fits to unbalanced MANOVA designs illustrate.

• Speaker: Luke Bornn, Simon Fraser University

Title: From Pixels to Points: Using Tracking Data to Measure Performance in Professional Sports


Abstract: In this talk I will explore how players perform, both individually and as a team, on a basketball court. By blending advanced spatio-temporal models with geography-inspired mapping tools, we are able to understand player skill far better than either individual tool allows. Using optical tracking data consisting of hundreds of millions of observations, I will demonstrate these ideas by characterizing defensive skill and decision making in NBA players.

• Speaker: Hongyuan Cao, Department of Statistics, University of Missouri-Columbia

Title: Large scale multiple testing for clustered signals

Coauthor(s): Wei Biao Wu

Abstract: We propose a change point detection method for large scale multiple testing problems with clustered signals. Unlike the classic change point detection setup, the signals can vary in size and distribution within a cluster. The spatial structure on the signals enables us to accurately delineate the boundaries between null and alternative hypotheses. New test statistics are proposed for observations from one sequence and multiple sequences. Their asymptotic distributions are established with consistent estimators for unknown parameters. We allow the variances to be heteroscedastic in the multiple sequence case, which greatly expands the applicability of the proposed method. Simulation studies demonstrate that the large sample approximations are adequate for practical use and may yield favorable performance. Datasets from array CGH and DNA methylation are used to demonstrate the utility of the proposed methods.

• Speaker: Ivor Cribben, Alberta School of Business

Title: Graphical models and time-varying graphical models for estimating brain networks

Coauthor(s): Yi Yu and Yunan Zhu

Abstract: Graphical models are frequently used to explore networks among a set of variables. In the first part of this talk, we explore the practical performance of several sparse graphical methods and several selection criteria for estimating brain networks using both simulated multivariate normal data and autocorrelated data. We use evaluation criteria to compare the methods and thoroughly discuss the strengths and weaknesses of each of them. We also apply the methods to a resting state functional magnetic resonance imaging (fMRI) experiment and to a language processing experiment. In the second part of the talk, we consider data-driven methods that detect change points in the network structure of a multivariate time series taken from brain imaging experiments. The methods allow for estimation of both the time of change in the network structure and the graph between each pair of change points, without prior knowledge of the number or location of the change points. The methods are applied to various simulated high dimensional data sets as well as to fMRI data sets. The results illustrate the methods’ ability to observe how the network structure between different brain regions changes over the experimental time course.

• Speaker: Yuehua Cui, Department of Statistics and Probability, Michigan State University


Title: Integrative genetical genomics analysis incorporating network structures

Coauthor(s): Bin Gao and Xu Liu

Abstract: Genetical genomics data provide promising opportunities for integrative analysis of gene expression and genotype data. Lin et al. (2015) recently proposed an instrumental variables (IV) regression framework to select important genes with high dimensional genetical genomics data. The IV regression solves the endogeneity problem due to correlation between gene expressions and the error term, and hence improves the performance of gene selection. As genes function in networks to fulfill their joint task, incorporating network or graph structures in a regression model can further improve gene selection performance. In this work, we propose a graph constrained penalized IV regression framework to solve the endogeneity issue and to improve the selection performance via incorporating a gene network structure. We propose a two-step estimation procedure by adopting a network constrained regularization method to obtain better variable selection and estimation, and further establish the selection consistency. Simulation and real data analysis are conducted to show the utility of the method.

• Speaker: Guoqing Diao, George Mason University

Title: Covariate-Adjusted Semiparametric Transformation Graphical Models with Applications to Time Series Imaging Data

Coauthor(s): Anand Vidyashankar and Ivor Cribben

Abstract: High-dimensional time series data are frequently encountered in imaging studies and graphical models have been used to assess brain connectivity. In these models, it is common to assume that the variables are multivariate normal yielding the classical Gaussian graphical model. However, violation of the distributional assumption can lead to biased statistical estimates and interpretations. To address this issue, we propose a new semiparametric transformation Gaussian graphical model, in which the time series of each variable of interest are multivariate normal conditional on the covariates after an unspecified transformation. The proposed model also accounts for the correlations amongst time series data through the use of random effects. We propose a three-step procedure to construct the graph. Extensive simulation studies demonstrate that the proposed model outperforms the existing methods when the model assumptions are violated and is comparable to the existing methods under the true model specification. An application to a real data set is also provided.

• Speaker: Kjell Doksum, Department of Statistics, University of Wisconsin, Madison

Title: Constructing Statistical Methods for High Dimensional Data

Abstract: Recent work in regression analysis has embraced an approach to high dimensional data analysis that consists of selecting at random subsets with a relatively small number of predictors, doing variable selection and/or statistical inference on each subset, and then merging the results from the subsets. The merging may involve further variable selection and/or statistical inference on the merged subsets. This approach makes it possible to construct methods for high dimensional data analysis using methods that were designed for small dimensional data. This talk will present such constructions and examine their properties.


• Speaker: Jianqing Fan, Princeton University

Title: Robust Low-Rank Matrix Recovery

Coauthor(s): Weichen Wang and Ziwei Wang

Abstract: This paper focuses on robust estimation of the low-rank matrix from the trace regression model. It encompasses four popular problems: sparse linear models, compressive sensing, matrix completion and multi-task regression as its specific examples. Instead of optimizing nuclear-norm penalized least-squares, our robust penalized least-squares approach is to replace the quadratic loss by its robust version. The robust version is obtained by the appropriate truncations or shrinkage of the data and hence is very easy to implement. Under only a bounded 2 + δ moment condition on the noise, we show that the proposed robust penalized trace regression yields an estimator that possesses the same rates as those presented in Negahban and Wainwright’s work under sub-Gaussian error assumption. The rates of convergence are explicitly derived. As a byproduct, we also give a robust covariance matrix estimation and establish its concentration inequality.
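
As a rough illustration of the truncation idea mentioned in the abstract above, the following Python sketch clips observations to a fixed range before averaging them. It is not the authors' estimator: the helper name `shrink`, the threshold `tau`, and the toy data are assumptions made for illustration, and the trace regression setting is reduced here to estimating a single mean.

```python
def shrink(values, tau):
    """Clip every value to the interval [-tau, tau] (hypothetical helper)."""
    return [max(-tau, min(tau, v)) for v in values]


def mean(values):
    """Plain arithmetic mean."""
    return sum(values) / len(values)


# Toy data (an assumption): 98 small observations centred at zero plus one gross outlier.
data = [0.1 * ((i % 7) - 3) for i in range(98)] + [500.0]

tau = 1.0  # truncation level, chosen by hand for this example
plain_mean = mean(data)                # pulled far from zero by the outlier
shrunk_mean = mean(shrink(data, tau))  # outlier contributes at most tau

print(plain_mean, shrunk_mean)
```

In this toy setting the single outlier dominates the plain average, while the clipped average stays close to the centre of the bulk of the data.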

• Speaker: Yang Feng, Columbia University

Title: Neyman-Pearson Classification under High-Dimensional Settings

Abstract: Most existing binary classification methods target the optimization of the overall classification risk and may fail to serve some real-world applications such as cancer diagnosis, where users are more concerned with the risk of misclassifying one specific class than the other. The Neyman-Pearson (NP) paradigm was introduced in this context as a novel statistical framework for handling asymmetric type I/II error priorities. It seeks classifiers with a minimal type II error and a constrained type I error under a user specified level. We construct classifiers with guaranteed theoretical performance under the NP paradigm in high-dimensional settings. Based on the fundamental Neyman-Pearson Lemma, we use a plug-in approach to construct NP-type classifiers for Naive Bayes models. The proposed classifiers satisfy the NP oracle inequalities, which are natural NP paradigm counterparts of the oracle inequalities in classical binary classification. Besides their desirable theoretical properties, we also demonstrate their numerical advantages in prioritized error control via both simulation and real data studies.
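
The asymmetric error control described in the abstract above can be sketched in a few lines of Python. This is a simplified stand-in, not the authors' plug-in classifier: it only bounds the empirical type I error on a hypothetical held-out sample of class-0 scores, whereas the abstract refers to theoretical guarantees. The function names, the scores, and the level `alpha` are assumptions for illustration.

```python
import math


def np_threshold(class0_scores, alpha):
    """Return an observed score t such that at most a fraction alpha
    of the class-0 scores exceed t."""
    ordered = sorted(class0_scores)
    n = len(ordered)
    # At least ceil(n * (1 - alpha)) class-0 scores must fall at or below t.
    k = max(1, math.ceil(n * (1 - alpha)))
    return ordered[k - 1]


def error_rates(class0_scores, class1_scores, threshold):
    """Empirical type I and type II errors of the rule
    'predict class 1 if score > threshold'."""
    type1 = sum(s > threshold for s in class0_scores) / len(class0_scores)
    type2 = sum(s <= threshold for s in class1_scores) / len(class1_scores)
    return type1, type2


# Hypothetical held-out scores; larger scores suggest class 1.
class0 = [i / 20 for i in range(20)]
class1 = [0.5 + i / 20 for i in range(20)]

alpha = 0.1  # user-specified bound on the type I error
t = np_threshold(class0, alpha)
type1, type2 = error_rates(class0, class1, t)
print(t, type1, type2)
```

The threshold is set from class-0 scores alone, so the type I error is constrained first and the type II error is whatever results, which mirrors the priority ordering of the NP paradigm.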

• Speaker: Shakhawat Hossain, University of Winnipeg

Title: Shrinkage and pretest estimators for longitudinal data analysis under partially linear models

Coauthor(s): S. Ejaz Ahmed, Grace Y. Yi, and B. Chen

Abstract: In this talk, we develop marginal analysis methods for longitudinal data under partially linear models. We employ the pretest and shrinkage estimation procedures to estimate the mean response parameters as well as the association parameters, which may be subject to certain restrictions. We provide the analytic expressions for the asymptotic biases and risks of the proposed estimators, and investigate their relative performance to the unrestricted semiparametric least squares estimator (USLSE). We show that if the dimension of association parameters exceeds two, the risk of the shrinkage estimators is strictly less than that of the USLSE in most of the parameter space. On the other hand, the risk of the pretest estimator depends on the validity of the restrictions of association parameters. A simulation study is conducted to evaluate the performance of the proposed estimators relative to that of the USLSE. A real data example is used to illustrate the practical usefulness of the proposed estimation procedures.

• Speaker: Bei Jiang, University of Alberta

Title: A new graphical approach for quick and exact simulation of correlated discrete variables

Coauthor(s): Mike Kouritzin

Abstract: Simulation of correlated discrete variables with specified marginals and covariances has important applications. For example, there is a need in neuroscience research to simulate discrete neural spike train data across brain regions. In this talk, we present a new graphical approach for simulating such random variables using an efficient one-pass algorithm, where the random sample is drawn for each variable in one iteration. We also give the conditions for compatibility of the marginal probabilities and covariances. This one-pass algorithm also leads to the construction of a family of Markov random fields on a directed acyclic graph with conditional and joint field distributions. A necessary and sufficient condition that guarantees the permutation property of the derived random field is studied.

• Speaker: Abbas Khalili, Dept. of Mathematics and Statistics, McGill University

Title: New Estimation and Model Selection Methods in Heterogeneous Time Series Models

Abstract: In this talk we discuss an extension of standard auto-regressive models to capture heterogeneous behavior in mean level, volatility and multi-modality of the conditional or marginal distributions of time series that are observed in practice in many financial and econometric examples. We will develop new estimation and model selection techniques in these models. The new methods are assessed theoretically, via simulations and a real data analysis.

• Speaker: Linglong Kong, University of Alberta

Title: Quantile regression with varying coefficients for functional responses

Coauthor(s): Xingcai Zhou, Rohana Karunamuni, and Hongtu Zhu

Abstract: With modern technology development, functional data are often observed in various scientific fields. Quantile regression has become an important statistical methodology. In this paper, we consider the estimation and inference for varying coefficient models for functional responses on quantile regression processes. We first propose to estimate the quantile smooth coefficient functions using local linear approximations, obtain the global uniform Bahadur representation of the estimator with respect to the time or the location and the quantile level, and show that the estimator converges weakly to a two-parameter continuous Gaussian process, and then we obtain asymptotic bias and mean integrated square error of smoothed individual functions and their uniform convergence rate under the given quantile level. We propose a global test for linear hypotheses of varying coefficient functions under quantile processes, and derive its asymptotic distribution under the null hypothesis; and also give their simultaneous confidence bands. To develop these inferences, some unknown error densities are estimated by the “residual-based” empirical distributions. A Monte Carlo simulation is conducted to examine the finite-sample performance of the proposed procedures. Finally, we illustrate the estimation and inference procedures of QRVC on diffusion tensor imaging data and ADHD-200 fMRI data.

• Speaker: Rachel Levanger, Rutgers University

Title: Recent Developments in Topological Data Analysis

Abstract: Topological data analysis (TDA) is a growing branch of mathematics that concerns the study of the shape of inherently high-dimensional data. One particularly useful tool is that of persistent homology, where n-dimensional topological features of the data are encoded into a collection of coordinates in the plane, called a persistence diagram. Recently, persistence diagrams have demonstrated their usefulness in generating summary statistics for machine learning algorithms. However, for particularly large datasets or noisy images, approximations or smoothing must first be applied to either make computations possible or to clean the signature of the features. In this talk, we introduce persistent homology and show a recent result that makes it possible to compute rigorous bounds on the amount of error these approximations introduce into the system being studied. We will focus on two examples: estimating the shape of a large point cloud and studying features in noisy images.

• Speaker: Yi Li, University of Michigan

Title: Classification with Ultrahigh-Dimensional Features

Abstract: Although much progress has been made in classification with high-dimensional features, classification with ultrahigh-dimensional features, wherein the features much outnumber the sample size, defies most existing work. This paper introduces a novel and computationally feasible multivariate screening and classification method for ultrahigh-dimensional data. Leveraging inter-feature correlations, the proposed method enables detection of marginally weak and sparse signals and recovery of the true informative feature set, and achieves asymptotically optimal misclassification rates. We also show that the proposed procedure provides more powerful discovery boundaries compared to those in Cai and Sun (2014) and Jin et al. (2009). The performance of the proposed procedure is evaluated using simulation studies and demonstrated via classification of patients with different post-transplantation renal functional types.

• Speaker: Kun Liang, University of Waterloo

Title: False discovery rate estimation with covariates

Abstract: Multiple testing is becoming an increasingly important topic in high-dimensional statistical analysis. However, most commonly used false discovery rate estimation and control methods do not take covariates into consideration. To better estimate false discovery rate, we propose a novel nonparametric method which efficiently utilizes the covariate information. Our proposed method enjoys some desirable theoretical properties. In addition, we evaluate the performance of our proposed method over existing methods using simulation studies.

• Speaker: Xuewen Lu, University of Calgary

Title: Group Selection in A Semiparametric Accelerated Failure Time Model

Coauthor(s): Longlong Huang and Karen Kopciuk

Abstract: In survival analysis, a number of regression models can be used to estimate the effects of covariates on the censored survival outcome. When covariates can be naturally grouped, group selection is important in these models. Motivated by the group bridge approach for variable selection in a multiple linear regression model, we consider group selection in a semiparametric accelerated failure time (AFT) model using Stute’s weighted least squares and a group bridge penalty. This method is able to simultaneously carry out feature selection at both the group and within-group individual variable levels, and enjoys the powerful oracle group selection property. Simulation studies indicate that the group bridge approach for the AFT model can correctly identify important groups and variables even with a high censoring rate. A real data analysis is provided to illustrate the application of the proposed method.

• Speaker: Vyacheslav Lyubchich, University of Maryland Center for Environmental Science

Title: Fast nonparametric bootstrap for the inference on degree distribution in random networks

Coauthor(s): Yulia R. Gel, L. Leticia Ramirez Ramirez

Abstract: Challenges in the inference on random networks relate to the restrictive and hard to validate parametric assumptions, as well as data volume and velocity, which deprive one of the ability to obtain population parameters directly. Sampling procedures, coupled with nonparametric bootstrap, circumvent the problems of parametric model specification and incomplete information about the network. The proposed nonparametric patchwork resampling adapts the "blocking" argument, developed for time series bootstrap and spatial data re-tiling, to random networks. In contrast to block bootstrap in time series, its primary focus is on mirroring the asymptotic distribution of certain statistics of interest rather than on recreating the data generating process. In this presentation, we focus on how the new bootstrap procedure can be used to quantify estimation uncertainty for network statistics that are functions of degree distribution. We present a new computationally efficient and data-driven cross-validation algorithm for selecting an optimal patch size. The suggested procedures are illustrated using simulated and observed social networks.

• Speaker: George Michailidis, University of Florida

Title: Joint Structural Estimation of Multiple Graphical Models

Coauthor(s): Jing Ma


Abstract: Gaussian graphical models capture dependence relationships between random variables through the pattern of nonzero elements in the corresponding inverse covariance matrices. To date, there has been a large body of literature on both computational methods and analytical results on the estimation of a single graphical model. However, in many application domains, one has to estimate several related graphical models, a problem that has also received attention in the literature. The available approaches usually assume that all graphical models are globally related. On the other hand, in many settings different relationships between subsets of the node sets exist between different graphical models. We develop methodology that jointly estimates multiple Gaussian graphical models, assuming that there exists prior information on how they are structurally related. For many applications, such information is available from external data sources. The proposed method consists of first applying neighborhood selection with a group lasso penalty to obtain edge sets of the graphs, and a maximum likelihood refit for estimating the nonzero entries in the inverse covariance matrices. We establish consistency of the proposed method for sparse high-dimensional Gaussian graphical models and examine its performance using simulation experiments. An application to a climate data set is also discussed.

• Speaker: Shuangge Ma, Yale University

Title: Promote similarity in integrative analysis

Abstract: For multiple high-dimensional problems, it is desired to conduct the integrative analysis of multiple independent datasets. Under a few important scenarios, it can be expected that the estimates of multiple datasets are similar in certain aspects, which may include magnitude, sparsity structure, sign, and others. The existing approaches do not have a mechanism promoting such similarity. In our study, we conduct the integrative analysis of multiple independent datasets. Penalization techniques are developed to explicitly promote similarity. The consistency properties are rigorously established. Numerical studies, including simulation and data analysis, show that the proposed approach has significant advantages over the existing benchmark.

• Speaker: Vahid Partovi Nia, Ecole Polytechnique de Montreal

Title: Bayesian Clustering of Data and Dimensions

Abstract: Clustering, or unsupervised learning, is one of the frequently used exploratory techniques to uncover data patterns. Including a large number of unnecessary dimensions (or attributes) affects most pattern recognition algorithms negatively. As a remedy, informative dimensions are often selected, or data are projected into a smaller number of dimensions. Our approach is a bit different: we suggest grouping subjects and dimensions at the same time, called bi-clustering. Hierarchical clustering is one of the most popular pattern recognition techniques, because it produces a dendrogram, a visual guide to data clusters with different numbers of groups. We generalize the common hierarchical clustering algorithms to hierarchical bi-clustering, to increase the precision of the estimated groupings in the presence of correlated or noise dimensions. A model-based bi-clustering gives a better statistical understanding of biclusters; therefore a model-based discrepancy measure such as the Ward linkage looks more appropriate. We make a bridge between the Ward linkage and a Bayesian model to produce scalable hierarchical bi-clustering algorithms and treat large data.

• Speaker: L. Leticia Ramirez-Ramirez, Centro de Investigacion en Matematicas (CIMAT)

Title: Semiparametric Trend estimation of multivariate time series with controlled smoothness

Coauthor(s): Victor Guerrero, Alejandro Islas-Camargo

Abstract: We present a filtering technique to estimate trends of multivariate time series. This method is based on a vector signal-plus-noise representation of Penalized Least Squares that requires only the first two sample moments, and introduces an index of smoothness. This index allows setting in advance a desired amount of smoothness to achieve. It is also a function of the correlation between the noises of the series and the sample size. Our proposal arises from a statistical solution to a multivariate GLS problem. Such a solution leads to an index of smoothness that is applicable in the general multivariate case, but we pay special attention to the bivariate situation. Here we show the closed-form expressions for calculating trend estimates with their corresponding variance-covariance matrices, and present the proposed algorithm for smoothing bivariate time series. We discuss the results on simulated data and a real application.

• Speaker: Ali Shojaie, University of Washington

Title: Network Reconstruction From High Dimensional Ordinary Differential Equations

Abstract: We consider the task of learning a dynamical system from high-dimensional time-course data. For instance, we might wish to estimate a gene regulatory network from gene expression data measured at discrete time points. We model the dynamical system non-parametrically as a system of additive ordinary differential equations. Most existing methods for parameter estimation in ordinary differential equations estimate the derivatives from noisy observations. This has been shown to be challenging and inefficient. We propose a novel approach that does not involve derivative estimation. We show that the proposed method can consistently recover the true network structure even in high dimensions, and we demonstrate empirical improvement over competing approaches.

• Speaker: Ekaterina Smirnova, University of Wyoming

Title: Network Connectivity Based Filtering of Microbiome Data

Coauthor(s): Farhad Jafari, Snehalata Huzurbazar

Abstract: The Human Microbiome Project (HMP) is a large-scale nationwide study that utilizes next generation sequencing technology (NGS) to investigate the relationships between the human microbiota composition, diet and health status. Fragments of DNA sequences obtained in these experiments are classified at a species level, and typically referred to as species or taxa. One particular characteristic of these studies is that the data are often quite sparse but collected on a large number of variables, many of which are possible contaminants. To remove possible contaminants, a data normalization step, known in the microbiome literature as filtering, is applied prior to analysis. Currently there is neither a consensus on the filtering criteria used, nor an evaluation of the loss due to filtering. We propose a taxa co-presence network based data normalization method that removes extremely rare taxa and evaluates the loss due to filtering.

• Speaker: Matt Taddy, Microsoft Research and Chicago Booth

Title: Machine learning in the presence of instrumental variables

Abstract: Machine learning is very good at solving reduced-form prediction problems: forecasting future response values when model inputs come from some stationary distribution. In this work, we consider directing ML tools instead towards causal inference problems. In particular, we describe how the presence of instrumental variables allows us to break the causal inference task into a series of nonparametric prediction problems. Examples include analysis of public-works contracts and of on-line marketing transactions.

• Speaker: Jabed H. Tomal, University of Toronto

Title: Construction of Ensemble by Exploiting Richness of Feature Variables in High-Dimensional Data with Applications in Protein Homology

Coauthor(s): William J. Welch and Ruben H. Zamar, The University of British Columbia

Abstract: High-dimensional data often contain a large number of observations and feature variables. In this work, we have developed a model which uses the richness of information present in the large number of feature variables in high-dimensional data to predict a response. The proposed model, which is an aggregated collection of logistic regression models (LRM), is called an ensemble, where each constituent LRM is fitted to a subset of feature variables. An algorithm is developed to cluster the feature variables into subsets in a way that the variables in a subset appear to be good to put together in an LRM, and the variables in different subsets appear to be good in separate LRMs. The strength of the ensemble depends on the algorithm's ability to identify strong and diverse subsets of useful feature variables present in high-dimensional data. We named each subset of variables a phalanx, and the resulting ensemble an ensemble of phalanxes. Homologous proteins are considered to have a common evolutionary origin. To produce an evolutionary sequence of proteins, a scientist needs to predict their biological homogeneity. The proposed model has been applied to predict biological homogeneity of proteins using feature variables obtained from the similarity search between a candidate protein and a target protein. The underlying assumption is that the structural similarity of proteins relates to their biological homogeneity. Considering the scarcity of homologous proteins, the prediction performance of a model is evaluated by checking its ability to rank/sequence rare homologous proteins ahead of the non-homologous proteins. The protein homology data are obtained from the 2004 KDD Cup website. While the prediction performance of an ensemble of phalanxes is competitive with contemporary state-of-the-art ensembles and the winning procedures of the 2004 KDD Cup competition, a further improvement in prediction performance is achieved by aggregating two diverse ensembles of phalanxes obtained from optimizing two complementary evaluation metrics. Through parallel computation, the proposed ensemble is shown to be computationally efficient as well.

• Speaker: Anand N. Vidyashankar, George Mason University

Title: Trade-off between Efficiency and Robustness in Post-Model Selection Inference

Abstract: It is common practice in high-dimensional data analysis that a model selection is first performed and then inference is carried out using the selected model presuming that the chosen model is the true model; that is, without accounting for model selection uncertainty. Recently, methods such as clean and screen are being used to account for model selection uncertainty. However, the robustness and the efficiency properties of the resulting statistical procedures are largely unknown. In this presentation, we provide a systematic account of efficiency and robustness properties of post-selection estimators. In the process we address some foundational questions concerning the role of moderate deviation theory in the study of statistical efficiency and robustness and their trade-offs.

• Speaker: Gokhan Yildirim, York University

Title: On Lasso trade-off diagram under correlated random design matrix

Coauthor(s): E. Ahmed, B. Yuzbasi and H. Kim.

Abstract: It is usually assumed by data scientists who very often use Lasso that when signals are sufficiently strong compared to the noise level, a properly tuned Lasso selects all true variables before any null variable is selected. We consider a linear regression model with a correlated random design matrix under various sparsity settings, and present some simulation results on the trade-off between false positive and true positive rates along the Lasso path.

• Speaker: Bahadir Yuzbasi, Inonu University

Title: Stein-Type Generalized Ridge Regression in High-Dimensional Sparse Models

Coauthor(s): S. Ejaz Ahmed

Abstract: In this study, we suggest shrinkage estimation strategies based on generalized ridge regression in high-dimensional sparse models when the design matrix is ill-conditioned. The performance of the proposed estimators is compared with the usual ridge regression and some penalty estimators through Monte Carlo simulation. Finally, a real data example is given to illustrate the usefulness of the suggested estimators.

• Speaker: Hossein Zareamoghaddam, Western University

Title: An Efficient Approach for Some Semi-Nonparametric Models Applicable to Mass-Spectrometry Data

Coauthor(s): 1. Ejaz Ahmed - Brock University, 2. Serge Provost - Western University

Abstract: Modeling mass-spectrometry data is important to identify and characterize hundreds of thousands of proteins or molecules per experiment. The large volume of such data from a typical mass-spectrometry experiment needs heavy computations and storage. In this work, a semi-nonparametric regression model, which consists of linear parametric components for individual location and scale as well as a nonparametric regression function for the common shape, is considered. Our approach gives accurate estimations for both the parametric and nonparametric components. Then, some shrinkage and pre-test techniques are applied to improve the parametric components. To demonstrate the effectiveness of this approach, it is applied to SELDI-TOF mass-spectrometry data collected from a study on liver cancer patients.

• Speaker: Jiwei Zhao, State University of New York at Buffalo

Title: Variable selection in the presence of nonignorable missing data

Coauthor(s): Yang Yang, State University of New York at Buffalo; Yang Ning, Princeton University

Abstract: Variable selection methods have been well developed for completely observed data sets over the past two decades. In the presence of missing values, those methods need to be tailored to different missing data mechanisms. In this paper, we focus on a flexible and generally applicable missing data mechanism, which contains both ignorable and nonignorable missing data mechanism assumptions. We show how the regularization approach for variable selection can be adapted to the situation under this missing data mechanism. The computational and theoretical properties for variable selection consistency are established. The proposed method is further illustrated by comprehensive simulation studies, for both low and high dimensional settings.

• Speaker: Ping-Shou Zhong, Michigan State University

Title: Homogeneity test of covariance matrices and change-points identification with high dimensional longitudinal data

Coauthor(s): Runze Li, Penn State University

Abstract: High-dimensional longitudinal data such as time-course microarray data are now widely available. One important feature of such data is that, for each individual, high-dimensional measurements are repeatedly collected over time. Moreover, these measurements are spatially and temporally dependent which, respectively, refer to dependence within each particular time point and among different time points. This paper focuses on testing the homogeneity of covariance matrices of high-dimensional measurements over time against the change-point type alternatives. We allow the dimension of measurements (p) to be much larger than the number of individuals (n). Specifically, a test statistic for the equivalence of covariance matrices is proposed and the asymptotic normality is established. In addition to testing, an estimator for the location of the change point is given whose rate of convergence is established and shown to depend on p, n and the signal-to-noise ratio. The proposed method is extended to locate multiple change points by applying a binary segmentation approach, which is shown to be consistent under some mild conditions. The proposed testing procedure and change-point identification methods are able to accommodate both spatial and temporal dependences. Simulation studies and an application to a time-course microarray data set are presented to demonstrate the performance of the proposed method.
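
Illustration (not part of the abstract): binary segmentation can be sketched on a much simpler problem than the one treated in the talk. The sketch below looks for shifts in the mean of a univariate sequence using a CUSUM-type statistic, rather than changes in high-dimensional covariance matrices. The function names and the fixed `threshold` argument are hypothetical; in practice a threshold would come from the null distribution of the test statistic, which is not reproduced here.

```python
from math import sqrt


def cusum_split(x, lo, hi):
    """Best single mean-change candidate in x[lo:hi].

    Returns (statistic, k), where k is the first index of the right
    segment, or (0.0, None) if no split has a positive statistic.
    """
    n = hi - lo
    total = sum(x[lo:hi])
    best_stat, best_k = 0.0, None
    left = 0.0
    for k in range(lo + 1, hi):
        left += x[k - 1]
        m = k - lo  # length of the left segment
        # Scaled absolute difference between left and right segment means.
        stat = abs(left / m - (total - left) / (n - m)) * sqrt(m * (n - m) / n)
        if stat > best_stat:
            best_stat, best_k = stat, k
    return best_stat, best_k


def binary_segmentation(x, threshold):
    """Recursively split segments whose best statistic exceeds threshold."""
    found = []
    stack = [(0, len(x))]
    while stack:
        lo, hi = stack.pop()
        if hi - lo < 2:
            continue
        stat, k = cusum_split(x, lo, hi)
        if k is not None and stat > threshold:
            found.append(k)
            # Search both sub-segments for further change points.
            stack.append((lo, k))
            stack.append((k, hi))
    return sorted(found)


series = [0] * 10 + [5] * 10 + [0] * 10
change_points = binary_segmentation(series, threshold=3.0)
```

The same recursive structure applies whatever single-change-point test is used on each segment; only `cusum_split` would need to be replaced.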

Abstracts: Posters

• Presenter: Elnaz Bigdeli, Global Technology Services, Deloitte

Title: Adaptive Density-Based Spatial Clustering of High-Dimensional Data

Coauthor(s): Arman Didandeh

Abstract: Having received the 2014 SIGKDD Test of Time Award, DBSCAN can be labelled as one of the most substantially applicable clustering algorithms, mainly due to its ability to find clusters with arbitrary shapes. However, a major challenge in using DBSCAN is the specification of its two sensitive parameters, namely the maximum neighborhood radius, and the minimum number of points in the neighborhood. This possibly leads to sub-clusters that are in the vicinity of one another, as well as too many false outliers. One reason can be the fixed neighborhood radius value. To tackle this issue, we propose an adaptive radius solution. In a toy 2-dimensional setting, our algorithm Adaptive DBSCAN (AdaSCAN) initiates on each point with a small initial neighborhood radius, and then rewards each point with either a positive or negative reward, depending on the number of points in its neighborhood. Then the radii are increased by a given value, until each point has at least m neighbors. The algorithm then uses the reward value of points, their final radii, and the distance to the closest neighbors to calculate circular vicinities, instead of neighborhoods. Points are linked to one another to form clusters if their vicinities have intersections. The AdaSCAN algorithm is easily and without loss of generality expandable to higher dimensions, when using hyperspheres instead of circles to define vicinities.

• Presenter: Arman Didandeh, Western University

Title: Boosted DBSCAN for Outlier Detection in High-Dimensional Data Spaces

Coauthor(s): Elnaz Bigdeli

Abstract: DBSCAN, as the recipient of the 2014 SIGKDD Test of Time Award, is among the most substantially applicable clustering algorithms. Upon partially perfect choices of its parameters, namely (1) maximum neighborhood radius and (2) minimum number of points in the neighborhood, DBSCAN is capable of finding clusters with arbitrary shapes, as well as outliers and anomalies, specifically in high dimensional data. However, these outliers are highly sensitive to the choice of parameters. To overcome this imperfection, we propose the use of an ensemble model to obtain a better performance in an outlier detection task, through aggregating the power of alternative DBSCAN models through a voted outcome. More specifically, using a variety of weaker DBSCAN models can help reduce both bias and variance in a boosted ensemble model. We propose a Boosted ensemble DBSCAN (B-DBSCAN) algorithm that utilizes the outlier detection potential of different DBSCAN models, constructed over varying parameters. We advocate for limiting the number of models through specifying a range for model parameters, to make B-DBSCAN computationally tractable. We randomly sample pairs of DBSCAN parameters from an expanded partially-perfect range to create weak models. In order to adjust the ensemble model to possible bias of the partially-perfect range, we introduce a random noise in terms of parameter selection through the range expansion procedure. We use the standard majority voting procedure, which is used in the Bagging algorithm, to detect the outliers. In addition, we introduce a Bayes optimal voting procedure that assigns weighted votes to each weak learner based on the likelihood of the parameters being sampled from a hypothetical perfect parameter space.
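
Illustration (not part of the abstract): the majority-voting step can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' B-DBSCAN: the function names, the parameter ranges, and the number of models are hypothetical, and the Bayes optimal weighted vote is not included. A point counts as an outlier for one model if DBSCAN would label it noise, that is, if it is neither a core point nor within the radius of a core point.

```python
import random
from math import dist


def dbscan_noise(points, eps, min_pts):
    """Indices that DBSCAN with (eps, min_pts) would label as noise."""
    n = len(points)
    # Each point's eps-neighborhood includes the point itself.
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    core = {i for i in range(n) if len(neighbors[i]) >= min_pts}
    # Noise: not a core point and not within eps of any core point.
    return {
        i for i in range(n)
        if i not in core and not any(j in core for j in neighbors[i])
    }


def voted_outliers(points, eps_range, min_pts_range, n_models=25, seed=0):
    """Flag points that a strict majority of randomly tuned models call noise."""
    rng = random.Random(seed)
    votes = [0] * len(points)
    for _ in range(n_models):
        # Sample one (eps, min_pts) pair per weak model from the given ranges.
        eps = rng.uniform(*eps_range)
        min_pts = rng.randint(*min_pts_range)
        for i in dbscan_noise(points, eps, min_pts):
            votes[i] += 1
    return [i for i, v in enumerate(votes) if 2 * v > n_models]


# A 3-by-3 unit grid (indices 0 to 8) plus one distant point (index 9).
grid = [(float(a), float(b)) for a in range(3) for b in range(3)]
flagged = voted_outliers(grid + [(10.0, 10.0)], (1.0, 1.5), (3, 4))
```

Replacing the unweighted count in `votes` with per-model weights would be the natural place to add a weighted voting rule.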

• Presenter: Mihai Giurcanu, University of Florida

Title: Thresholding Least-Squares Inference in High Dimensional Regression Models

Abstract: We propose a thresholding least-squares method of inference for high-dimensional regression models when the number of parameters, p, tends to infinity with the sample size, n. Extending the asymptotic behavior of the F-test in high dimensions, we show the consistency and efficiency of the thresholding least-squares estimators when p = o(n). We propose two automatic thresholding parameter selection procedures using Scheffé's and Bonferroni's methods. We show that, under additional regularity conditions, the results continue to hold even if p = exp(o(n)). Lastly, we show that, if properly centered, the residual-bootstrap estimator of the distribution of thresholding least-squares estimator is consistent, while a naive bootstrap estimator is inconsistent. In an empirical study, we assess the finite sample properties of the proposed methods for various sample sizes and model parameters. The analysis of a real-world data set illustrates an application of the methods in practice.

• Presenter: Kusha Nezafati, UT Dallas

Title: Bootstrap Methods for Uncertainty Quantification of Network Density Estimates

Coauthor(s): Yuzhou Chen, Slava Lyubchich, Yulia R. Gel

Abstract: Large-scale complex networks are widely used in a variety of applications, from online social media to financial systems to food webs. Despite a vast literature on graph sampling for estimating network properties, very little is known on how to reliably assess associated estimation errors without imposing restrictive parametric assumptions on graph data, while maintaining computational efficiency. In this poster, we discuss two bootstrap methods that allow quantification of estimation errors for network densities and evaluate their finite sample performance and scalability.

• Presenter: Dengdeng Yu, University of Alberta

Title: Partial and Tensor Quantile Regressions in Functional Data Analysis

Coauthor(s): Linglong Kong, Ivan Mizera

Abstract: In the functional linear quantile regression model, we are interested in how to effectively and efficiently extract the bases for estimating functional coefficients. Therefore, we propose a prediction procedure using partial quantile covariance techniques to extract the functional bases effectively by sequentially maximizing the partial quantile covariance between the response and projections of functional covariates. Moreover, we develop an efficient algorithm for the procedure. Under the homoscedasticity assumption, we further extend our method to functional composite quantile regression by using the composite quantile covariance, and obtain the corresponding algorithm. In the functional linear quantile regression model, the functional coefficients may have multidimensional structure. To make efficient predictions without losing the structure information, we also propose a prediction procedure using tensor linear quantile regression. In addition, simulations and real data are studied to show the superiority of our proposed methods.
