
  • Exploratory analysis for high dimensional extremes: Support identification, Anomaly detection and clustering, Principal Component Analysis.

    Anne Sabourin¹

    + many others: Maël Chiapino (PhD), Stephan Clémençon (Télécom Paris), Holger Drees (U. Hamburg), Vincent Feuillard (Airbus), Nicolas Goix (PhD), Johan Segers (UC Louvain)

    ¹ Télécom Paris, Institut polytechnique de Paris, France.

    Chair Stress test, École Polytechnique, BNPP, 2020/04/02

  • Outline

    Introduction: dimension reduction for multivariate extremes

    Sparsity in multivariate extremes (Goix et al., 2016, 2017; Chiapino et al., 2016, 2019a)

    Application: extremes/anomalies clustering (Chiapino et al., 2019b)

    Principal Component Analysis for extremes (S. and Drees, 20++)

  • Motivation(s)

    • Multivariate heavy-tailed random vector X = (X1, . . . , Xd)
      (e.g. spatial field (temperature, precipitation), asset (negative) prices, . . . )

    • Focus on the distribution of the largest values: Law(X | ‖X‖ > t), t ≫ 1, with P(‖X‖ > t) small.

      Possible goals: simulation (stress test), anomaly detection (preprocessing) among extreme values, . . .

    • d ≫ 1: modeling Law(X | ‖X‖ > t) is unfeasible.

    • Dimension reduction problem(s):

      1. Identify the groups of features α ⊂ {1, . . . , d} which may be large together (while the others stay small), given that one of them is large.

      2. Identify a single low-dimensional projection subspace V0 such that Law(X | ‖X‖ > t) is approximately concentrated on V0.


  • Examples: It cannot rain everywhere at the same time

    [Figures: daily precipitation; air pollutants]

  • Applications to risk management

    Sensor networks (road traffic, river streamflow, temperature, internet traffic, . . . ) or financial asset prices:

    → extreme event = traffic jam, flood, heatwave, network congestion, falling price

    → question: which groups of sensors / assets are likely to be jointly impacted?

    → how to define alert regions (alert groups of features/components)?

    spatial case: one feature = one sensor

  • Applications to anomaly detection

    • Training step: learn a 'normal region' (e.g. approximate support)

    • Prediction step (with new data): anomalies = points outside the 'normal region'

    If 'normal' data are heavy tailed, Abnormal ⇎ Extreme: there may be extreme 'normal' data.

    How to distinguish between large anomalies and normal extremes?


  • Standardized data

    • Random vectors X = (X1, . . . , Xd), Xj ≥ 0

    • Margins: Xj ∼ Fj, 1 ≤ j ≤ d (continuous).

    • Preliminary step: standardization (here: to Pareto margins): Vj = 1 / (1 − Fj(Xj)), so that P(Vj > v) = 1/v.

    • Goal: P(V ∈ A), A 'far from 0'?

    • Each component j is homogeneous (order −1): for all t > 0, P(Vj ∈ tA) / P(Vj > t) = P(Vj ∈ A).
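
    A minimal sketch of this standardization step, using the empirical margins appearing later in the deck (F̂j(Xi,j) = (rank(Xi,j) − 1)/n); the function name and the use of numpy are illustrative assumptions, not the authors' code.

    import numpy as np

    def pareto_standardize(X):
        """X: (n, d) array. Returns V_hat with approximately unit-Pareto margins."""
        n = X.shape[0]
        ranks = np.argsort(np.argsort(X, axis=0), axis=0) + 1   # marginal ranks in 1..n
        F_hat = (ranks - 1) / n                                  # empirical cdf evaluated at X_ij
        return 1.0 / (1.0 - F_hat)                               # V_hat_ij = 1 / (1 - F_hat_j(X_ij)), in [1, n]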

  • Multivariate extremes: regular variation

    • Informally: the marginal homogeneity property remains valid in the multivariate sense.

    • A random vector V = (V1, . . . , Vd) ∈ Rd is regularly varying if there exists a limit measure µ such that

      P(V ∈ tA) / P(‖V‖ > t) → µ(A) as t → ∞,

      for A ⊂ Rd with 0 ∉ closure(A) and µ(∂A) = 0.

    • Necessarily µ is homogeneous: µ(rA) = r^−α µ(A), for some α > 0 (tail index).

    • With Vj = 1 / (1 − Fj(Xj)), necessarily α = 1.

  • Multivariate extremes: regular variation (Cont'd)

    • µ rules the (probabilistic) behaviour of extremes: if A is far from the origin,

      P(V ∈ A) ≈ µ(A)

    • Examples: max-stable vectors with standardized margins, multivariate Student, . . .

    • Statistical procedures based on Extreme Value Theory: 2 steps.

      1. Learn useful features of µ using the k observations V(1), . . . , V(k) with largest norm, with k ≪ n, the number of available data.

      2. Use the approximation P(V ∈ A) ≈ µ(A) for A far from 0.

  • Angular measure

    • Homogeneity of µ → polar coordinates are convenient: r = ‖x‖ (any norm); θ = r^−1 x

    • Angular measure Φ on the corresponding sphere: Φ(B) = µ{r > 1, θ ∈ B}.

    • Then µ decomposes as a product, and only Φ needs to be estimated:

      µ{r > t, θ ∈ B} = t^−α Φ(B)

  • Angular measure

    • Φ rules the joint distribution of extremes

    • Asymptotic dependence: (V1,V2) may be large together.

    vs

    • Asymptotic independence: only V1 or V2 may be large.

    No assumption on Φ: non-parametric framework.


  • Outline

    Introduction: dimension reduction for multivariate extremes

    Sparsity in multivariate extremes (Goix et al., 2016, 2017; Chiapino et al., 2016, 2019a)

    Application: extremes/anomalies clustering (Chiapino et al., 2019b)

    Principal Component Analysis for extremes (S. and Drees, 20++)

  • Towards high dimension

    • Reasonable hope: only a moderate number of the Vj's may be simultaneously large → sparse angular measure

    • Our goal from a MEVT point of view:

      Estimate the (sparse) support of the angular measure (i.e. the dependence structure).

      Which components may be large together, while the others are small?

  • Sparse angular support

    Full support: anything may happen.  Sparse support: V1 not large if V2 or V3 is large.

    Where is the mass?

    Subcones of Rd+: Cα = {x ⪰ 0 : xj > 0 (j ∈ α), xj = 0 (j ∉ α), ‖x‖ ≥ 1}, for α ⊂ {1, . . . , d}.

  • Support recovery + representation

    • {Cα, α ⊂ {1, . . . , d}}: partition of {x : ‖x‖ ≥ 1}

    • Goal 1: learn the (2^d − 1)-dimensional representation (potentially sparse)

      M = ( µ(Cα) )_{α ⊂ {1,...,d}, α ≠ ∅} ;   support S = {α : µ(Cα) > 0}.

    Main interest:

      µ(Cα) > 0 ⟺ features j ∈ α may be large together while the others are small.

  • Identifying non empty edges

    Issue: real data are non-asymptotic: Vj > 0.

    Cannot just count data on each subcone: only the largest-dimensional one has empirical mass!

  • Identifying non empty edges

    Fix ε > 0. Assign data ε-close to an edge to that edge: Cα → Rεα.

    → New partition of the input space, compatible with non-asymptotic data.

  • Empirical estimator of µ(Cα)

    (Counts the standardized points in Rεα, far from 0.)

    Data: Xi, i = 1, . . . , n, with Xi = (Xi,1, . . . , Xi,d).

    • Standardize: V̂i,j = 1 / (1 − F̂j(Xi,j)), with F̂j(Xi,j) = (rank(Xi,j) − 1) / n

    • Natural estimator:

      µ̂n(Cα) = (n/k) Pn( V̂ ∈ (n/k) Rεα )   →   M̂ = ( µ̂n(Cα), α ⊂ {1, . . . , d} )

    • Estimated support Ŝ = {α : µ̂n(Cα) > µ0}.
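
    A hedged sketch of this estimator with the thickened subcones Rεα and the sup-norm (as in the error bound below): a point V̂ with ‖V̂‖∞ > n/k is assigned to α = {j : V̂j > (n/k)ε}. Function names are illustrative, not the authors' code.

    import numpy as np
    from collections import Counter

    def subcone_masses(V_hat, k, eps=0.1):
        """V_hat: (n, d) standardized data. Returns {alpha: mu_hat_n(C_alpha)}."""
        n = V_hat.shape[0]
        t = n / k                                        # radial threshold
        extreme = V_hat[V_hat.max(axis=1) > t]           # points with ||V_hat||_inf > n/k
        masses = Counter()
        for v in extreme:
            alpha = tuple(np.where(v > t * eps)[0])      # coordinates above (n/k) * eps
            masses[alpha] += 1.0 / k                     # each extreme point contributes (n/k) * (1/n)
        return dict(masses)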

  • Sparsity in real datasets

    Data: 50 wave-direction variables from buoys in the North Sea (Shell Research; thanks to J. Wadsworth).

  • Finite sample error bound

    VC-bound adapted to low probability regions (see Goix, S., Clémençon, 2015)

    Theorem

    If the margins Fj are continuous and if the density of the angular measure is bounded by M > 0 on each subface (infinity norm), there is a constant C such that for any n, d, k, any δ ≥ e^−k and ε ≤ 1/4, with probability ≥ 1 − δ,

      max_α | µ̂n(Cα) − µ(Cα) | ≤ Cd ( √( (1/(kε)) log(d/δ) ) + Mdε ) + Bias_{n/k, ε}(F, µ).

    Bias: using non-asymptotic data to learn about an asymptotic quantity;

      regular variation ⟺ Bias_{t,ε} → 0 as t → ∞.

    • Existing literature (d = 2): rate 1/√k.

    • Here: 1/√(kε) + Mdε. Price to pay for biasing the estimator with ε.

      OK if kε → ∞ and ε → 0. Choice of ε: cross-validation or 'ε = 0.1'.

  • Tools for the proof

    1. VC inequality for small probability classes (Goix et al., 2015)

       → max deviations ≤ √p × (usual bound)

    2. Apply it to the VC-class of rectangles { (n/k) R(x, z, α), x, z ⪰ ε }

       → p ≤ d k / (nε)

       ⇒ sup_α | µ̂n − µ |(Rεα) ≤ Cd √( (1/(εk)) log(d/δ) )

    3. Approximate µ(Cα) by µ(Rεα) → error ≤ Mdε (bounded angular density).

  • Algorithm DAMEX (Detecting Anomalies with Multivariate Extremes) (Goix, S., Clémençon, 2016)

    Anomaly = new observation 'violating the sparsity pattern': observed in an empty or light subcone.

    Scoring function: for x such that v̂ ∈ (n/k) Rεα,

      sn(x) = (1/‖v̂‖) µ̂n(Rεα) ≈ P( V ∈ Cα, ‖V‖ > ‖v̂‖ ) for large x.
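
    A hedged sketch of this scoring rule, reusing the illustrative subcone_masses helper above: a new (standardized) point is scored by the estimated mass of the subcone it falls into, divided by its norm, so points in empty or light subcones get a low 'normality' score.

    import numpy as np

    def damex_score(v_hat_new, masses, k, n, eps=0.1):
        """v_hat_new: (d,) standardized point; masses: dict {alpha: mu_hat_n}. Returns s_n(x)."""
        t = n / k
        alpha = tuple(np.where(v_hat_new > t * eps)[0])   # subcone R^eps_alpha containing the point
        return masses.get(alpha, 0.0) / v_hat_new.max()   # (1 / ||v_hat||_inf) * mu_hat_n(R^eps_alpha)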

  • Extension: feature clustering (Chiapino, S., 2016; Chiapino, S., Segers, 2019a)

    • Motivating example: river stream-flow dataset, d = 92 gauging stations.

    • Typical groups jointly impacted by extreme records include noisy additional features!

      → Empirical µ-mass scattered over many Cα's

      → No apparent sparsity pattern.

    • How to gather 'nearby' α's into feature clusters? → 'robust' version of DAMEX: the CLEF algorithm and variants + asymptotic analysis.

  • Conclusion I

    • Discovering subgroups of components that are likely to be simultaneously large is doable, with an error scaling as 1/√k, where k is the number of extreme observations.

    • Two algorithms:

      • DAMEX: easy to implement, linear complexity O(dn log n), not very robust to weak signals / noisy dependence structure.

      • CLEF: a bit more complex (graph mining, but existing python packages can help), complexity acceptable only if the dependence graph is sparse, but more robust.

    • Statistical guarantees: non-asymptotic (DAMEX) and asymptotic (CLEF).

    • Open questions: optimal choice of tuning parameters (cross-validation is common practice but there is no theory in an extreme value context).

  • Outline

    Introduction: dimension reduction for multivariate extremes

    Sparsity in multivariate extremes (Goix et al., 2016, 2017; Chiapino et al., 2016, 2019a)

    Application: extremes/anomalies clustering (Chiapino et al., 2019b)

    Principal Component Analysis for extremes (S. and Drees, 20++)

  • Application: clustering extreme data (Chiapino et al., 2019b)

    • Context: monitoring a high dimensional system (e.g. air flight data from Airbus, 82 parameters, 18 000 obs), where extremes are of particular interest (associated with anomalies / risk regions).

    • Naive idea: use the list of dependent maximal subgroups {αk, k ≤ K} issued from DAMEX/CLEF and the corresponding rectangles of the kind

      tRεα = {x ∈ Rd : xj > tε for j ∈ α, xj < tε for j ∉ α, ‖x‖ > t}

    • Issue: in practice many data points fall outside the tRεα's. How to assign them to a cluster?

  • Mixture model for extremes

    • See the dependent subgroups αk ⊂ {1, . . . , d}, k ≤ K, issued from DAMEX/CLEF as components of a mixture model.

    • Zk: hidden indicator variable of component k. Conditionally on {‖V‖ > r0, Zk = 1},

      V = Vk + εk = Rk Wk + εk,   (1)

      where

      • Vk ∈ Cαk,

      • εk ∈ C⊥αk, with i.i.d. Exponential components,

      • Rk = ‖Vk‖ ∼ Pareto(1),

      • Wk = Rk^−1 Vk ∈ Sαk ∼ Φk (angular measure restricted to the k-th face: Dirichlet distribution with the L1 norm).

    A simulation sketch for one component is given below.
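
    A minimal simulation sketch for one mixture component as in (1): Rk ∼ Pareto(1), Wk ∼ Dirichlet on the face Sαk (L1 norm), independent Exponential noise on the remaining coordinates. The parameter values (dirichlet_param, noise_scale) are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_component(alpha_k, d, size, dirichlet_param=3.0, noise_scale=0.1):
        """alpha_k: list of active coordinates. Returns (size, d) samples of V."""
        V = np.zeros((size, d))
        R = 1.0 / rng.uniform(size=size)                                        # Pareto(1) radius R_k
        W = rng.dirichlet(dirichlet_param * np.ones(len(alpha_k)), size=size)   # angle W_k on the face
        V[:, alpha_k] = R[:, None] * W                                          # V_k = R_k W_k on C_{alpha_k}
        rest = [j for j in range(d) if j not in alpha_k]
        V[:, rest] = rng.exponential(noise_scale, size=(size, len(rest)))       # noise eps_k on C_perp
        return V

    samples = sample_component([0, 2, 5], d=8, size=1000)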

  • Model for the k-th mixture component

    V = Vk + εk = Rk Wk + εk,   (2)

    • Training the mixture model: EM algorithm.

  • Clustering extremes

    • After training: each extreme point Vi has probability pi,k of coming from mixture component k.

    • Similarity measure between Vi, Vj:

      si,j = P( Vi, Vj in the same component ) = Σ_{k=1}^{K} pi,k pj,k

      → Similarity matrix (Si,j) ∈ [0, 1]^{N×N}, where N is the number of extreme observations.

    • Clustering based on the similarity matrix using off-the-shelf techniques (e.g. spectral clustering); see the sketch below.
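
    A short sketch of this last step: build Si,j = Σk pi,k pj,k from the posterior membership probabilities and pass it to an off-the-shelf spectral clustering routine as a precomputed affinity. The (N, K) posterior array is assumed to come from the EM step; the number of clusters is an illustrative input.

    import numpy as np
    from sklearn.cluster import SpectralClustering

    def cluster_extremes(posterior, n_clusters):
        """posterior: (N, K) array of p_{i,k}. Returns cluster labels for the N extreme points."""
        S = posterior @ posterior.T                     # S_ij = sum_k p_ik p_jk, in [0, 1]
        model = SpectralClustering(n_clusters=n_clusters, affinity="precomputed")
        return model.fit_predict(S)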

  • Some results

    Table: Shuttle dataset (9 attributes, 7 classes): purity score, comparison with standard approaches for different extreme sample sizes n0.

                          n0 = 500   n0 = 400   n0 = 300   n0 = 200   n0 = 100
    Dirichlet mixture     0.80       0.82       0.82       0.84       0.85
    K-means               0.72       0.73       0.75       0.78       0.80
    Spectral clustering   0.78       0.77       0.82       0.81       0.80

  • Flights anomaly clustering + visual display

    Using standard visualization tools (Python package Networkx)

    [Figure: graph of the flight anomalies (about 300 nodes), coloured by spectral clusters]

  • Conclusion II

    • DAMEX/CLEF output can be used to perform clustering of extremes.

    • Mixture modelling: { mixture components } = output of DAMEX/CLEF.

    • Additional layer: building a similarity matrix to perform clustering.

    • Question: what if the angular measure is concentrated on an 'oblique' subspace of the central subsphere (only particular linear combinations of all components are likely to be large)? Then CLEF/DAMEX fail, because all the mass is in the central subsphere and no particular structure can be discovered.

      → Idea: perform some sort of PCA on extreme data.

  • Outline

    Introduction: dimension reduction for multivariate extremes

    Sparsity in multivariate extremes (Goix et al., 2016, 2017; Chiapino et al., 2016, 2019a)

    Application: extremes/anomalies clustering (Chiapino et al., 2019b)

    Principal Component Analysis for extremes (S. and Drees, 20++)

  • PCA for extremes: context and motivation

    • (X1, . . . , Xd): a multivariate random vector with tail index α > 0 and limit measure µ.

    • Motivating assumption (not necessary):

      Hypothesis 1

      The vector space S0 = span(supp µ) generated by the support of µ has dimension p < d.

    • Purpose of this work: recover S0 from the data, with guarantees concerning the reconstruction error.

  • Motivating assumption: interpretation

    dim(S0) = p < d ; S0 = span(supp(µ))

    ⇐⇒

    Certain linear combinations are much likelier to be large than others.


  • Dimension reduction in EVT: quick overview

    • Looking for multiple subspaces where µ concentrates:

      • Chautru, 2015 (clustering + principal nested spheres)

      • Goix et al., 2016, 2017; Chiapino et al. (space partitioning); Simpson et al., 20++ (relaxing the partition)

      • Engelke & Hitz, 20++: graphical models

      • K-means clustering: Janssen & Wan, 20++

    • Dimension reduction in regression analysis: Gardes, 2018.

    • PCA on a transformed version of the data: Cooley & Thibaud (20++)

  • Heavy-tailed scarecrow against using PCA for extremes

    • 'Classical dimension reduction tools such as PCA fail for multivariate extremes because they require the existence of second moments.'

    • Possible answer: since µ is homogeneous, what matters is the angular component.

    Proposed method for recovering the support of µ

    • Perform PCA on the angular data (or on any rescaled version of the data with enough moments) corresponding to the observations with largest norm.

    • The first eigenvectors of the rescaled empirical covariance matrix provide an estimate of S0 = span(supp(µ)).
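
    A hedged sketch of this proposal: keep the k observations with largest norm, rescale them to the unit sphere (Θ = X/‖X‖), form the empirical second-moment matrix of the angles, and take the span of its leading eigenvectors as an estimate of S0. Function and variable names are illustrative.

    import numpy as np

    def extreme_pca(X, k, p):
        """X: (n, d) data, k: number of extremes, p: target dimension. Returns a (d, p) basis of S0_hat."""
        norms = np.linalg.norm(X, axis=1)
        idx = np.argsort(norms)[-k:]                  # indices of the k largest norms
        Theta = X[idx] / norms[idx, None]             # angular observations, ||Theta|| = 1
        Sigma_hat = Theta.T @ Theta / k               # empirical second-moment matrix
        eigval, eigvec = np.linalg.eigh(Sigma_hat)    # eigenvalues in increasing order
        return eigvec[:, ::-1][:, :p]                 # leading p eigenvectors span S0_hat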

  • Toy example

    [Figure: two-dimensional toy sample of a heavy-tailed vector; the subspace V_0 is indicated]

  • Toy example, proposed method

    [Figures: the proposed method applied to the toy sample, showing the rescaled extreme observations and the recovered subspace]

  • Empirical Risk Minimization setting

    • ‖ · ‖: Euclidean norm.

    • ΠS (resp. Π⊥S): orthogonal projection operator onto the linear space S (resp. S⊥).

    • Rescaled observations: Θ = θ(X) = ω(X) · X,

      • ω : Rd → R+ a suitable scaling function (think ω(x) = 1/‖x‖; variants allowed such that E(‖Θ‖²) < ∞).

    • Risk above level t: Rt(S) = E( ‖Π⊥S Θ‖² | ‖X‖ > t ).

    • Empirical counterpart:

      Rn,k(S) = (1/k) Σ_{i=1}^{k} ‖Π⊥S(Θ(i))‖²,

      with ‖X‖(1) ≥ . . . ≥ ‖X‖(n) the order statistics of the norm and Θ(i) the corresponding rescaled data.

  • Minimizing a risk ⟺ Diagonalizing a covariance matrix

    • Denote by Eq the set of all q-dimensional subspaces of Rd, 1 ≤ q ≤ d.

    • Σt = E( ΘΘ⊤ | ‖X‖ > t ): conditional second-moment matrix.

    • Standard fact from Principal Component Analysis: assume for simplicity that Σt has distinct eigenvalues, and let (u1, . . . , ud) denote the eigenvectors associated with the eigenvalues in decreasing order. Then

      arg min_{S ∈ Eq} Rt(S) = span(u1, . . . , uq).

    • Similarly,

      arg min_{S ∈ Eq} Rn,k(S) = span(u^n_1, . . . , u^n_q),

      where (u^n_j) are the eigenvectors of the empirical second-moment matrix of the Θ(i), i ≤ k.
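
    A small numerical sketch of this equivalence, useful for the risk plots used later to choose p̂: the empirical risk of the span of the top-q eigenvectors equals the sum of the remaining eigenvalues of the empirical second-moment matrix. Names are illustrative.

    import numpy as np

    def empirical_risks(Theta_k):
        """Theta_k: (k, d) rescaled extreme observations. Returns risks[q-1] = R_{n,k}(span(u_1, ..., u_q))."""
        k = Theta_k.shape[0]
        Sigma_hat = Theta_k.T @ Theta_k / k                     # empirical second-moment matrix
        eigval = np.linalg.eigvalsh(Sigma_hat)[::-1]            # eigenvalues in decreasing order
        return eigval.sum() - np.cumsum(eigval)                 # residual risk after keeping q components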

  • ERM setting: risk at the limit

    • Limit risk above extreme levels:

      R∞(S) := E∞ ‖Π⊥S Θ‖²,

      where E∞ is the expectation w.r.t. the limit conditional distribution

      P∞( · ) = lim_{t→∞} P( X ∈ t( · ) | ‖X‖ > t ) = µ( · ) / µ({x : ‖x‖ > 1}).

    • Hypothesis 1 (S0 = span(supp µ), dim(S0) = p) ⇒

      {S0} = arg min_{Ep} R∞,   R∞(S0) = 0,

      and for all S′ of dimension p′ < p, R∞(S′) > 0.

  • Questions

    • Is the empirical minimizer Ŝn of Rn,k consistent?

    • Uniform, non-asymptotic bounds on |Rn,k(S) − R_{t_{n,k}}(S)|? (classical goal in statistical learning)

    • Relevance for practical applications (improved performance for non-parametric estimation of the probability of failure regions)?

  • Convergence of minimizers of the true conditional risk

    • Scaling condition on the weight ω (→ second moments of Θ exist):

      ∃ β ∈ (1 − α/2, 1] : ∀λ > 0, x ∈ Rd : ω(λx) = λ^−β ω(x),

      with cω := sup_{‖x‖=1} ω(x).
  • Uniform risk bound

    • Stronger condition on ω: ω(x) ≤ 1/‖x‖ (thus ‖Θ‖ ≤ 1).

    • t_{n,k}: quantile of level 1 − k/n of ‖X‖.

    • St := E( ‖Θ‖⁴ | ‖X‖ > t ) − tr(Σt²);   Σt = E( ΘΘ⊤ | ‖X‖ > t ).

    Theorem 3 (Drees, S., 20++), simplified version

    With probability at least 1 − δ,

      sup_{S ∈ Ep} |Rn,k(S) − R_{t_{n,k}}(S)| ≤ [ (p ∧ (d − p)) S_{t_{n,k}} / k ]^{1/2} + [ (8/k)(1 + k/n) log(4/δ) ]^{1/2} + 4 log(4/δ) / (3k).

    (Variant of the bounded difference inequality (McDiarmid, 98) + arguments from Blanchard et al., 07.)

    • NB: the term St is unknown: an alternative statement is proven with only empirical quantities in the upper bound.

  • Simulations: questions

    • Can p = dim(V0) be chosen empirically from the risk plots?

    • Does the empirical angular measure after projection on the subspace learned by PCA provide better estimates than the classical one for the following risk-related quantities?

      (i) lim_{u→∞} P( p^−1 Σ_{1≤j≤p} Xj/‖X‖ > t(i) | ‖X‖ > u ) = H{ x : p^−1 Σ_{j=1}^{p} xj > t(i) }, for some t(i) ∈ (0, p^−1/2)

      (ii) lim_{u→∞} P( min_{1≤j≤p} Xj > u, max_{p+1≤j≤d} Xj ≤ u | ‖X‖ > u ) = ∫ ( (min_{1≤j≤p} xj)^α − (max_{p+1≤j≤d} xj)^α )_+ H(dx)

      (iii) lim_{u→∞} P( X1 > u | max_{1≤j≤d} Xj > u ) = ∫ (x1)^α H(dx) / ∫ (max_{1≤j≤d} xj)^α H(dx)

      (iv) lim_{u→∞} P( min_{1≤j≤d} Xj > u | ‖X‖ > u ) = ∫ (min_{1≤j≤d} xj)^α H(dx)

  • Simulations: models

    • d-dimensional vectors with limit measure concentrated on a p < d dimensional subspace.

    • Structure: p-dimensional max-stable model + d-dimensional Gaussian noise (absolute values), ρ = 0.2, σ² ∈ {10⁵/d, 10/d}.

    • Unit Fréchet margins with tail index α ∈ {1, 2}.

    • Dependence for the p-dimensional model:

      • Max-stable vector from the Dirichlet model (Coles & Tawn, 91; see Segers, 2012 for simulation), parameter (3, . . . , 3).

      • Other settings (not shown here).

    • n = 1000, k ∈ {5, 10, 15, . . . , 200}, 1000 replications.

  • Choice of p̂, Dirichlet model, p = 2, d = 10

    [Figure: mean empirical risk (left) and empirical risk for one sample (right) versus k, for PCA projecting onto a subspace of dimension 1 ≤ p̃ ≤ 10]

    → Choice p̂ = 2 obvious for small k; p̂ ∈ {2, 3} for k ≥ 50.

  • Performance for estimating failure probabilities: RMSEs related to the angular measure H

    [Figure: RMSEs for quantities (i)–(iv) based on Ĥ_{n,k} (black, solid), Ĥ^PCA_{n,k} (blue, dashed) and Ĥ^PCA_{n,k,10} (red, dash-dotted), versus k, in the Dirichlet model with parameter 3, p = 2 and d = 10]

    • PCA step with 10 observations → estimators relatively insensitive to the choice of k used for Ĥ_{n,k}.

  • Conclusion III (PCA)

    • Plotting the empirical risk is useful for choosing p̂.

    • In case of doubt, choose the highest plausible dimension.

    • For estimating failure probabilities, estimators including a PCA step are competitive; for probability (i) [concomitance of extremes] they are superior.

    • Choosing kPCA < k offers improved robustness w.r.t. the choice of k in the second step.

  • Bibliography I

    • Blanchard, G., Bousquet, O., & Zwald, L. (2007). Statistical properties of kernel principal component analysis. Machine Learning, 66(2-3), 259-294.

    • Chautru, E. (2015). Dimension reduction in multivariate extreme value analysis. Electronic Journal of Statistics, 9(1), 383-418.

    • Chiapino, M., & Sabourin, A. (2016). Feature clustering for extreme events analysis, with application to extreme stream-flow data. In International Workshop on New Frontiers in Mining Complex Patterns (pp. 132-147). Springer, Cham.

    • Chiapino, M., Sabourin, A., & Segers, J. (2019). Identifying groups of variables with the potential of being large simultaneously. Extremes, 22(2), 193-222.

    • Chiapino, M., Clémençon, S., Feuillard, V., & Sabourin, A. (2019). A multivariate extreme value theory approach to anomaly clustering and visualization. Computational Statistics, 1-22.

    • Cooley, D., & Thibaud, E. Decompositions of dependence for high-dimensional extremes. arXiv:1612.07190.

    • Cai, J.-J., Einmahl, J., & de Haan, L. (2011). Estimation of extreme risk regions under multivariate regular variation. Annals of Statistics.

  • Bibliography II

    • Goix, N., Sabourin, A., & Clémençon, S. (2015). Learning the dependence structure of rare events: a non-asymptotic study. COLT.

    • Drees, H., & Sabourin, A. Principal Component Analysis for multivariate extremes. arXiv:1906.11043.

    • Engelke, S., & Hitz, A. S. Graphical models for extremes. arXiv:1812.01734.

    • Gardes, L. (2018). Tail dimension reduction for extreme quantile estimation. Extremes, 21(1), 57-95.

    • Goix, N., Sabourin, A., & Clémençon, S. (2016). Sparse representation of multivariate extremes with applications to anomaly ranking. In AISTATS (pp. 75-83).

    • Goix, N., Sabourin, A., & Clémençon, S. (2017). Sparse representation of multivariate extremes with applications to anomaly detection. JMVA.

    • Simpson, E. S., Wadsworth, J. L., & Tawn, J. A. Determining the dependence structure of multivariate extremes. arXiv:1809.01606.

    • Janssen, A., & Wan, P. k-means clustering of extremes. arXiv:1904.02970.

  • More material on DAMEX


  • Estimation of the dependence structure: Φ(B) or µ[0, x]^c

    • Flexible multivariate models for moderate dimension (d ≃ 5): Dirichlet mixtures (Boldi & Davison 07; S. & Naveau 12), logistic family (Stephenson 09; Fougères et al., 13), pairwise Beta (Cooley et al.), . . .

    • Asymptotic theory: rates under second order conditions (Einmahl, 01); empirical likelihood (Einmahl & Segers, 09); asymptotic normality (Einmahl et al., 12, 15) (parametric).

    • Finite sample error bounds, non-parametric, on

      sup_{x ⪰ R} | µ̂n[0, x]^c − µ[0, x]^c |   (Goix, S., Clémençon, 15)

    Does not tell 'which components may be large together'.

  • DAMEX results: support recovery

    • Asymmetric logistic, d = 10, dependence parameter α = 0.1 → non-asymptotic data (not exactly Generalized Pareto).

    • K randomly chosen (asymptotically) non-empty faces.

    • Parameters: k = √n, ε = 0.1.

    • Heuristic for setting the minimum mass µ0: eliminate faces supporting less than 1% of the total mass.

    # sub-cones K               10     15     20     30     35     40     45     50
    Aver. # errors (n = 5e4)    0.01   0.09   0.39   1.82   3.59   6.59   8.06   11.21
    Aver. # errors (n = 15e4)   0.06   0.02   0.14   0.98   1.85   3.14   5.23   7.87

  • More material on CLEF and variants


  • CLEF method: relaxed constraints on the region of interest

    Initial regions of interest:

      Cα = {v ⪰ 0 : vj large for j ∈ α, vj small for j ∉ α}.   Question: µ(Cα) > 0?

    Modified regions (relaxed constraints, larger regions, more points):

      {v ⪰ 0 : vj large for j ∈ α},   more precisely

      Γα = {v ⪰ 0 : ∀j ∈ α, vj > 1};   µ(Γα) = lim_{t→∞} t P(∀j ∈ α, Vj > t).

    Alternative question: µ(Γα) > 0?

  • Problem statement (CLEF)

    Goal: estimate the family of subsets

      M = {α ⊂ {1, . . . , d} : µ(Γα) > 0}.

    Recall the initial problem: estimate S = {α : µ(Cα) > 0}.

    Lemma: 'equivalence' of the two problems DAMEX/CLEF

      α is a maximal element of S ⟺ α is a maximal element of M.

  • Conditional criterion in CLEF

    • One needs an empirical criterion for 'testing' dependence, i.e. µ(Γα) > 0: e.g. µ̂n(Γα) > µ0.

    • Issue: µ(Γα) decreases as |α| increases. How to set the threshold according to |α|?

    • Way around: condition upon joint exceedance of 'all but one' components:

      κα = lim_{t→∞} P( ∀j ∈ α, Vj > t | Vj > t for all but at most one j ∈ α ).

    Empirical criterion (a sketch follows below):

      κ̂α,t = ( Σ_{i=1}^{n} 1{V̂i,j > t for all j ∈ α} ) / ( Σ_{i=1}^{n} 1{V̂i,j > t for all but at most one j ∈ α} )

      M̂ = {α : κ̂α,t > κ0 and β ∈ M̂ for all β ⊂ α};   M̂0 = {maximal such α's}.
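
    A minimal sketch of the empirical criterion κ̂α,t: the ratio of the number of points exceeding t in all coordinates of α to the number exceeding t in all but at most one coordinate of α. Names are illustrative, not the authors' code.

    import numpy as np

    def kappa_hat(V_hat, alpha, t):
        """V_hat: (n, d) standardized data; alpha: list of coordinates; t: threshold."""
        n_exceed = (V_hat[:, alpha] > t).sum(axis=1)       # per point: how many coordinates of alpha exceed t
        num = np.sum(n_exceed == len(alpha))               # all coordinates of alpha exceed t
        den = np.sum(n_exceed >= len(alpha) - 1)           # all but at most one exceed t
        return num / den if den > 0 else 0.0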

  • Coping with combinatorial complexity: O(2^d) subsets!

    • Good news: µ(Γα) = 0 ⇒ ∀β ⊃ α, µ(Γβ) = 0 → follow the Hasse diagram.

    CLEF algorithm (CLustering Extreme Features; Chiapino, S., 16) — a sketch is given below.

    • Start with pairs: Â2 = {α : |α| = 2, κ̂(α) > κ0}, with κ̂ the dependence summary from the previous slide.

    • Stage k: Âk = {α : |α| = k, κ̂(α) > κ0}; → candidates for Âk+1: {α : |α| = k + 1, ∀β ⊂ α s.t. |β| = k, β ∈ Âk}. Not too many!

    • If Âk = ∅, return M̂0 = the maximal elements of ∪_{j<k} Âj.
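
    An illustrative apriori-style sketch of this traversal, reusing the kappa_hat helper above: grow candidate subsets level by level, keep α only if κ̂(α) > κ0 and all its subsets of size |α| − 1 were kept, and return the maximal retained subsets. It omits the refined stopping tests discussed next.

    from itertools import combinations

    def clef(V_hat, t, kappa_0=0.05):
        """Returns the maximal subsets alpha with kappa_hat above kappa_0 (sketch of M_hat_0)."""
        d = V_hat.shape[1]
        retained = []
        level = [frozenset(a) for a in combinations(range(d), 2)
                 if kappa_hat(V_hat, list(a), t) > kappa_0]           # A_hat_2: start with pairs
        while level:
            retained.extend(level)
            level_set = set(level)
            candidates = {a | {j} for a in level for j in range(d) if j not in a}
            level = [c for c in candidates
                     if all(c - {j} in level_set for j in c)           # all size-k subsets retained
                     and kappa_hat(V_hat, sorted(c), t) > kappa_0]
        return [a for a in retained if not any(a < b for b in retained)]   # maximal elements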

  • Refined statistical analysis (Chiapino et al., 2019a)

    How to turn the heuristic stopping criterion 'κ̂α < κ0' into a statistical test with a controllable asymptotic level?

  • Strategy 1: testing H0 : κα > κ0

    Theorem (Chiapino, S., Segers, 2019)

    Under the conditions of Einmahl et al., 12, Th. 4.6 (second order condition + smoothness of ℓ), if κα > 0,

      √k ( κ̂α − κα ) →_W Z_{κ,α},

    a centered Gaussian variable whose variance involves unknown quantities that can be estimated empirically.

    • Proof: κα is a function of the (λβ : β ⊂ α), with λβ = µ{v ∈ Rd+ : ∃j ∈ β, vj > 1} (extremal coefficient), + ∆-method.

    • Corollary: τα,n = 1{ κ̂α < κ0 + qδ √(σ̂²α / k) }, where qδ is the δ-quantile of N(0, 1), is a test for H0 : κα > κ0 of asymptotic level δ.

  • Strategy 2: estimating a tail dependence coefficient ηα

    • Issue with strategy 1: κ0 > 0 is arbitrary but unavoidable. Indeed, testing 'µ(Γα) > 0' is not manageable in this framework, because the limit distribution degenerates as µ(Γα) → 0 (or κα → 0).

    • Alternative framework: Ledford & Tawn's model (multivariate case: de Haan & Zhou, 11; Eastoe & Tawn, 12):

      P(∀j ∈ α : Vj > t) = t^{−1/ηα} Lα(t), where Lα is slowly varying.   (3)

    • For our purposes: µ(Γα) > 0 ⇒ (3) holds with ηα = 1.

    • Consequence: any test for H̃0 : ηα = 1 (simple hypothesis) is also a test for H0 : µ(Γα) > 0.

  • Estimation: using a Pickands estimator η̂α,P (Peng, 99, bivariate case) or a Hill estimator η̂α,H (Draisma et al., 04; Drees, 98a,b).

    • Technical challenge: accounting for unknown margins (V ← V̂, using the empirical cdf).

    • Theorems (Chiapino, S., Segers, 2019): under H0, √k(η̂α,P − 1) and √k(η̂α,H − 1) both converge towards Gaussian limits whose variances involve unknown quantities that can be estimated empirically.

    • Tools for the proofs:

      • for η̂α,P: ∆-method + Einmahl et al. (12);

      • for η̂α,H: extending the proofs of Draisma et al. (04) and Drees (98a,b) from the bivariate to the multivariate case.

  • Experiments: d = 100, asymmetric logistic (AL) model with random perturbation

    • Comparison between CLEF and the three considered variants (heuristic stopping criterion replaced with a test).

    • M = {α : µ(Γα) > 0} = 80 randomly chosen subsets.

    • Perturbation: for each i ≤ n, Xi ∼ perturbed AL distribution: each α ∈ M is augmented with a randomly chosen ji ∈ {1, . . . , d} \ α.

    • The DAMEX algorithm (Goix, S., Clémençon, 16, 17) fails (miserably).

            # true recovered   # false α ⊂ β ∈ M   # false α ⊃ β ∈ M   # other false
    η̂H      79.0 (1.4)         2.4 (3.4)            0.04 (0.2)           18.0 (7.0)
    η̂P      79.6 (0.7)         1.0 (2.5)            0.0 (0.0)            3.4 (2.8)
    κ̂       71.1 (2.3)         7.4 (4.7)            5.1 (2.1)            28.0 (13.3)
    CLEF    69.9 (4.4)         16.2 (8.1)           0.5 (0.6)            2.3 (2.2)

    50 datasets, n = 5e4, k = 150, confidence level 0.001; κ0 = 0.05 (CLEF). Tests based on η̂P are modified to accommodate the case µ̂(Γα) ≤ 0.05.

  • More on PCA for extremes


  • PCA for extremes: uniform risk bound II

    • R̂t: empirical risk conditional on {‖X‖ > t}.

    • Data-dependent bound (with empirical moments instead of St):

    Theorem 4 (Drees, S., 20++)

    For all ℓ > 1 and u, v > 0,

      P( sup_{S ∈ Ep} |R̂t(S) − Rt(S)| ≥ [ (p ∧ (d − p)) ( S̃t/(ℓ − 1) + v/ℓ ) ]^{1/2} + u | Nt = ℓ )
        ≤ 2 exp(−2ℓu²) + exp(−⌊ℓ/2⌋ v²/2),

      with S̃t := Nt^−1 Σ_{i=1}^{n} ‖Θi,t‖⁴ − tr( ( Nt^−1 Σ_{i=1}^{n} Θi,t Θi,t⊤ )² ) and Θi,t = Θi 1{‖Xi‖ > t}.

    (Adapting arguments from Blanchard et al. 07, conditioning upon ‖X‖ > t.)

    • Corollary: confidence interval for Rt.

  • Mean distance ρ(Ŝn,k, S0) vs. k

    [Figure: |||Π_{Ŝn,k} − Π_{S0}||| as a function of k, in the Dirichlet model with parameter 3, p = 2 and d = 10]

    • Performance deteriorates quickly → choose kPCA = 10 < k (angular measure).

  • PCA: choice of p̂, p = 5, d = 100

    [Figure: mean empirical risk for PCA projecting onto a subspace of dimension 1 ≤ p̃ ≤ 10 (left) and distance ρ(Ŝn,k, S0) with p̂ = p (right), versus k, in the Dirichlet model with parameter 3]

    • p̂ ∈ {4, 5, 6}: reasonable. kPCA should remain small (again).

  • RMSEs for probabilities (i)–(iv), Dirichlet model, d = 100, p = 5

    [Figure: estimators based on Ĥ_{n,k} (black), Ĥ^PCA_{n,k} (blue, dashed) and Ĥ^PCA_{n,k,10} (red, dash-dotted) vs. k]

  • RMSEs for probabilities (i)–(iv), randomly rotated Dirichlet model, d = 10, p = 3

    [Figure: estimators based on Ĥ_{n,k} (black, solid), Ĥ^PCA_{n,k} (blue, dashed) and Ĥ^PCA_{n,k,10} (red, dash-dotted) vs. k]
