abstract arxiv:2103.04556v2 [cs.ds] 14 apr 2021

T-SCI: A Two-Stage Conformal InferenceAlgorithm with Guaranteed Coverage for Cox-MLP

Jiaye Teng * 1 Zeren Tan * 1 Yang Yuan 1

AbstractIt is challenging to deal with censored data, wherewe only have access to the incomplete informationof survival time instead of its exact value. Fortu-nately, under linear predictor assumption, peoplecan obtain guaranteed coverage for the confidenceband of survival time using methods like Cox Re-gression. However, when relaxing the linear as-sumption with neural networks (e.g., Cox-MLP(Katzman et al., 2018; Kvamme et al., 2019)),we lose the guaranteed coverage. To recover theguaranteed coverage without linear assumption,we propose two algorithms based on conformalinference under strong ignorability assumption.In the first algorithm WCCI, we revisit weightedconformal inference and introduce a new non-conformity score based on partial likelihood. Wethen propose a two-stage algorithm T-SCI, wherewe run WCCI in the first stage and apply quan-tile conformal inference to calibrate the results inthe second stage. Theoretical analysis shows thatT-SCI returns guaranteed coverage under milderassumptions than WCCI. We conduct extensiveexperiments on synthetic data and real data usingdifferent methods, which validate our analysis.

1. IntroductionIn survival analysis, censoring indicates that the value ofinterest (survival time) is only partially known (e.g., theinformation can be t > 5 instead of t = 7). It is commonand inevitable in numerous fields, including medical care(Robins & Finkelstein, 2000; Klein & Moeschberger, 2006),astronomy (Feldmann, 2019), finance (Bellotti & Crook,2009), etc. It is an annoying issue since ignoring or deletingcensored data causes bias and inefficiency (Nakagawa &Freckleton, 2008), as illustrated in Figure 1.

When dealing with censored data, we usually focus on theconfidence band of the survival time since confidence bands

∗Equal contribution 1IIIS, Tsinghua University, China. Email:{tjy20, tanzr20}@mails.tsinghua.edu.cn. Correspondence to:Yang Yuan < [email protected] >.

Figure 1. Bias in censoring. Ignoring or deleting censored data(light green points) causes bias compared to the ground truth linearapproximation (red line), where the linear approximation undercensoring (black line) does not overlap the ground truth (red line).

give a more conservative estimation than point estimation.Under linear assumptions on covariate effect (See Assump-tion 1), one can derive the survival time distribution usingCox regression by asymptotic normality of linear coefficient.It further leads to guaranteed coverage, meaning that sur-vival time provably falls into the confidence band with highprobability (larger than the given confidence level).

However, the linear assumption broadly harms its perfor-mances and restricts its applications. After all, the reality isnot always entirely linear. To relax the linear assumption,Katzman et al. (2018); Kvamme et al. (2019) applies neuralnetworks into Cox regression (Cox-MLP), yielding the bestperformance in terms of some metrics such as Brier scoreand binomial log-likelihood. Unfortunately, the confidenceband in Cox-MLP has no guaranteed coverage since onecannot expect neural network converges to the expectedfunction, not to mention the asymptotic normality.

In general, we can use conformal inference to recover theconfidence band with guaranteed coverage by splitting the

arX

iv:2

103.

0455

6v2

[cs

.DS]

14

Apr

202

1

T-SCI: A Two-Stage Conformal Inference Algorithm with Guaranteed Coverage for Cox-MLP

dataset into training and calibration set (Vovk et al., 2005;Nouretdinov et al., 2011; Lei & Candes, 2020). One ofits advantages is that conformal inference does not harmthe model performance since it is post-hoc. Therefore, wepropose to apply conformal inference into Cox-MLP.

When applying conformal inference into Cox-MLP, thereare several problems to be solved. Firstly, Cox regressiondoes not return survival time explicitly, which requires amodification of the non-conformity score. Secondly, censor-ing causes covariate shift under strong ignorability, meaningthat the covariate distribution differs in censored and uncen-sored data. Therefore, we cannot apply conformal inferencedirectly. Thirdly, we need to consider the estimation errorand provide theoretical guarantees for the coverage.

In this paper, we propose a new non-conformity score un-der strong ignorability assumption (which is a standard as-sumption in weighted conformal inference. We refer tomore details in Section 3) based on the partial likelihoodof Cox regression. This non-conformity score does notneed an explicit estimation of the survival time. We thenapply weighted conformal censoring inference (WCCI), aweighted conformal inference based on this non-conformityscore inspired by Tibshirani & Foygel (2019) to deal withthe covariate shift problem. Furthermore, inspired by Ro-mano et al. (2019), we provide a two-stage conformal in-ference (T-SCI) which returns “nearly perfect” coverage,meaning that the coverage has not only guaranteed lowerbound but also upper bound. Inspired by (Lei & Candes,2020), we provide theoretical guarantees for both WCCIand T-SCI algorithms.

We summarize our contributions as follows:

• We provide coverage for Cox-MLP in WCCI based onweighted conformal inference frameworks by introduc-ing a new non-conformity score.

• We further propose a T-SCI algorithm based on thequantile conformal inference framework. We showthat T-SCI returns nearly perfect guaranteed coverage,namely, theoretical guarantees for coverage’s lowerand upper bound.

• We conduct extensive experiments on both syntheticdata and real-world data, showing that the T-SCI-basedalgorithm outperforms other approaches in terms ofempirical coverage and interval length.

2. Related workCensored data analysis. An early analysis of censoreddata can be dating back to the famous Kaplan–Meier esti-mator Kaplan & Meier (1958). However, this approach is

valid only when all patients have the same survival func-tion. Therefore, several individual-level analysis is pro-posed, such as proportional hazard model (Breslow, 1975),accelerated failure time model (Wei, 1992) and Tree-basedmodels (Zhu & Kosorok, 2012; Li & Bradic, 2020).

On the other hand, researchers apply machine learning tech-niques to deal with censored data (Wang et al., 2019). Forexample, random survival forests (Ishwaran et al., 2008)train random forests using the log-rank test as the splittingcriterion. Moreover, DeepHit (Lee et al., 2018) apply neu-ral networks to estimate the probability mass function andintroduce a ranking loss. This paper mainly focuses on theCox-based model, one of the famous proportional hazardmodel branches.

Cox regression was first proposed in Cox (1972), whichis a semi-parametric method focusing on estimating thehazard function. Among all its extensions, Akritas et al.(1995) first proposed using a one-hidden layer perceptronto replace the linear predictor of the coefficient. However,it generally failed mainly due to the low expressivity of theone-hidden layer perceptron (Xiang et al., 2000; Sargent,2001). Therefore, Katzman et al. (2018) proposed to use themulti-layer perceptron instead of the one-layer perceptron(DeepSurv). Furthermore, (Kvamme et al., 2019) generalizethe idea to the non-proportional hazard settings. In thispaper, we unify their names as Cox-MLP when the contextis clear. However, this line of work lacks a theoreticalguarantee. This paper tries to fill this blank and propose thefirst guaranteed coverage of the survival time.

Conformal inference was pioneered by Vladimir Vovk andhis collaborators [e.g., Vovk et al. (2005); Shafer & Vovk(2008); Nouretdinov et al. (2011)], focusing on the infer-ence of response variables by splitting a training fold anda calibration fold. There are several variations of confor-mal inference. For example, weighted conformal inference(Tibshirani & Foygel, 2019) focus on dealing with the co-variate shift phenomenon, and quantile conformal inference(Romano et al., 2019) returns coverage with not only anupper bound but also the lower bound guaranteed coverage.Conformal inference, as well as its variations, are widelystudied and used in Lei et al. (2013); Lei & Wasserman(2014); Lei et al. (2018); Barber et al. (2019b); Sadinle et al.(2019); Romano et al. (2020).

Recently, Lei & Candes (2020) apply conformal inferenceunder counterfactual settings and derive a double robustguarantee for their proposed methods. Our work is partiallyinspired by Lei & Candes (2020) but considers a differentcensoring setting. Furthermore, we propose a different algo-rithm and derive a “nearly perfect” guaranteed coverage. Avery recent work (Candes et al., 2021) focuses on a similarcensoring scenario. Unlike our approach (T-SCI), Candeset al. (2021) relax the strong ignorability assumption but re-


quire all information of censoring time to obtain confidencebands. We emphasize that it is still an open problem on de-riving guaranteed confidence bands under general censoringscenarios.

There are some approaches to apply conformal inferenceunder censoring settings. For example, Bostr et al. (2017);Bostrom et al. (2019) apply conformal inference into ran-dom survival forests, and Chen (2020) derive the confidenceband for DeepHit. However, one cannot apply these ap-proaches to Cox-based models since Cox-based models donot explicitly return a predicted survival time.

3. PreliminaryDenote X ∼ PX ⊂ Rd as the covariate, T ∼ PT ⊂ R asthe survival time, and C ∼ PC ⊂ R as the censoring time.Denote the joint distribution as PXTC , its marginal distribu-tion as PTX , and its conditional distribution as PT |X . Thesurvival time cannot be observed when it is larger than thecensoring time. Therefore, observed time is the minimumof censoring time and survival time. Let Y ∈ R be theobserved time with uncensoring indicator ∆ ∈ R, then

Y = min{T,C}, ∆ = I{T≤C}.

Denote the dataset as Z = {Xi, Yi,∆i}i∈I where we canonly access the covariate, the observed time, and the censor-ing indicator. However, the value of interest is the survivaltime T . Therefore, the censored data is incomplete due tothe information loss when ∆ = 0. We next introduce howcensoring happens, namely, the censoring mechanism.

Censoring Mechanism. In this paper, we consider the cen-soring regimes with strong ignorability assumption, namelyT ⊥∆ | X . We will further discuss the strong ignorabilityassumption in the supplementary materials.

However, note that there can be covariate shift under suchcensoring regimes, namely, the distributions of covariate Xunder censoring and non-censoring are different

(X |∆ = 1)d

6= (X |∆ = 0) .

We refer to Figure 2 for an illustration. Usually, we estimatethe distribution of the survival time via the datasetD by Coxregression. We denote FT (t) as the survival time’s CDF.

Cox Regression. In Cox regression, we focus on two impor-tant terms survival function ST (t) and cumulative hazardfunction ΛT (t), as defined in Equation 1. We emphasizethat they are defined with respect to the survival time Tinstead of the observed time Y .

ST (t) , 1− FT (t), ΛT (t) , − logST (t). (1)

When the context is clear, we omit the subscript T and de-note the above function as F (t), S(t), and Λ(t), respectively.

Covariates

Dens

ity

uncensoredcensored

Figure 2. Covariate shift illustration. We show distributions ofx1 with (blue) and without (green) censoring under simulationdata. Obviously, censored and uncensored data have differentdistributions, i.e. covariate shift1.

Cox-based models usually require proportional hazard as-sumption, formally stated in Assumption 1.

Assumption 1 (Proportional Hazard) For each individ-ual i, we assume

Λi(t;Xi) = Λ0(t) exp (g(Xi)) (2)

where Λ0(t) is the baseline cumulative hazard function, andg(Xi) is the individual effect named as predictor.

Remark: There are several non-proportional hazard Coxmodels which replace g(xi) with g(xi, t) (e.g., Kvammeet al. (2019)). Although our proposed algorithm can bedirectly generalized to non-proportional settings, we onlyconsider proportional hazard models for clarity in this paper.

Specifically, Cox regression solves the case when the pre-dictor g(·) is linear by maximizing partial log-likelihoodl(g), defined in Equation 3.

l(g) ,∑

j:∆j=1

log

∑k∈R(Tj)

exp[g(Xk)− g(Xj)]

, (3)

where R(Tj) is the set of all individuals at risk at time Tj−(the observed time is no less than Tj), and g(Xj) = X>j βis the linear predictor.

For the case when g(·) is not linear, Lee et al. (2018) andKvamme et al. (2019) propose Cox-MLP which uses neuralnetworks to replace the linear predictor g(Xi). Concretely,they use the negative partial log-likelihood as the trainingloss, with a penalty on the complexity of g(·).

Conformal inference. Conformal inference does post-hocestimation based on the splitting of training and calibration

1The simulation dataset is from Kvamme et al. (2019)


fold. One trains a model µ using the training fold and thencalculates the non-conformity score Vi on the calibrationfold. A commonly used non-conformity score is the absoluteerror Vi = |Ti−µ(Xi)|where (Xi, Ti) is the ith calibrationsample. When designing the non-conformity score, one ofthe most important characteristics is exchangeability.

Assumption 2 (Exchangeability) For n ≥ 1 random vari-ables V1, . . . , Vn, they satisfy exchangeability if

(V1, . . . , Vn)d= (Vπ(1), . . . , Vπ(n))

for any permutation π : [n] → [n], where d= means they

have the same distribution, and [n] = {1, 2, . . . , n}.

Exchangeability is weaker than independence since indepen-dence implies exchangeability. Under the exchangeabilityassumption on the non-conformity score, we reach a the-oretical guarantee of the confidence band. In this paper,we mainly focus on two varieties of conformal inference,weighted conformal inference (to deal with covariate shift,Lemma 1) and quantile conformal inference (to return anearly perfectly guarantee, Lemma 2), respectively.

Lemma 1 (WCI, Tibshirani & Foygel (2019) Theorem 2)(Informal.) Assume that the data (Xi, Ti), i ∈ [n+ 1] areweighted exchangeable with weight w1, . . . , wn+1, then thereturned confidence band Cn has guaranteed coverage:

P(Tn+1 ∈ Cn(Xn+1)) ≥ 1− α.

Lemma 2 (QCI, Romano et al. (2019) Theorem 1)(Informal.) Assume that the data (Xi, Ti), i ∈ [n+ 1] areexchangeable, and the non-conformity scores are almostsurely distinct, then the returned confidence band Cn isnearly perfectly calibrated:

1− α ≤ P(Tn+1 ∈ Cn(Xn+1)) ≤ 1− α+1

|Ica|+ 1,

where Ica denotes the number of calibration samples.

Remark. One may wonder why not use S(t) to return theconfidence band in Cox-MLP. People usually do so in prac-tice, but the confidence band has no theoretical guarantee inCox-MLP. To derive the theoretical guarantee under S(t),one needs to show the convergence of the predictor g(Xi).However, it cannot be proved unless the generalization guar-antee of neural networks is obtained.

4. Confidence band for the Survival TimeIn this section, we derive the confidence band for the sur-vival time. We start by analyzing the basic properties andthe critical ideas before proposing the algorithm.

9 8 7 6Nonconformity Score

0.0

0.2

0.4

0.6

0.8

1.0

Dens

ity

Figure 3. Non-conformity score distribution. We show the dis-tribution of the non-conformity score2, which has a well-shapedsingle peak (more stable, see appendix for more details). The darkblue line is the KDE approximation.

The non-conformity score. In traditional conformal infer-ence, the most commonly used non-conformity score is theabsolute form Vi = |Ti − Ti|. That is because we can di-rectly derive the confidence band of Ti based on the bandof the non-conformity score Vi. Unfortunately, it is hardto compute Ti in Cox-based models since they output thesurvival hazard function instead of the survival time, whichrequires a modification of the non-conformity score.

Inspired by the standard Cox regression, we introduce anon-conformity score based on partial likelihood. Specifi-cally, we use the sample partial log-likelihood as the non-conformity score (shown in Equation 4). Compared to theabsolute non-conformity score, the newly proposed non-conformity score Vi = Vg(Xi, Ti) do not need to calculateTi explicitly. Figure 3 illustrates the distribution of thenon-conformity score using a simulation dataset.

Vi = log

∑k∈R(Ti)

exp[g(Xk)− g(Xi)]

. (4)

The incomplete data. Compared to the i.i.d. (independentand identically distributed) data, censored data is incomplete.Notice that we can only calculate the non-conformity scoresof the uncensored data in Equation 4 since we cannot obtainthe exact value of Ti of censored data. As a result, thereturned confidence band is guaranteed under uncensoreddistribution PX|∆=1. However, as mentioned before, theremight be a distribution shift between censored data anduncensored data, leading to the requirement of weightedconformal inference (Tibshirani & Foygel, 2019).

In weighted conformal inference, one needs to calculate the

2The simulation dataset is from Kvamme et al. (2019)


Algorithm 1 WCCI: Weighted Conformal Censoring Inference

Input: Level αInput: data Z = (Xi, Yi,∆i)i∈I , testing point X ′

Input: function w(x;D) to fit the weight function at x using D as data1. Randomly split Z into a training fold Ztr , (Xi, Yi,∆i)i∈Itr and a calibration fold Zca , (Xi, Yi,∆i)i∈Ica2. Use Ztr to train g(·) to estimate the predictor function3. For each i ∈ Ica with ∆i = 1, compute the non-conformity score Vi = log(

∑k∈R(Ti)∩Itr exp[g(Xk)− g(Xi)])

4. For each i ∈ Ica with ∆i = 1, compute the weight Wi = w(Xi;Ztr)5. Calculate the normalized weights pi = Wi∑

i∈Ica Wi+w(X′;Ztr) and p∞ = w(X′;Ztr)∑i∈Ica Wi+w(X′;Ztr)

6. Calculate the (1− α)-th quantile Q1−α of the distribution∑i∈Ica piδVi + p∞δ∞

7. Calculate Tu(X ′) as the smallest value such that its conformity score V ′ (dependent on Tu(X ′)) is larger than Q1−αOutput: C1(X ′) = [0, Tu(X ′)].

weight w based on w(x), where

w(x) =dPX(x)

dPX|∆=1(x)=

P(T = 1)

P(T = 1 | X = x). (5)

Intuitively, w(x) helps transfer the guarantee over PX|∆=1

to PX . In the following of the paper, we use the regularizedweight pi in the algorithm, namely

pi =w(Xi)∑

i∈Ica w(Xi) + w(X ′),

where we denote Xi as the samples from the calibrationfold and X ′ as the testing point.

Exchangeablility. In conformal inference, we assume thatthe non-conformity score satisfies exchangeablility (See As-sumption 2). However, as shown in Equation 4, there isa summation term in the non-conformity score, where weneed to sum up all the samples at risk at time Tj−. Unfortu-nately, it breaks the exchangeablility when we use the at-risksamples in the calibration fold. As an alternative, we usethe at-risk samples in the training fold. The non-conformityscores in the calibration fold Vi, i ∈ [n] then satisfies ex-changeablility given the training fold (See Equation 6).

(V1, . . . , Vn | Dtr)d= (Vπ(1), . . . , Vπ(n) | Dtr), (6)

with arbitrary permutation π.

The reconstruction of confidence band. Based on the non-conformity score proposed in Equation 4, we can reconstructthe confidence band for the survival time. We remark thatwe calculate the one-sided band [0, Tu(X ′)] although thetwo-sided band directly follows. For a new sample X ′, wefirst calculate the (1−α) quantile of (weighted) distributionof the non-conformity score, denoted as Q1−α. We thencalculate the smallest Tu(X ′) that makes its non-conformity

score V ′ = Vg(X′, T ) larger than Q1−α, formally,

Q1−α = Quantile

(1− α,

∑i∈Ica

piVi

)Tu(X ′) = inf {T : Vg(X

′, T ) ≥ Q1−α} .

Algorithm. Based on the discussions above, we concludeour algorithm in Algorithm 1. In training process, we usetraining fold to train the estimated predictor g(X) (definedin Equation 2) and calculate the weight function (defined inEquation 5). We then use calibration fold to calculate thenon-conformity score. We finally construct confidence band[0, Tu(X ′)] for a given testing point X ′.

Robustness on w. The weighted conformal inference hasguaranteed coverage under the true weight w(x) as shownin Lemma 1. However, we can only obtain its estimatorw(x) in practice. In the following Theorem 4.1, we provehow the estimation influences the coverage. Note that aslim|Ztr|→∞ |w(x)− w(x)| → 0, the coverage can be prov-ably larger than 1− α.

Theorem 4.1 (Provable Guarantee) Let w(x) be an esti-mate of the weight w(x). Assume that E[w(X)|Ztr] = 1and E[w(X)] = 1. Denote C1

n(x) as the output band ofAlgorithm 1 with n calibration samples, then for a new dataX ′, its corresponding survival time satisfies

limn→∞

P(T ′ ∈ C1

n(X ′))≥ 1− α− 1

2E|w(X)− w(X)|,

where the probability P on the left hand side is taken over(X ′, T ′) ∼ PX × PT |X , and all the expectation operatorsE are taken over X ∼ PX|∆=1.

Theorem 4.1 proves the lower bound for WCCI. However, itis insufficient to derive the upper bound under the weightedconformal inference framework. To derive the upper bound,we propose the algorithm T-SCI in the next section.


Algorithm 2 T-SCI: Two-Stage Conformal Inference

Input: Level αInput: additional data Zca2 = (Xi, Yi,∆i)i∈Ica2

, testing point X ′

Input: First-stage band [qαlo(X ′;Ztr,Zca1), qαhi

(X ′;Ztr,Zca1)] output from Algorithm 1 3

Input: function w(x;D) to fit the weight function at x using D as data1: For each i ∈ Ica2 with ∆i = 1, compute the non-conformity score Vi = max{qαlo

(Xi;Ztr)−Ti, Ti− qαhi(Xi;Ztr)}

2: For each i ∈ Ica2 with ∆i = 1, compute the weight Wi = w(Xi;Ztr)3: Calculate the normalized weights pi = Wi∑

i∈Ica Wi+w(X′;Ztr) and p∞ = w(X′;Ztr)∑i∈Ica Wi+w(X′;Ztr)

4: Calculate η as the (1− α)-th quantile of the distribution∑i∈Ica piδVi

+ p∞δ∞

Output: C2(X ′) = [qαlo(X ′;Ztr,Zca1)− η, qαhi

(X ′;Ztr,Zca1) + η].

5. T-SCI: Improved EstimationIn the previous section, we propose WCCI, which has lowerbound coverage guarantees. However, for better data effi-ciency, we want not only lower bound but also upper bounds.To reach the goal, we propose a two-stage algorithm T-SCI,which returns nearly perfect coverage.

The intuition is from Lemma 2, stating that the quantileconformal inference (QCI) returns a nearly perfect guar-antee. Inspired by Lemma 2, we first use Algorithm 1 toreturn a temporal confidence band and then apply QuantileConformal Inference to modify this band. We summarizethe whole algorithm in Algorithm 2.

Remark 1. To apply T-SCI in practice, we split the datasetinto Ztr, Zca1 and Zca2. We first apply Algorithm 1 withZtr and Zca1 and return a confidence band. We then docalibration in Algorithm 2 with Zca2.

Remark 2. In Algorithm 2, we modify the conformal scoreto be the absolute form again for clarity, since we havealready derived an interval estimator of T in Algorithm 1.

Note that there are two conformal inference procedures inAlgorithm 2. When the context is clear, the weight functionw(x) and the non-conformity score Vi refers to those in thesecond conformal inference. Before stating the theorem, wedenote H(X) to measure how well the quantile estimatorsqαlo

(X), qαhi(X) are.

H(X) = max{|qαlo(X)−qαlo

(X)|, |qαhi(X)−qαhi

(X)|}(7)

We next show in Theorem 5.1 that Algorithm 2 returnsa guaranteed coverage either the weight or the temporalconfidence band is estimated well.

Theorem 5.1 (Lower Bound) Let w(x) be an estimate ofthe weight w(x), qαlo

(x), qαhi(x) be the quantile estimator

returned by WCCI, and H(X) be defined as Equation 7.Assume that E[w(X)|Ztr] = 1 and E[w(X)] = 1, where allthe expectation operators E are taken over X ∼ PX|∆=1.

3We use the two-sided band here for generality, and the one-sided band directly follows.

Denote C2n(x) as the output band of Algorithm 2 with n

calibration samples, and denote X ′ as the testing point.

From the weight perspective, under assumptions (A1):A1. EX|∆=1|w(X)− w(X)| ≤M1,we have:

limn→∞

P(T ′ ∈ C2

n(X ′))≥ 1− α− 1

2M1.

From the quantile perspective, under assumptions (B1-B3):B1. H(X) ≤M2 a.s. w.r.t. X;B2. There exists δ > 0 such that Ew(X)1+δ <∞;B3. There exists γ, b1, b2 > 0 such that P(T = t|X = x) ∈[b1, b2] uniformly over all (x,t) with t ∈ [qαlo

(x)− 2M2 −2γ, qαlo

(x)+2M2 +2γ]∪ [qαhi(x)−2M2−2γ, qαhi

(x)+2M2 + 2γ],we have:

limn→∞

P(T ′ ∈ C2

n(X′))≥ 1−α−b2(2M2+γ)−

16M2

(M2 + γ)2b1.

Theorem 5.1 demonstrates that when M1 → 0 or M2, γ →0 as |Ztr| → 0, the confidence band has guaranteed cover-age with lower bound 1 − α. Compared to Theorem 4.1,Theorem 5.1 is doubly robust since the coverage is guaran-teed when either (A1) or (B1-B3) holds. We next prove theupper bound in Theorem 5.2.

Theorem 5.2 (Upper Bound) Let w(x) be an estimate ofthe weight w(x), qαlo

(x), qαhi(x) be the quantile estimator

returned by WCCI, and H(X) be defined as Equation 7. As-sume that E[w(X)|Ztr] = 1 and E[w(X)] = 1, where allthe expectation operators E are taken over X ∼ PX|∆=1.Let FV ,

∑i∈Ica piδVi

+ p∞δ∞ be CDF, and assume Vihas no ties. Denote C2

n(x) as the output band of Algorithm 2with n calibration samples, and X ′ as the testing point.Under assumptions (C1-C4):C1. ∀S ⊂ Ica2, |

∑i∈S(w(Xi)− w(Xi))| ≤M ′1;

C2. H(X) ≤M ′2 a.s. w.r.t. X;C3. FV (t+ L)− FV (t) ≥ KL for all t, L;C4. there exists b1, b2 > 0 such that P(T = t|X =x) ∈ [b1, b2] uniformly over all (x,t) with t ∈ [qαlo

(x) −


r, qαlo(x) + r] ∪ [qαhi

(x) − r, qαhi(x) + r], where r =

2M ′2 + 2M ′1/K We have

limn→∞

P(T ′ ∈ C2

n(X ′))≤ 1− α+ b2(2M ′2 +

M ′1K

).

Theorem 5.2 demonstrates that when the weight functionw(x) and the quantile function qαlo

(x), qαhi(x) are esti-

mated well (C1-C2), the returned coverage T-SCI has alower bound guarantee. Combining Theorem 5.1 and Theo-rem 5.2 leads to a “nearly perfect” guaranteed coverage forT-SCI.

6. ExperimentsThis section aims at verifying some key arguments: (1) T-SCI returns valid coverage with small length interval; (2)Weight plays an essential role in the algorithm; (3) Cen-soring is more challenging than uncensoring settings. Theresults support these arguments both in synthetic data andreal-world data, see Table 1.

6.1. Setup

Datasets and Environment. We conduct extensive exper-iments to test the efficiency of the algorithm. We use onesynthetic dataset RRNLNPH (from Kvamme et al. (2019))and two real-world datasets, METABRIC and SUPPORT(See supplementary materials for more details). For eachdataset, we test several baseline algorithms along with ourproposed algorithms. In each run, 80% data are randomlysampled as the training data, and the two halves of the restare randomly split as calibration data and test data. We runthe experiments 100 times for each algorithm. Moreover,we collect the results under different αs.

Algorithms. We choose the linear Cox regression (labeledas Cox Reg.) as a baseline. Besides, we conduct CoxPH(Katzman et al., 2018) and CoxCC (Kvamme et al., 2019)(both belong to Cox-MLP) using S(t) to return confidenceband although they do not contain a theoretical guarantee.We also choose a kernel-based non-Cox method, (Chen,2020) (labeled as Kernel). Besides, we conduct RSF (Ish-waran et al., 2008) (labeled as Random Survival Forest),Nnet-Survival (Gensheimer & Narasimhan, 2019) (labeledas Nnet-Survival), and MTLR (Yu et al., 2011) (labeled asMTLR) as benchmarks. We emphasize that these methodsare different approaches since our method is Cox-based.

We integrate our proposed algorithms WCCI and T-SCI withCoxPH and CoxCC, labeled as CoxPH/CoxCC + WCCI/T-SCI. Besides, to ensure that the weights in WCCI and T-SCIare helpful, we test the unweighted version of WCCI and T-SCI, where we set all weights to 1 in WCCI and T-SCI. Wesummarize all the experimental results under significancelevel α = 95% in Table 1. Ideally, a perfect method returns

CoxPH+WCCI CoxPH+T-SCI CoxCC+WCCI CoxCC+T-SCIAlgorithm

0.88

0.9

0.92

0.940.950.96

0.98

Cove

rage

weightedunweighted

Figure 4. Weight Rationality. We compare the weighted version(green) with its corresponding unweighted version (yellow). Theweighted versions contain less bias for WCCI (closer to the ex-pected 95%) and less variance for T-SCI (shorter boxes).


0.75

0.80

0.85

0.90

0.95

1.00

Cove

rage

censoreduncensored

Figure 5. Censoring comparison. We compare model perfor-mances on censored (green) and uncensored (uncensored) dataseparately. All the algorithms show a larger coverage and lessvariance on censored data.

coverage slightly larger to 95% with small interval lengthwhile being balanced in censored and uncensored data.

Metrics. In the synthetic dataset, our core metric is empiri-cal coverage (EC), defined as the fraction of testing pointswhose survival time falls in the predicted confidence band.Besides, we calculate the average interval length of the re-turned band. Given confidence level α, a perfect confidenceband is expected to return empirical coverage larger than1− α with a small interval length. In the real-world dataset,we use surrogate empirical coverage (SEC) which is theupper bound of EC. We defer its formal definition in thesupplementary materials.

6.2. Analysis

We summarize the experimental results on RRNLNPH inTable 1 and defer the results on SUPPORT and METABRICto the supplementary materials due to space limitations.


Table 1. Performance of Different Models on RRNLNPH.Method Total Censored Uncensored Interval Length

Mean Std. Mean Std. Mean Std. Mean Std.Cox Reg. 0.832 0.008 / / / / 22.08 0.23

Random Survival Forest (Ishwaran et al., 2008) 0.948 0.006 / / / / 16.39 0.22Nnet-Survival (Gensheimer & Narasimhan, 2019) 0.834 0.007 0.560 0.016 0.982 0.005 20.11 0.37

MTLR (Yu et al., 2011) 0.830 0.008 0.554 0.017 0.980 0.003 19.85 0.34CoxPH (Katzman et al., 2018) 0.829 0.008 0.554 0.016 0.978 0.004 19,65 0.21CoxCC (Kvamme et al., 2019) 0.830 0.008 0.556 0.016 0.975 0.003 20.17 0.24

CoxPH+WCCI 0.912 0.03 0.854 0.047 0.949 0.028 21.59 0.90CoxPH+T-SCI 0.974 0.009 0.947 0.018 0.994 0.005 29.85 0.68CoxCC+WCCI 0.919 0.03 0.862 0.043 0.955 0.026 21.55 1.27CoxCC+T-SCI 0.974 0.009 0.946 0.017 0.995 0.004 29.62 0.57

CoxPH+WCCI(unweigted) 0.907 0.020 0.830 0.049 0.949 0.006 22.06 1.27CoxPH+T-SCI(unweighted) 0.950 0.018 0.875 0.049 0.990 0.012 27.72 3.13CoxCC+WCCI(unweigted) 0.941 0.029 0.815 0.029 0.948 0.009 22.57 0.70CoxCC+T-SCI(unweighted) 0.955 0.020 0.877 0.048 0.992 0.007 28.92 1.04

Kernel (Chen, 2020) 0.951 0.024 0.858 0.093 0.993 0.014 51.63 32.92

Coverage and Interval Length. Figure 6 shows the em-pirical coverage and the interval length of the confidenceband returned by algorithms under different confidence lev-els. Ideally, the confidence band should have large coveragewith a small interval length. Notice that WCCI and T-SCIbased algorithms outperform their original versions, show-ing that the proposed algorithms work well. Furthermore,T-SCI is more conservative than WCCI, where T-SCI haslarger coverage and interval length. We further remark thatT-SCI returns guaranteed coverage (larger than 1−α) underdifferent confidence levels.

The rationality of weight w(x). We show the comparisonbetween weighted and unweighted versions in Figure 4. ForWCCI algorithms, the weighted version has coverage closerto 1 − α. For T-SCI algorithms, the weighted version haslower variance. These results show that weighted versionsoutperform the unweighted versions, which validates theimportance of weight w(x).

The difficulty in censoring. Figure 5 shows the algorithms’performance on censored and uncensored data, respectively.All the algorithms perform larger coverage and less varianceon the uncensored data, meaning that censored data is morechallenging to deal with than uncensored data. Besides, weemphasize that although the unweighted versions may becloser to 95% in some cases, they lack theoretical guaran-tees and are imbalanced, meaning that it performs prettydifferently on censored and uncensored data.

Analysis of Table 1. We show RRNLNPH results in Ta-ble 1, and results of SUPPORT and METABRIC performsimilarly. Firstly, notice that WCCI does not reach the ex-pected coverage mainly due to the inaccurate estimation onthe weight, while T-SCI reaches it due to milder require-ments in Theorem 5.1 (double robustness). Secondly, notice

that Cox Reg., CoxPH, CoxCC (they all lack theoreticalguarantee) all fail to return the proper coverage. Thirdly,unweighted versions all suffer from poor performances oncensored data due to the lack of weight, showing that weightis vital in covariate shift. Finally, we emphasize that theKernel method suffers from large interval length and largevariance despite the moderate coverage. As a comparison,WCCI and T-SCI often perform more stably.

7. ConclusionIn this paper, we derive confidence band for Cox-basedmodels. We first introduce WCCI by proposing a new non-conformity score. We then propose T-SCI, a two-stageconformal inference applying WCCI as input. Theoreticalanalysis shows that T-SCI returns nearly perfect coverage,meaning both lower and upper bound guarantee. We conductextensive experiments on both synthetic data and real-worlddata to show the proposed algorithm’s correctness.


0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95Confidence Level

0.6

0.7

0.8

0.9

1.0

Cove

rage

AlgorithmCoxPH+WCCICoxPH+T-SCICoxCC+WCCICoxCC+T-SCICoxPHCoxCC

(a) Coverage Comparison(weighted)


0.5

0.6

0.7

0.8

0.9

Cove

rage

AlgorithmCoxPH+WCCI (unweighted)CoxPH+T-SCI (unweighted)CoxCC+WCCI (unweighted)CoxCC+T-SCI (unweighted)

(b) Coverage Comparison(unweighted)


10

15

20

25

30

Leng

ht


(c) Interval Length Comparison(weighted)


5

10

15

20

25

30

Leng

ht


(d) Interval Length Comparison(unweighted)

Figure 6. Comparison under different confidence level (1− α). CoxCC+T-SCI (purple) and CoxPH+T-SCI (grey) returns guaranteedcoverage under different confidence level without much increase of interval length (CoxCC+T-SCI and CoxPH+T-SCI overlap).


ReferencesAkritas, M. G., Murphy, S. A., and Lavalley, M. P. The

theil-sen estimator with doubly censored data and appli-cations to astronomy. Journal of the American StatisticalAssociation, 90(429):170–177, 1995.

Barber, R. F., Candes, E. J., Ramdas, A., and Tibshirani,R. J. Conformal prediction under covariate shift. arXivpreprint arXiv:1904.06019, 2019a.

Barber, R. F., Candes, E. J., Ramdas, A., and Tibshirani,R. J. The limits of distribution-free conditional predictiveinference. arXiv preprint arXiv:1903.04684, 2019b.

Bellotti, T. and Crook, J. Credit scoring with macroeco-nomic variables using survival analysis. Journal of theOperational Research Society, 60(12):1699–1707, 2009.

Berrett, T. B., Wang, Y., Barber, R. F., and Samworth, R. J.The conditional permutation test for independence whilecontrolling for confounders. Journal of the Royal Statis-tical Society: Series B (Statistical Methodology), 82(1):175–197, 2020.

Bostr, H., Asker, L., Gurung, R., Karlsson, I., Lindgren, T.,Papapetrou, P., et al. Conformal prediction using randomsurvival forests. In 2017 16th IEEE International Con-ference on Machine Learning and Applications (ICMLA),pp. 812–817. IEEE, 2017.

Bostrom, H., Johansson, U., and Vesterberg, A. Predictingwith confidence from survival data. In Conformal andProbabilistic Prediction and Applications, pp. 123–141,2019.

Breslow, N. E. Analysis of survival data under the pro-portional hazards model. International Statistical Re-view/Revue Internationale de Statistique, pp. 45–57,1975.

Candes, E. J., Lei, L., and Ren, Z. Conformalized survivalanalysis. arXiv preprint arXiv:2103.09763, 2021.

Chen, G. H. Deep kernel survival analysis and subject-specific survival time prediction intervals. In Ma-chine Learning for Healthcare Conference, pp. 537–565.PMLR, 2020.

Cox, D. R. Regression models and life-tables. Journal ofthe Royal Statistical Society: Series B (Methodological),34(2):187–202, 1972.

Feldmann, R. Leo-py: Estimating likelihoods for correlated,censored, and uncertain data with given marginal distri-butions. Astronomy and Computing, 29:100331, 2019.

Gensheimer, M. F. and Narasimhan, B. A scalable discrete-time survival model for neural networks. PeerJ, 7:e6257,2019.

Ishwaran, H., Kogalur, U. B., Blackstone, E. H., Lauer,M. S., et al. Random survival forests. Annals of AppliedStatistics, 2(3):841–860, 2008.

Kaplan, E. L. and Meier, P. Nonparametric estimation fromincomplete observations. Journal of the American statis-tical association, 53(282):457–481, 1958.

Katzman, J. L., Shaham, U., Cloninger, A., Bates, J., Jiang,T., and Kluger, Y. Deepsurv: personalized treatment rec-ommender system using a cox proportional hazards deepneural network. BMC medical research methodology, 18(1):1–12, 2018.

Klein, J. P. and Moeschberger, M. L. Survival analysis:techniques for censored and truncated data. SpringerScience & Business Media, 2006.

Kvamme, H., Borgan, Ø., and Scheel, I. Time-to-event pre-diction with neural networks and cox regression. Journalof machine learning research, 20(129):1–30, 2019.

Lee, C., Zame, W., Yoon, J., and van der Schaar, M. Deep-hit: A deep learning approach to survival analysis withcompeting risks. In Proceedings of the AAAI Conferenceon Artificial Intelligence, volume 32, 2018.

Lei, J. and Wasserman, L. Distribution-free prediction bandsfor non-parametric regression. Journal of the Royal Sta-tistical Society: Series B: Statistical Methodology, pp.71–96, 2014.

Lei, J., Robins, J., and Wasserman, L. Distribution-freeprediction sets. Journal of the American Statistical Asso-ciation, 108(501):278–287, 2013.

Lei, J., G’Sell, M., Rinaldo, A., Tibshirani, R. J., andWasserman, L. Distribution-free predictive inference forregression. Journal of the American Statistical Associa-tion, 113(523):1094–1111, 2018.

Lei, L. and Candes, E. J. Conformal inference of counter-factuals and individual treatment effects. arXiv preprintarXiv:2006.06138, 2020.

Li, A. H. and Bradic, J. Censored quantile regression forest.In International Conference on Artificial Intelligence andStatistics, pp. 2109–2119. PMLR, 2020.

Nakagawa, S. and Freckleton, R. P. Missing inaction: thedangers of ignoring missing data. Trends in ecology &evolution, 23(11):592–596, 2008.

Nouretdinov, I., Costafreda, S. G., Gammerman, A., Chervo-nenkis, A., Vovk, V., Vapnik, V., and Fu, C. H. Machinelearning classification with confidence: application oftransductive conformal predictors to mri-based diagnos-tic and prognostic markers in depression. Neuroimage,56(2):809–813, 2011.


Robins, J. M. and Finkelstein, D. M. Correcting for noncom-pliance and dependent censoring in an aids clinical trialwith inverse probability of censoring weighted (ipcw)log-rank tests. Biometrics, 56(3):779–788, 2000.

Romano, Y., Patterson, E., and Candes, E. J. Conformalizedquantile regression. arXiv preprint arXiv:1905.03222,2019.

Romano, Y., Sesia, M., and Candes, E. J. Classifica-tion with valid and adaptive coverage. arXiv preprintarXiv:2006.02544, 2020.

Sadinle, M., Lei, J., and Wasserman, L. Least ambiguousset-valued classifiers with bounded error levels. Journalof the American Statistical Association, 114(525):223–234, 2019.

Sargent, D. J. Comparison of artificial neural networks withother statistical approaches: results from medical datasets. Cancer: Interdisciplinary International Journal ofthe American Cancer Society, 91(S8):1636–1642, 2001.

Shafer, G. and Vovk, V. A tutorial on conformal prediction.Journal of Machine Learning Research, 9(Mar):371–421,2008.

Tibshirani, R. and Foygel, R. Conformal prediction undercovariate shift. Advances in neural information process-ing systems, 2019.

Vovk, V., Gammerman, A., and Shafer, G. Algorithmiclearning in a random world. Springer Science & BusinessMedia, 2005.

Wang, P., Li, Y., and Reddy, C. K. Machine learning forsurvival analysis: A survey. ACM Computing Surveys(CSUR), 51(6):1–36, 2019.

Wei, L.-J. The accelerated failure time model: a useful al-ternative to the cox regression model in survival analysis.Statistics in medicine, 11(14-15):1871–1879, 1992.

Xiang, A., Lapuerta, P., Ryutov, A., Buckley, J., and Azen,S. Comparison of the performance of neural networkmethods and cox regression for censored survival data.Computational statistics & data analysis, 34(2):243–257,2000.

Yu, C.-N., Greiner, R., Lin, H.-C., and Baracos, V. Learn-ing patient-specific cancer survival distributions as a se-quence of dependent regressors. Advances in NeuralInformation Processing Systems, 24:1845–1853, 2011.

Zhu, R. and Kosorok, M. R. Recursively imputed survivaltrees. Journal of the American Statistical Association,107(497):331–340, 2012.


Supplementary MaterialsWe complement the omitted proofs in Section A. We then provide the additional experimental results in Section B.Furthermore, in Section C, we give supplementary notes for some statements in the main text.

A. ProofsA.1. Proof of Theorem 4.1

Theorem 4.1 (Provable Guarantee) Let w(x) be an estimate of the weight w(x). Assume that E[w(X)|Ztr] = 1 andE[w(X)] = 1. Denote C1

n(x) as the output band of Algorithm 1 with n calibration samples, then for a new data X ′, itscorresponding survival time satisfies

limn→∞

P(T ′ ∈ C1

n(X ′))≥ 1− α− 1

2E|w(X)− w(X)|,

where the probability P on the left hand side is taken over (X ′, T ′) ∼ PX × PT |X , and all the expectation operators E aretaken over X ∼ PX|∆=1.

Firstly, consider a new sample(X ′, T ′

)generated from PX × PT |X , where we assume that

dPX (x) = w (x) dPX|∆=1 (x) .

As a comparison, due to the definition of w (x), we have

dPX (x) = w (x) dPX|∆=1 (x) .

We remark that PX (x), PX (x) are indeed distribution since we assume Ew (X) = Ew (X) = 1, where the expectation istaken over PX|∆=1.

We would first prove that the probability T ′ falls in the derived confidence interval C1n (X ′) is larger than 1− α, where α is

the given significance level. The intuition is that, since the derived C1n (X ′) is derived based on the estimated weight w (x),

the sample T ′ is then guaranteed to fall in the confidence interval with probability at least 1− α.

We derive that

P(T ′ ∈ C1

n

(X ′)|Ztr

)=P(T ′ ≤ Tu

(X ′)|Ztr

)=P(V(X ′, T ′

)≤ V

(X ′, Tu

(X ′)|Ztr

))≥P

(V(X ′, T ′

)≤ Quantile

(1− α;

n∑i=1

piδvi + p∞δ∞

)|Ztr

),

(8)

where the last inequality is due to the fact that V(X ′, Tu

(X ′)≥ Quantile

)(1− α;

∑ni=1 piδvi + p∞δ∞). We also use

the non-decreasing property of V (X,T ) on T. Furthermore, by Lemma 3, we can replace the δ∞ in the quantile term byδV (X′,T ′), therefore,

P

(V(X ′, T ′

)≤ Quantile

(1− α;

n∑i=1

piδvi + p∞δ∞

)|Ztr

)

=P

(V(X ′, T ′

)≤ Quantile

(1− α;

n∑i=1

piδvi + p∞δV (X′,T ′)

)|Ztr

).

(9)

Besides, we know from Lemma 4 that

V(X ′, T ′

)|Ztr, E (V ) ∼

n∑i=1

piδvi + p∞δV (X′,T ′),


therefore, we derive that

P

(V(X ′, T ′

)≤ Quantile

(1− α;

n∑i=1


)|Ztr

)

=EEP

(V(X ′, T ′

)≤ Quantile

(1− α;

n∑i=1


)|Ztr, E (V )

)≥1− α.

(10)

Combining the Equation 8, Equation 9 and Equation 10 leads to

P(T ′ ∈ C1

n

(X ′)|Ztr

)≥ 1− α. (11)

Furthermore, we need to transform the above results into random sample (X ′, T ′) ∼ dPX × dPT |X . This directly followsLemma 5 that∣∣∣P(T ′ ∈ C1

n (X ′) |Ztr)− P

(T ′ ∈ C1

n (X ′) |Ztr)∣∣∣ ≤ dTV (PX × PT |X , PX × PT |X) = dTV

(PX , PX

).

We can express the total-variation distance between QX and QX as

dTV

(PX , PX

)=

1

2

∫|w (X) dPX|∆=1 (X)− w (X) dPX|∆=1 (x) | = 1

2EX∼PX|∆=1

|w (X)− w (X) |.

Combine the above results and take expectations on the training set, we conclude that

P(T ′ ∈ C1

n (X ′))

=EZtrP(T ′ ∈ C1

n (X ′) |Ztr)

≥1− α− 1

2E|w (X)− w (X) |.

(12)

where the expectation is taken over the training set space and X ∼ PX|∆=1

Technical Lemmas. In this part, we give some technical lemmas used in the proof. We first introduce Lemma 3 which iscommonly used in conformal inference. By Lemma 3, we can use δ∞ to replace δV (X′,T ′) without changing the probability.

Lemma 3 (Equation(2) in Lemma 1 from Barber et al. (2019a).) For random variables vi ∈ R, i ∈ [n + 1], let pi ∈R, i ∈ [n+ 1] be the corresponding weights summing to 1. Then for any β ∈ [0, 1], we have

V(X ′, T ′

)≤ Quantile

(β,

n∑i=1

piδvi + p∞δ∞

)⇐⇒ V

(X ′, T ′

)≤ Quantile

(β,

n∑i=1


). (13)

We next introduce Lemma 4 which provides the distribution of the non-conformity score.

Lemma 4 (Equation(A.5) from Lei & Candes (2020).)(V(X ′, T ′

)|E (V ) = E (V ∗) ,Ztr

)∼

n∑i=1

piδv∗i + p∞δV (X′,T ′), (14)

where E (V ∗) is the unordered set of V ∗ =(v∗1 , v

∗2 , . . . , v

∗n, V

(X ′, T ′

)).

Thirdly, we introduce Lemma 5 which shows a basic property of the total variance distance.

Lemma 5 (Equation(10) from Berrett et al. (2020).) Let dTV (Q1X , Q2X) denote the total-variation distance betweenQ1X and Q2X , then

dTV(Q1X × PT |X , Q2X × PT |X

)= dTV (Q1X , Q2X) . (15)


A.2. Proof of Theorem 5.1

Theorem 5.1 (Lower Bound) Let w(x) be an estimate of the weight w(x), qαlo(x), qαhi

(x) be the quantile estimatorreturned by WCCI, and H(X) be defined as Equation 7. Assume that E[w(X)|Ztr] = 1 and E[w(X)] = 1, where all theexpectation operators E are taken over X ∼ PX|∆=1. Denote C2

n(x) as the output band of Algorithm 2 with n calibrationsamples, and denote X ′ as the testing point.


limn→∞

P(T ′ ∈ C2

n(X ′))≥ 1− α− 1

2M1.

From the quantile perspective, under assumptions (B1-B3):B1. H(X) ≤M2 a.s. w.r.t. X;B2. There exists δ > 0 such that Ew(X)1+δ <∞;B3. There exists γ, b1, b2 > 0 such that P(T = t|X = x) ∈ [b1, b2] uniformly over all (x,t) with t ∈ [qαlo

(x) − 2M2 −2γ, qαlo

(x) + 2M2 + 2γ] ∪ [qαhi(x)− 2M2 − 2γ, qαhi

(x) + 2M2 + 2γ],we have:

limn→∞

P(T ′ ∈ C2

n(X′))≥ 1− α− b2(2M2 + γ)− 16M2

(M2 + γ)2b1.

The proof under (A1) directly follows the proof of Theorem 4.1. Firstly, for a new testing point (X ′, T ′) ∼ PX ×PT |X , wehave

P(T ′ ∈ C2

n (X ′) |X ′)

=P (T ′ ∈ [qαlo(X ′)− η, qαhi

(X ′) + η]|X ′)=P (max{T ′ − qαhi

(X ′) , qαlo(X ′)− T ′} ≤ η|X)

i≥P (max{T ′ − qαhi

(X ′) , qαlo(X ′)− T ′} ≤ η −H (X ′) |X ′)

ii≥P (max{T ′ − qαhi

(X ′) , qαlo(X ′)− T ′} ≤ −ε−H (X ′) |X ′)− P (η < −ε)

iii≥P (max{T ′ − qαhi

(X ′) , qαlo(X ′)− T ′} ≤ −ε−H (X ′) I (H (X ′) ≤ ε) |X ′)− I (H (X ′) > ε)− P (η < −ε) .

(16)

Equation (i) follows from Lemma 6, and Equation (ii) follows from:

P (max{T ′ − qαhi(X ′) , qαlo

(X ′)− T ′} ≤ −ε−H (X ′) |X ′)− P (max{T ′ − qαhi(X ′) , qαlo

(X ′)− T ′} ≤ η −H (X ′) |X ′)≤P (η −H (X ′) < max{T ′ − qαhi

(X ′) , qαlo(X ′)− T ′} ≤ −ε−H (X ′) |X ′)

≤P (η −H (X ′) < −ε−H (X ′) |X ′)=P (η < −ε) .

Equation (iii) can be derived simply based on the discussion on the value of I (H (X ′) ≤ ε).

By Assumption (B3), since −ε−H (X ′) I (H (X ′) ≤ ε) ≤ −2ε, when ε ≤M2 + γ, we have


(X ′)− T ′} ≤ −ε−H (X ′) I (H (X ′) ≤ ε) |X ′)≥P (max{T ′ − qαhi

(X ′) , qαlo(X ′)− T ′} ≤ 0|X ′)− b2 (ε+H (X ′) I (H (X ′) ≤ ε))

≥P (max{T ′ − qαhi(X ′) , qαlo

(X ′)− T ′} ≤ 0|X ′)− b2 (ε+H (X ′))

≥1− α− b2 (ε+H (X ′)) .

(17)

Combining Eqn 16 with Eqn 17 and taking expectations over X ′, we have:

P(T ′ ∈ C2

n (X ′))

≥1− α− b2 (ε+ EH (X ′))− P (H (X ′) > ε)− P (η < −ε)(i)

≥1− α− b2 (2M2 + γ)− P (η < − (M2 + γ)) .

(18)


The Equation (i) holds by taking ε = M2 + γ. Due to Assumption (B1), we have EH (X ′) ≤M2 and P (H (X ′) > ε) = 0.Note that ε = M2 + γ does not break the condition of Equation 17.

We next show that limn→∞ P (η < − (M2 + γ)) ≤ 16M2

(M2+γ)2b1. Firstly, by Lemma 7, we have

limn→∞

P (η < − (M2 + γ)) ≤ 16

(M2 + γ)2b1E[w (X)H (X)] ≤ 16M2

(M2 + γ)2b1E[w (X)] =

16M2

(M2 + γ)2b1.

The inequality is due to Assumption (B1) and E[w (X)] = 1.

The proof is done.

Technical Lemmas. In this part, we give some technical lemmas used in the proof. The following Lemma 6 shows thatthe difference between non-conformity score under true quantile q and non-conformity score under estimated quantile q isupper bounded by H (X ′).

Lemma 6 Under notations in Theorem 5.1, we have |max{T ′ − qαhi(X ′) , qαlo

(X ′)− T ′} −max{T ′ − qαhi(X ′) , qαlo

(X ′)− T ′}| ≤H (X ′)

Proof. We investigate the results via situations on the operator max.

If max{T ′ − qαhi(X ′) , qαlo

(X ′)− T ′} = T ′ − qαhi(X ′) and max{T ′ − qαhi

(X ′) , qαlo(X ′)− T ′} = T ′ − qαhi

(X ′),the conclusion follows the definition of H (X ′).

If max{T ′ − qαhi(X ′) , qαlo

(X ′)− T ′} = T ′ − qαhi(X ′) and max{T ′ − qαhi

(X ′) , qαlo(X ′)− T ′} = qαlo

(X ′)− T ′,we have

max{T ′ − qαhi(X ′) , qαlo


(X ′)− T ′}= (T ′ − qαhi

(X ′))− (qαlo(X ′)− T ′)

≤ (T ′ − qαhi(X ′))− (T ′ − qαhi

(X ′))

=qαhi(X ′)− qαhi

(X ′)

≤H (X ′) .

Similarly,

max{T ′ − qαhi(X ′) , qαlo


(X ′)− T ′}= (T ′ − qαhi

(X ′))− (qαlo(X ′)− T ′)

≥ (qαlo(X ′)− T ′)− (qαlo

(X ′)− T ′)=qαlo

(X ′)− qαlo(X ′)

≥−H (X ′) .

Therefore, we have

|max{T ′ − qαhi(X ′) , qαlo


(X ′)− T ′}| ≤ H (X ′) .

The other two situations are derived similarly.

We next introduce Lemma 7 which provides the upper bounds of the term limn→∞ P (η < −ε).

Lemma 7 Under the assumptions in Theorem 5.1, the following inequality holds.

limn→∞

P (η < −ε) ≤ 16

ε2b1E[w (X)H (X)].

Proof. By combining Equation (A.10), (A.11), (A.12), (A.13) in Lei & Candes (2020), and apply the Assumption (B2) wehave

limn→∞

P (η < −ε) ≤ P

(n∑i=1

w (Xi)H (Xi) ≥ε2b1n

16

).


And by Markov inequality, we have

P

(n∑i=1

w (Xi)H (Xi) ≥ε2b1n

16

)≤ 16

ε2b1n

n∑i=1

E[w (Xi)H (Xi)] =16

ε2b1E[w (X)H (X)].

The proof is done.

A.3. Proof of Theorem 5.2

Theorem 5.1 (Lower Bound) Let w(x) be an estimate of the weight w(x), qαlo(x), qαhi

(x) be the quantile estimatorreturned by WCCI, and H(X) be defined as Equation 7. Assume that E[w(X)|Ztr] = 1 and E[w(X)] = 1, where all theexpectation operators E are taken over X ∼ PX|∆=1. Denote C2

n(x) as the output band of Algorithm 2 with n calibrationsamples, and denote X ′ as the testing point.


limn→∞

P(T ′ ∈ C2

n(X ′))≥ 1− α− 1

2M1.

From the quantile perspective, under assumptions (B1-B3):B1. H(X) ≤M2 a.s. w.r.t. X;B2. There exists δ > 0 such that Ew(X)1+δ <∞;B3. There exists γ, b1, b2 > 0 such that P(T = t|X = x) ∈ [b1, b2] uniformly over all (x,t) with t ∈ [qαlo

(x) − 2M2 −2γ, qαlo

(x) + 2M2 + 2γ] ∪ [qαhi(x)− 2M2 − 2γ, qαhi

(x) + 2M2 + 2γ],we have:

limn→∞

P(T ′ ∈ C2

n(X′))≥ 1− α− b2(2M2 + γ)− 16M2

(M2 + γ)2b1.

We now show the proof of Theorem 5.2. For a new testing point (X ′, T ′) ∼ PX × PT |X , we have

P(T ′ ∈ C2

n (X ′) |X ′)

=P (T ′ ∈ [qαlo(X ′)− η, qαhi

(X ′) + η]|X ′)=P (max{T ′ − qαhi

(X ′) , qαlo(X ′)− T ′} ≤ η|X)

i≤P (max{T ′ − qαhi

(X ′) , qαlo(X ′)− T ′} ≤ η +H (X ′) |X ′)

ii≤P (max{T ′ − qαhi

(X ′) , qαlo(X ′)− T ′} ≤ ηq,w + ε+H (X ′) |X ′) + P (η − ηq,w > ε)

iii≤P (max{T ′ − qαhi

(X ′) , qαlo(X ′)− T ′} ≤ ηq,w + ε+H (X ′) I (H (X ′) ≤ ε) |X ′) + I (H (X ′) > ε) + P (η − ηq,w > ε) .

(19)

where we denote ηq,w = Quantile(1− α;

∑ni=1 piδV ∗i + p∞δV ∗∞

), and obviously, ηq,w = 0 by the definition of the

non-conformity score, where V ∗i is calculated based on qαlo(·) and qαhi

(·).

Equation (i) follows from Lemma 6, and Equation (ii) follows from:


(X ′)− T ′} ≤ η +H (X ′) |X ′)− P (max{T ′ − qαhi(X ′) , qαlo

(X ′)− T ′} ≤ ηq,w + ε+H (X ′) |X ′)≤P (ηq,w + ε+H (X ′) < max{T ′ − qαhi

(X ′) , qαlo(X ′)− T ′} ≤ η +H (X ′) |X ′)

≤P (ηq,w + ε+H (X ′) < η +H (X ′) |X ′)=P (η − ηq,w > ε) .

Equation (iii) can be derived simply based on the discussion on the value of I (H (X ′) ≤ ε).


By Assumption (C4), since ε+H (X ′) I (H (X ′) ≤ ε) ≤ 2ε, when ε ≤M2 + γ, we have

limn→∞


(X ′)− T ′} ≤ ηq,w + ε+H (X ′) I (H (X ′) ≤ ε) |X ′)

≤ limn→∞


(X ′)− T ′} ≤ ηq,w|X ′) + b2 (ε+H (X ′) I (H (X ′) ≤ ε))

≤ limn→∞


(X ′)− T ′} ≤ 0|X ′) + b2 (ε+H (X ′))

≤1− α+ b2 (ε+H (X ′)) .

(20)

The last inequality is from Lemma 2. Combining Eqn 19 with Eqn 20 and taking expectations over X ′, we have:

limn→∞

P(T ′ ∈ C2

n (X ′))

≤1− α+ b2 (ε+ EH (X ′)) + P (H (X ′) > ε) + limn→∞

P (η − ηq,w > ε) .

By taking ε = M ′2 +M ′1/K, and plugging in Assumption (B1), the above equation is equivalent to

limn→∞

P(T ′ ∈ C2

n (X ′))≤ 1− α+ b2 (2M ′2 +M ′1/K) + lim

n→∞P (η − ηq,w > M ′2 +M ′1/K) . (21)

The left is to show thatlimn→∞

P (η − ηq,w > M ′2 +M ′1/K) = 0. (22)

First we denote ηq = Quantile(1− α;

∑ni=1 piδV ∗i + p∞δV ∗∞

)where pi is the estimator of pi. Equation 22 holds by

applying Lemma 8 and Lemma 9.

The proof is done.

Technical Lemmas. In this part, we give some technical lemmas used in the proof. We first prove the following Lemma 8which gives an upper bound of |η − ηq|.

Lemma 8 Under assumptions of Theorem 5.2, |η − ηq| ≤M ′2

Proof. Notice that η is the quantile of distribution∑ni=1 piδVi

+ p∞δV∞ and ηq is the quantile of distribution∑ni=1 piδV ∗i +

p∞δV ∗∞ , where Vi is calculated based on qαhi, qαlo

.

When η = ηq , the conclusion directly follows. When η 6= ηq , notice that the two distribution share the same weight w, theremust exist a V ′ and V ′′ such that

V ′∗ ≤ η, V ′ ≥ ηq

V ′′∗ ≥ η, V ′′ ≤ ηq.

This leads to the following two inequalities by Assumption (C2) and Lemma 7.

η − ηq ≥ V ′∗ − V ′ ≥ −H (X) ≥ −M ′2

η − ηq ≤ V ′′∗ − V ′′ ≤ H (X) ≤M ′2

(23)

Therefore,|η − ηq| ≤M ′2.

The proof is done.

We next prove Lemma 9 which provides an upper bound of |ηq − ηq,w|.

Lemma 9 Under assumptions of Theorem 5.2, |ηq − ηq,w| ≤M ′1/K

Proof. WLOG, assume ηq,w − ηq = ∆ > 0. Denote F (·) =∑ni=1 piδV ∗i + p∞δV ∗∞ and G (·) =

∑ni=1 piδV ∗i + p∞δV ∗∞ .

By the definition of ηq, ηq,w, we haveF (ηq,w) = G (ηq) = 1− α.


Therefore, on the one hand, by Assumption (C1) and Assumption Ew (X) = Ew (X) = 1

F (ηq,w)− F (ηq)

=G (ηq)− F (ηq)

≤ suptG (ηq)− F (ηq)

≤ supS|∑i∈S

w (Xi)− w (xI) |

≤M ′1.

(24)

On the other hand, by Assumption (C3)

F (ηq,w)− F (ηq)

=F (ηq + ∆)− F (ηq)

≥K∆.

(25)

Combining Equation 24 and Equation 25 leads to

ηq,w − ηq ≤M ′1/K.

The conclusion directly follows.

Combining Lemma 8 and Lemma 9 leads to the upper bound of |η − ηq,w|.

B. Supplemental Experiment ResultsAll codes are available at https://github.com/thutzr/Cox.

B.1. Experiment Process

Data Pre-processing. There are both numerical features and categorical features in our datasets. We normalize numericalfeatures on both training set and test set.

Process. In each run, the experiment runs as follows:

• We first randomly split 80% data as the training set while the rest is splitted randomly into calibration set and test set.

• Following (Chen, 2020), we randomly sample 100 data points, which are used to construct prediction intervals, fromtest set. Denote these points as Xcenters.

• For each point x0 ∈ Xcenters:

– We sample 100 data points in test set with respect to sampling probability proportional to K(x, x0), where K(·)is the Gaussian kernel.

– For these 100 test points, we use algorithm 1 and 2 to calculate the predicted survival interval with respect to thegiven confidence level α and check if the true survival time of each point is included in the predicted interval.The fraction of points that are covered in the calculated confidence interval is the empirical coverage. And thedifference of upper confidence band and its lower counterpart is interval length. In our experiments, the upperinterval band is likely to be infinity sometimes. We truncate those upper bands to the maximum duration of theaccording dataset.

We run the above procedure for different confidence level αs. Results show that our algorithms are robust and effective fordifferent αs. In our experiments, α is chosen to be 0.6, 0.7, 0.8, 0.9, 0.95,respectively.

https://github.com/thutzr/Cox


Table 2. Model Comparison on SUPPORTMethod Total Censored Uncensored Interval Length



MTLR (Yu et al., 2011) 0.994 0.003 0.986 0.006 0.998 0.002 1921.65 37.69CoxPH (Katzman et al., 2018) 0.981 0.002 0.992 0.004 0.975 0.002 2029.00 0.10CoxCC (Katzman et al., 2018) 0.981 0.002 0.992 0.004 0.975 0.003 2553.32 9.43



Kernel (Chen, 2020) 0.988 0.020 0.936 0.178 0.988 0.038 2027.70 30.27


0.5

0.6

0.7

0.8

0.9

1.0

Cove

rage

censoreduncensored

(a) SUPPORT


0.800

0.825

0.850

0.875

0.900

0.925

0.950

0.975

1.000

Cove

rage censored

uncensored

(b) METABRIC

Figure 7. Empirical Coverage of Censored and Uncensored Data

B.2. Model and Hyperparameters

We use pycox and PyTorch to implement CoxCC,CoxPH and neural network model respectively. Then we combine themtogether to Cox-MLP models. We implement a neural network model with three hidden layers, where each layer has 32hidden nodes. Between each two layers, we use ReLU as the activation function. We apply batch normalization and dropoutwhich drop 10% nodes at one epoch. Adam is chosen to be the optimizer. In the training process, we feed 80% data astraining data and the rest as validation data. Note that here the total data set (training data + validation data) is the trainingdata mentioned in the main article. Each batch contains 128 data points. We train the network for 512 epochs and the trainedmodel is used as the MLP part in our Cox-MLP model.

B.3. Supplemental Results

The empirical coverage and predicted interval length of different models are listed in Table 2 and Table 3. The results aresimilar to that of RRNLNPH. It should be noticed that unweighted WCCI has very poor performance on censored data inSUPPORT. Both Kernel and unweighted models variate a lot on length of prediction interval.

Prediction on censored data and uncensored data are compared in Figure 7. Performance on censored data are much betterthan that of uncensored data. The standard deviation is also larger on uncensored data than censored data. This is consistentwith our intuition as we do not have exact information of censored data.


Table 3. Model Comparison on METABRICMethod Total Censored Uncensored Interval Length



MTLR (Yu et al., 2011) 0.990 0.005 0.990 0.008 0.990 0.005 306.08 6.44CoxPH (Katzman et al., 2018) 0.995 0.003 0.994 0.005 0.995 0.005 340.20 6.85CoxCC (Katzman et al., 2018) 0.996 0.004 0.994 0.005 0.998 0.004 344.26 6.93



Kernel (Chen, 2020) 0.981 0.025 0.997 0.010 0.971 0.038 337.66 20.39


0.74

0.78

0.82

0.86

0.9

0.940.95

0.98

Cove

rage

weightedunweighted

(a) SUPPORT


0.88

0.9

0.92

0.94

0.95

0.96

0.98

Cove

rage

weightedunweighted

(b) METABRIC

Figure 8. Empirical Coverage of Weighted and Unweighted Models


Comparison on weighted and unweighted models are shown in Figure 8. Performance of unweighted models are muchweaker than weighted models. This shows the effectiveness of weighted conformal inference.

We also compare performance on different α’s on both SUPPORT and METABRIC. Results similar with RRNLNPH areshown in Figure 9, and 10. The empirical coverage of our algorithms exceed the given confidence level and the standarddeviation of prediction is relatively low.


250

500

750

1000

1250

1500

1750

2000

Leng

ht


(a) SUPPORT


0

250

500

750

1000

1250

1500

1750

Leng

ht


(b) SUPPORT


200

220

240

260

280

300

320

340

Leng

ht


(c) METABRIC


100

125

150

175

200

225

250

275

Leng

ht


(d) METABRIC

Figure 10. Different Model’s Empirical Coverage of Different α

Notice that in Table 2 and Table 3, censored coverage in CoxPH and CoxCC performs well (around 0.992 in Table 2 and0.004 in Table 3). However, we emphasize that any coverage type could happen due to a lack of theoretical guarantee(extremely large, extremely small, or highly unbalanced, etc.). The censored coverage of CoxPH happens to be large (0.992)in Table 2 (dataset: SUPPORT) and Table 3 (dataset: METABRIC), just like it happens to be small in Table 1 (0.554, dataset:



0.60

0.65

0.70

0.75

0.80

0.85

0.90

0.95

1.00

Cove

rage


(a) SUPPORT


0.4

0.5

0.6

0.7

0.8

0.9

Cove

rage


(b) SUPPORT


0.60

0.65

0.70

0.75

0.80

0.85

0.90

0.95

1.00

Cove

rage


(c) METABRIC


0.5

0.6

0.7

0.8

0.9

Cove

rage


(d) METABRIC

Figure 9. Different Model’s Empirical Coverage of Different α(SUPPORT)


RRNLNPH). As a comparison, our newly proposed method (T-SCI) is guaranteed to return a nearly perfect guarantee (allcensored coverages are larger than 0.90, and the whole coverages are larger than 0.95).

C. Supplementary NotesWe make some supplementary notes in this section.

C.1. Stability of Non-conformity Score

We state in Section 4 that the non-conformity score is usually more stable when it is single-peak. A multi-peak situationimplies that samples with different covariate may have different coverage, namely,

P(T ∈ Cn(X)|X = x1) 6= P(T ∈ Cn(X)|X = x2),

where x1, x2 belong to different peaks. Therefore, we describe the above formula as “unstable” since it provides distinctcoverage for distinct groups, although the overall coverage (for the population) is still 1− α.

The multi-peak phenomenon comes from the fact that we calculate the 1 − α quantile of Vi based on all samples. Forexample, consider a two-peak distribution where the first group has 1 − α populations. Then the algorithm returns zerocoverage (probability equal to zero) for the second group and returns one coverage (probability equal to one) for the firstgroup, which causes instability.

C.2. The Effect of Censoring

Figure 1 illustrates that ignoring and deleting censoring will indeed cause bias and inefficiency. We may also consider amore straightforward case to calculate the sample mean of a dataset. However, the dataset contains censoring issues, wherewe clip the data to a constant C when the value is larger than C. If we ignore the censoring phenomenon, the new samplemean is smaller than the expected value, leading to bias. If we delete the censoring phenomenon, the new sample meanis also smaller since we delete samples with large values (those deleted samples are always larger than the constant C).Besides, it leads to inefficiency since we use fewer samples during the inference.

C.3. Strong Ignorability Assumption

Strong ignorability assumption stands forT ⊥∆ | X.

Note that strong ignorability assumption directly leads to the fact:

PT |X,∆ = PT |X .

since PT |X,∆ = PT,∆|X/P∆|X = PT |X . Therefore, we can apply weighted conformal inference which requires a covariateshift. A similar idea could be found in Lei & Candes (2020).

C.4. Surrogate Empirical Coverage

In the experiment part, we use empirical coverage to evaluate the algorithm on a synthetic dataset, defined as the fraction oftesting points whose survival time falls in the predicted confidence band.

In the real-world dataset, we calculate surrogate empirical coverage (SEC) instead of empirical coverage due to thelack of survival time. SEC calculates the number when the censoring time is no larger than the band’s upper bound forcensored data. We emphasize that SEC is the upper bound of EC. For example, denote the returned confidence band asC(Xi) = [T l(Xi), T

u(Xi)] and denote Itest as the training set index, its empirical coverage is defined as:

EC =1

|Itest|∑

i∈Itest

I(Ti ∈ C(Xi))

its surrogate empirical coverage is defined as:

SEC =1

|Itest|∑

i∈Itest

(I(Yi ∈ C(Xi),∆i = 1) + I(Yi ≤ Tu(Xi),∆i = 0)

)

abstract arxiv:2103.04556v2 [cs.ds] 14 apr 2021

Documents