

Pairwise Fairness for Ranking and Regression

Harikrishna Narasimhan, Andrew Cotter, Maya Gupta, Serena Wang
Google Research

1600 Amphitheatre Pkwy, Mountain View, CA 94043
{hnarasimhan, acotter, mayagupta, serenawang}@google.com

Abstract

We present pairwise fairness metrics for ranking models and regression models that form analogues of statistical fairness notions such as equal opportunity, equal accuracy, and statistical parity. Our pairwise formulation supports both discrete protected groups, and continuous protected attributes. We show that the resulting training problems can be efficiently and effectively solved using existing constrained optimization and robust optimization techniques developed for fair classification. Experiments illustrate the broad applicability and trade-offs of these methods.

Introduction

As ranking models and regression models become more prevalent and have a greater impact on people's day-to-day lives, it is important that we develop better tools to quantify, measure, track, and improve fairness metrics for such models. A key question for ranking and regression is how to define fairness metrics. As in the binary classification setting, we believe there is not one "right" fairness definition: instead, we provide a paradigm that makes it easy to define and train for different fairness definitions, analogous to those that are popular for binary classification problems.

One key distinction is between unsupervised and supervised fairness metrics: for example, consider the task of ranking restaurants for college students who prefer cheaper restaurants, and suppose we wish to be fair to French vs. Mexican restaurants. Our proposed unsupervised statistical parity constraint would require that the model be equally likely to (i) rank a French restaurant above a Mexican restaurant, and (ii) rank a Mexican restaurant above a French restaurant. In contrast, our proposed supervised equal opportunity constraint would require that the model be equally likely to (i) rank a cheap French restaurant above an expensive Mexican restaurant, and (ii) rank a cheap Mexican restaurant above an expensive French restaurant.

Like some recent work on fair ranking [Beutel et al., 2019, Kallus and Zhou, 2019], we draw inspiration from the standard learning-to-rank strategy [Liu, 2011]: we reduce the ranking problem to that of learning a binary classifier to predict the relative ordering of pairs of examples. This reduction of ranking to binary classification enables us to formulate a broad set of statistical fairness metrics, inspired by analogues in the binary classification setting, in terms of pairwise comparisons. The same general idea can be applied more broadly: in addition to group-based fairness in the ranking setting, we show that the same overall approach can also be applied to (i) the regression setting, or (ii) the use of continuous protected attributes instead of discrete groups. In all three of these cases, we show how to effectively train ranking models or regression models to satisfy the proposed fairness metrics, by applying state-of-the-art constrained optimization algorithms.

Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Ranking Pairwise Fairness Metrics

We begin by considering a standard ranking set-up [Liu, 2011]: we're given a sample S of queries drawn i.i.d. from an underlying distribution D, where each query is a set of candidates to be ranked, and each candidate is represented by an associated feature vector x ∈ X and label y ∈ Y. The label space can be, for example, Y = {0, 1} (e.g. for click data: y = 1 if a result was clicked by a user, y = 0 otherwise), Y = R (each result has an associated quality rating), or Y = N (the labels are a ground truth ranking). We adopt the convention that higher labels should be ranked closer to the top. Any of these choices of label space Y induce a partial ordering on examples, for which all candidates belonging to the same query are totally ordered, and any two candidates (x, y) and (x′, y′) belonging to different queries are incomparable.

Suppose that we have a set of K protected groups G_1, . . . , G_K partitioning the space of examples X × Y such that every example belongs to exactly one group. We define the group-dependent pairwise accuracy A_{G_i>G_j} as the accuracy of a ranking function f : X → R on those pairs for which the labeled "better" example belongs to group G_i, and the labeled "worse" example belongs to group G_j. That is:

A_{G_i>G_j} := P(f(x) > f(x′) | y > y′, (x, y) ∈ G_i, (x′, y′) ∈ G_j),   (1)

where (x, y) and (x′, y′) are drawn i.i.d. from the distribution of examples, restricted to the appropriate protected groups.

Notice that this definition implicitly forces us to construct pairs only from examples belonging to the same query, since y and y′ are not comparable if they belong to different queries; however, the probability is taken over all such pairs, across all queries. Given K groups, one can compute the K × K matrix of all possible K² group-dependent pairwise accuracies. One can also measure how each group performs on average:

A_{G_i>:} := P(f(x) > f(x′) | y > y′, (x, y) ∈ G_i)   (2)
A_{:>G_i} := P(f(x) > f(x′) | y > y′, (x′, y′) ∈ G_i).   (3)

The accuracy in (2) is averaged over all pairs for which the G_i example was labeled as "better," and the "worse" example is from any group, including G_i. Similarly, (3) is the accuracy averaged over all pairs where the G_i example should not have been preferred. Lastly, the overall pairwise accuracy P(f(x) > f(x′) | y > y′) is simply the standard AUC. Next, we use the pairwise accuracies to define supervised pairwise fairness goals and unsupervised fairness notions.
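To make these definitions concrete, the following sketch (our own illustration, not the paper's released code; all names are hypothetical) estimates the matrix of pairwise accuracies A_{G_i>G_j} and the row marginals A_{G_i>:} for a single scored query by brute-force enumeration of the labeled pairs.

```python
# Sketch: estimate A_{Gi>Gj} and the row marginals A_{Gi>:} for one query.
# Assumes scores f(x), labels y, and an integer group id per candidate.
import numpy as np

def pairwise_accuracy_matrix(scores, labels, groups, num_groups):
    """Returns (A, counts), where A[i, j] estimates A_{Gi>Gj} on this query."""
    correct = np.zeros((num_groups, num_groups))
    counts = np.zeros((num_groups, num_groups))
    for a in range(len(scores)):
        for b in range(len(scores)):
            if labels[a] > labels[b]:        # candidate a is the labeled "better" one
                i, j = groups[a], groups[b]
                counts[i, j] += 1
                correct[i, j] += float(scores[a] > scores[b])
    with np.errstate(invalid="ignore"):      # cells with no pairs stay NaN
        return correct / counts, counts

scores = np.array([2.1, 0.3, 1.7, -0.5])
labels = np.array([1, 0, 1, 0])
groups = np.array([0, 0, 1, 1])
A, counts = pairwise_accuracy_matrix(scores, labels, groups, num_groups=2)
row_marginals = np.nansum(A * counts, axis=1) / counts.sum(axis=1)   # A_{Gi>:}
```

Over a full sample, the per-query numerators and denominators would be accumulated before dividing, so that the probability in (1) is taken over all pairs across all queries.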

Pairwise Equal Opportunity

We construct a pairwise equal opportunity analogue of the equal opportunity metric [Hardt et al., 2016]:

A_{G_i>G_j} = κ, for some κ ∈ [0, 1], for all i, j.   (4)

Equal opportunity for binary classifiers [Hardt et al., 2016] requires positively-labeled examples to be equally likely to be predicted positively regardless of protected group membership. Similarly, this pairwise equal opportunity for ranking problems requires pairs to be equally likely to be ranked correctly regardless of the protected group membership of both members of the pair. By symmetry, we could equally well consider A_{G_i>G_j} to be a true positive rate or a true negative rate, so there is no distinction between "equal opportunity" and "equal odds" in the ranking setting, when all of the pairwise accuracies are constrained equivalently.

Pairwise equal opportunity can be relaxed either by requiring all pairwise accuracies (i) to only be within some quantity of each other (e.g. max_{i≠j} A_{G_i>G_j} − min_{i≠j} A_{G_i>G_j} ≤ 0.1), or (ii) only requiring the minimum pairwise accuracy A_{G_i>G_j} to be as big as possible (i.e. maximize min_{i≠j} A_{G_i>G_j}), in the style of robust optimization [e.g. Chen et al., 2017]. We will later show how models can be efficiently trained subject to both these types of pairwise fairness constraints using existing algorithms.
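For example, given an estimated accuracy matrix A such as the one produced by the sketch above (the numbers below are hypothetical), the two relaxations amount to checking the spread of the cross-group accuracies or their minimum:

```python
import numpy as np

# Hypothetical 3-group pairwise accuracy matrix with A[i, j] = A_{Gi>Gj}.
A = np.array([[0.82, 0.75, 0.79],
              [0.71, 0.80, 0.74],
              [0.77, 0.73, 0.81]])
cross = A[~np.eye(3, dtype=bool)]      # the i != j entries
gap = cross.max() - cross.min()        # relaxation (i): constrain gap <= 0.1
worst = cross.min()                    # relaxation (ii): maximize this minimum
```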

Within-Group vs. Cross-Group Comparison

We have observed that labels for within-group comparisons (i = j) are sometimes more accurate and consistent across raters, whereas labels for cross-group comparisons (i ≠ j) can be noisier and less consistent. This especially arises when the labels come from experts who are more comfortable with rating candidates from certain groups. For example, consider a video ranking system where group i is sports videos and group j is cooking shows. If our experts can choose which videos they rate (as in most consumer recommendation systems with feedback), sports experts are likely to rate sports videos and do so accurately, cooking experts are likely to rate cooking shows and do so accurately, but on average we may not get as accurate ratings on pairs with a sports and a cooking video.

Thus one may wish to separately constrain cross-group pairwise equal opportunity:

A_{G_i>G_j} = κ, for some κ ∈ [0, 1], for all i ≠ j,   (5)

and within-group pairwise equal accuracy:

A_{G_i>G_i} = κ′, for some κ′ ∈ [0, 1], for all i.   (6)

In certain applications, particularly those in which cross-group comparisons are rare or do not occur, we might want to constrain only pairwise equal accuracy (6). For example, we might want a music ranking system to be equally accurate at ranking jazz as it is at ranking country music, but avoid trying to constrain cross-group ranking accuracy because we may not have confidence in cross-group ratings.

Marginal Equal Opportunity

The previous pairwise equal opportunity proposals are defined in terms of the K² group-dependent pairwise accuracies. This may be too fine-grained, either for statistical significance reasons, or because the fine-grained constraints might be infeasible. To address this, we propose a looser marginal pairwise equal opportunity criterion that asks for parity for each group averaged over the other groups:

A_{G_i>:} = κ, for some κ ∈ [0, 1], for i = 1, . . . , K.   (7)

Statistical Parity

Our pairwise setup can also be used to define unsupervised fairness metrics. For any i ≠ j, we define pairwise statistical parity as:

P(f(x) > f(x′) | (x, y) ∈ G_i, (x′, y′) ∈ G_j) = κ.   (8)

A pairwise statistical parity constraint requires that if two candidates are compared from different groups, then on average each group has an equal chance of being top-ranked. This constraint completely ignores the training labels, but that may be useful when groups are so different that any comparison is too apples-to-oranges to be legitimate, or if raters are not expert enough to make useful cross-group comparisons.

Regression Pairwise Fairness Metrics

Consider the standard regression setting in which f : X → Y attempts to predict a regression label for each example. For most of the following proposed regression fairness metrics, we treat higher scores as more (or less) desirable, and we seek to control how often each group gets higher scores. This asymmetric perspective is applicable if the scores confer a benefit, such as regression models that estimate credit scores or admission to college, or if the model scores dictate a penalty to be avoided, such as getting stopped by police. This asymmetry assumption that getting higher scores is either preferred (or not preferred) is analogous to the binary classification case where a positive label is assumed to confer some benefit.

We again propose defining metrics on pairs of examples. This is not a ranking problem, so there are no queries; instead, given a training set of N examples, we compute pairwise metrics over all N² pairs. One can sample a random subset of pairs if N² is too large.
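One simple way to do the subsampling, as a sketch under our own assumptions (the paper does not prescribe a particular sampling scheme), is to draw index pairs uniformly:

```python
import numpy as np

def sample_pairs(num_examples, num_pairs, seed=0):
    """Uniformly sample index pairs (a, b), a != b, from the N^2 ordered pairs."""
    rng = np.random.default_rng(seed)
    a = rng.integers(0, num_examples, size=num_pairs)
    b = rng.integers(0, num_examples, size=num_pairs)
    keep = a != b                        # drop self-pairs
    return a[keep], b[keep]

left_idx, right_idx = sample_pairs(num_examples=50_000, num_pairs=200_000)
```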


Regression Equal Opportunity

One can compute and constrain the pairwise equal opportunity metrics as in (4), (5), (6) and (7) for regression models. For example, restricting (5) constrains the model to be equally likely, for all groups G_i and G_j, to assign a higher score to group i examples over group j examples, if the group i example's label is higher.

Regression Equal Accuracy

Promoting pairwise equal accuracy as in (6) for regression requires that, for every group, the model should be equally faithful to the pairwise ranking of any two within-group examples. This is especially useful if the regression labels of different groups originate from different communities, and have different labeling distributions. For example, suppose that all jazz music examples are rated by jazz lovers who only give 4-5 star ratings, but all classical music examples are rated by critics who give a range of 1-5 star ratings, with 5 being rare. Simply minimizing MSE alone might cause the model training to over-focus on the classical music score examples, since the classical errors are likely to be larger and hence affect the MSE more.

Regression Statistical Parity

For regression, the pairwise statistical parity condition described in (8) requires: "Given two randomly drawn examples from two different groups, they are equally likely to have the higher score." One sufficient condition to guarantee pairwise statistical parity is to require the distribution of outputs f(X) for a random input X to be the same for each of the protected groups. This condition can be enforced approximately by histogram matching the output distributions for different protected groups [e.g. Agarwal et al., 2019].
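As a rough illustration of that sufficient condition (ours, not the procedure of Agarwal et al. [2019]), one could post-process each group's scores onto the empirical quantiles of a common reference distribution:

```python
import numpy as np

def histogram_match(scores, groups, reference_scores):
    """Map each group's scores onto the empirical quantiles of reference_scores,
    so the output distribution is (approximately) the same for every group."""
    ref = np.sort(reference_scores)
    matched = np.empty(len(scores), dtype=float)
    for g in np.unique(groups):
        idx = np.where(groups == g)[0]
        ranks = np.argsort(np.argsort(scores[idx]))          # 0..n_g-1 within group
        matched[idx] = np.quantile(ref, (ranks + 0.5) / len(idx))
    return matched
```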

Regression Symmetric Equal Accuracy

For regression problems where each group's goal is to be accurate (rather than to score high or low), one can define symmetric pairwise fairness metrics as well: for example, the symmetric pair accuracy for group G_i is A_{G_i>:} + A_{:>G_i}, and one might constrain these accuracies to be the same across groups.

Continuous Protected Features

Suppose we have a continuous or ordered protected feature Z; e.g. we may wish to constrain for fairness with respect to age, income, seniority, etc. The proposed pairwise fairness notions extend nicely to this setting by constructing the pairs based on the ordering of the protected feature, rather than on protected group membership. Specifically, we change (1) to the following continuous attribute pairwise accuracies:

A> := P (f(x) > f(x′) | y > y′, z > z′), (9)

A< := P (f(x) > f(x′) | y > y′, z < z′), (10)

where z is the protected feature value for (x, y) and z′ is the protected feature value for (x′, y′). For example, if the protected feature Z measures height, then A> measures the accuracy of the model when comparing pairs where the candidate who is taller should receive a higher score.

The previously proposed pairwise fairness constraints for discrete groups have analogous definitions in this setting by replacing (1) with (9). Pairwise equal opportunity becomes

A> = A<. (11)

This requires, for example, that the model be equally accurate when the taller or shorter candidate should be higher ranked.[1]
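The sketch below (ours; inputs and names are hypothetical) estimates A> and A< of (9)–(10) by restricting the labeled pairs according to the ordering of the protected value z; in the ranking setting this would be computed per query and averaged, as before.

```python
import numpy as np

def continuous_pairwise_accuracies(scores, labels, z):
    """Estimate A> and A< from (9)-(10) over all comparable pairs."""
    better = labels[:, None] > labels[None, :]        # y > y'
    ranked_right = scores[:, None] > scores[None, :]  # f(x) > f(x')
    z_gt = z[:, None] > z[None, :]
    z_lt = z[:, None] < z[None, :]
    a_gt = (ranked_right & better & z_gt).sum() / max((better & z_gt).sum(), 1)
    a_lt = (ranked_right & better & z_lt).sum() / max((better & z_lt).sum(), 1)
    return a_gt, a_lt
```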

Training for Pairwise Fairness

We show how one can use the pairwise fairness definitions to specify a training objective, and how to optimize these objectives. We formulate the training problem for ranking and cross-group equal opportunity, but the formulation and algorithms can be applied to any of the pairwise metrics.

Proposed Formulations  Let A_{G_i>G_j}(f) be defined by (1) for a ranking model f : X → R. Let AUC(f) be the overall pairwise accuracy. Let F be the class of models we are interested in. We formulate training with fairness goals as a constrained optimization with an allowed slack ε:

max_{f∈F} AUC(f)
s.t. A_{G_i>G_j}(f) − A_{G_k>G_l}(f) ≤ ε   ∀ i ≠ j, k ≠ l.   (12)

Or one can pose the robust optimization problem:

max_{f∈F, ξ} ξ
s.t. ξ ≤ AUC(f),  ξ ≤ A_{G_i>G_j}(f)   ∀ i ≠ j.   (13)

For regression problems, we replace AUC with MSE.
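To give a sense of how these quantities become trainable, the following sketch (our illustration; the experiments use the TensorFlow constrained optimization toolbox of Cotter et al. [2019a,b], not this code) replaces the indicator on each pair with a hinge surrogate, giving a concave lower bound on each pairwise accuracy for a linear model, and evaluates the robust objective (13); formulation (12) would instead keep the pairwise differences as explicit ≤ ε constraints.

```python
import numpy as np

def surrogate_pairwise_accuracy(w, x_better, x_worse):
    """Hinge-relaxed estimate of P(f(x) > f(x')) for a linear model f(x) = w.x.
    Since max(0, 1 - m) >= I{m <= 0}, this lower bounds the true accuracy."""
    margins = (x_better - x_worse).dot(w)
    return 1.0 - np.maximum(0.0, 1.0 - margins).mean()

def robust_objective(w, all_pairs, cross_group_pairs):
    """Surrogate version of (13): the minimum of the AUC surrogate and each
    cross-group pairwise accuracy surrogate."""
    vals = [surrogate_pairwise_accuracy(w, *all_pairs)]
    vals += [surrogate_pairwise_accuracy(w, *p) for p in cross_group_pairs]
    return min(vals)
```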

Optimization Algorithms  Both the constrained and robust optimization formulations can be written in terms of rate constraints [Goh et al., 2016] on score differences. For example, we can re-write each pairwise accuracy term as a positive prediction rate on a subset of pairs:

A_{G_i>G_j}(f) = E[ I{f(x) − f(x′) > 0} | ((x, y), (x′, y′)) ∈ S_{ij} ],

where I is the usual indicator function and S_{ij} = {((x, y), (x′, y′)) | y > y′, (x, y) ∈ G_i, (x′, y′) ∈ G_j}. This enables us to adopt algorithms for binary fairness constraints to solve the optimization problems in (12) and (13).

In fact, all of the objective and constraint functions that we have considered can be handled out-of-the-box by the proxy-Lagrangian framework of Cotter et al. [2019a,b]. Like other constrained optimization approaches [Agarwal et al., 2018, Kearns et al., 2018], this framework learns a stochastic model that is supported on a finite set of functions in F. The high-level idea is to set up a min-max game, where one player minimizes over the model parameters, and the other player maximizes over a weighting λ on the constraint functions. Cotter et al. [2019a] use a no-regret optimization strategy for minimization over the model parameters, and a swap-regret optimization strategy for maximization over λ, with the indicators I replaced with hinge-based surrogates for the first player only. They prove that, under certain assumptions, their optimizers converge to a stochastic model that satisfies the specified constraints in expectation. In the Appendix, we present more details about the optimization approach and re-state their theoretical result for our setting.

[1] Similar to the pairwise ranking metrics, A< is the true negative rate for pairs (x, y), (x′, y′) where z > z′, and by symmetry, A< is also equal to the true positive rate for pairs where z < z′:

A< = P(f(x) < f(x′) | y < y′, z > z′) = P(f(x) > f(x′) | y > y′, z < z′) = TPR_{z<z′}.

Therefore, (11) equates both the TPR and the TNR for both sets of pairs, and specifies both equalized odds and equal opportunity.
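As a deliberately simplified stand-in for that procedure (plain gradient updates with non-negative multipliers rather than the no-regret/swap-regret scheme of Cotter et al. [2019a]; all names and step sizes are ours), the following sketch trains a linear model under the two constraints A_{G_0>G_1} − A_{G_1>G_0} ≤ ε and A_{G_1>G_0} − A_{G_0>G_1} ≤ ε.

```python
import numpy as np

def hinge_acc(w, x_better, x_worse):
    return 1.0 - np.maximum(0.0, 1.0 - (x_better - x_worse).dot(w)).mean()

def hinge_acc_grad(w, x_better, x_worse):
    d = x_better - x_worse
    active = (d.dot(w) < 1.0).astype(float)[:, None]
    return (active * d).mean(axis=0)              # gradient of hinge_acc w.r.t. w

def train(pairs_all, pairs_01, pairs_10, eps=0.01, lr_w=0.1, lr_lam=0.5, steps=500):
    w = np.zeros(pairs_all[0].shape[1])
    lam = np.zeros(2)                             # one multiplier per constraint
    snapshots = []
    for _ in range(steps):
        # Model player: ascend surrogate AUC minus the weighted constraint surrogates.
        g01 = hinge_acc_grad(w, *pairs_01) - hinge_acc_grad(w, *pairs_10)
        g = hinge_acc_grad(w, *pairs_all) - (lam[0] - lam[1]) * g01
        w += lr_w * g
        # Multiplier player: ascend on the constraint violations, project to >= 0.
        # (We reuse the hinge surrogates here for simplicity; the proxy-Lagrangian
        # framework evaluates the multipliers' payoff with the original indicators.)
        d01 = hinge_acc(w, *pairs_01) - hinge_acc(w, *pairs_10)
        lam = np.maximum(0.0, lam + lr_lam * np.array([d01 - eps, -d01 - eps]))
        snapshots.append(w.copy())
    return snapshots   # a stochastic model can place equal mass on these iterates
```

Here pairs_all, pairs_01 and pairs_10 are (x_better, x_worse) feature arrays for, respectively, all labeled pairs, the pairs whose better example is in G_0 and worse example in G_1, and the reverse.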

Related Work

We review related work that we build upon in fair classification, and then related work on the problems addressed here: fair ranking, fair regression, and handling continuous protected attributes.

Fair Classification  Many statistical fairness metrics for binary classification can be written in terms of rate constraints, that is, constraints on the classifier's positive (or negative) prediction rate for different groups [Goh et al., 2016, Narasimhan, 2018, Cotter et al., 2019a,b]. For example, the goal of demographic parity [Dwork et al., 2012] is to ensure that the classifier's positive prediction rate is the same across all protected groups. Similarly, the equal opportunity metric [Hardt et al., 2016] requires that true positive rates should be equal across all protected groups. Many other statistical fairness metrics can be expressed in terms of rates, e.g. equal accuracy, no worse off and no lost benefits [Cotter et al., 2019b]. Constraints on these fairness metrics can be added to the training objective for a binary classifier, then solved using constrained optimization algorithms or relaxations thereof [Goh et al., 2016, Zafar et al., 2017, Donini et al., 2018, Agarwal et al., 2018, Cotter et al., 2019a,b]. Here, we extend this work to train ranking models and regression models with pairwise fairness constraints.

Fair Ranking  A majority of the previous work on fair ranking has focused on list-wise definitions for fairness that depend on the entire list of results for a given query [e.g. Zehlike et al., 2017, Celis et al., 2018, Biega et al., 2018, Singh and Joachims, 2018, Zehlike and Castillo, 2018, Singh and Joachims, 2019]. These include both unsupervised criteria that require the average exposure near the top of the ranked list to be equal for different groups [e.g. Singh and Joachims, 2018, Celis et al., 2018, Zehlike and Castillo, 2018], and supervised criteria that require the average exposure for a group to be proportional to the average relevance of that group's results to the query [Biega et al., 2018, Singh and Joachims, 2018, 2019]. Of these, some provide post-processing algorithms for re-ranking a given ranking [Biega et al., 2018, Celis et al., 2018, Singh and Joachims, 2018, 2019], while others, like us, learn a ranking model from scratch [Zehlike and Castillo, 2018, Singh and Joachims, 2019].

Pairwise Fairness  Beutel et al. [2019] propose ranking pairwise fairness definitions equivalent to those we give in (1), (2) and (3). Their work focuses on ranking and on categorical protected groups, whereas we generalize these ideas to capture a wider variety of different statistical fairness notions, and generalize to regression and continuous protected features.

The training methodology is also very different. Beutel et al. [2019] propose adding a fixed regularization term to the training objective that measures the correlation between the residual between a clicked and unclicked item and the group memberships of the items. In contrast, we enable explicitly specifying any desired pairwise fairness constraints, and then directly enforce the desired pairwise fairness criterion using constrained optimization. Their approach is parameter-free, but only because it does not give the user any way to control the trade-off between fairness vs. accuracy.

Second, Beutel et al. consider only two protected groups, whereas we enable the user to constrain any number of groups, with the constrained optimization algorithm automatically determining how much each group must be penalized in order to satisfy the fairness constraints. A straightforward extension of the fixed regularization approach of Beutel et al. to multiple groups would have no hyperparameters to specify how much to weight each group. One could introduce separate weighting hyperparameters to weight each group's penalty, but then they would need to be tuned manually. The approach we propose does this tuning automatically to achieve the desired fairness constraints.

Finally, there are major experimental differences to Beutel et al.: they provide an in-depth case study of one real-world recommendation problem, whereas we provide a broad set of experiments on public and real-world data illustrating the effectiveness on both ranking and regression problems, for categorical or continuous protected attributes.

In other recent works, Kallus and Zhou [2019] provide pairwise fairness metrics based on AUC for bipartite ranking problems, and Kuhlman et al. [2019] provide metrics analogous to our pairwise equal opportunity and statistical parity criteria for ranking. Both these works only consider categorical groups, whereas we also handle regression problems and continuous protected attributes. Kallus and Zhou [2019] additionally propose a post-processing optimization approach to optimize the specified metrics by fitting a monotone transformation to an existing ranking model. In contrast, we provide a more flexible approach that enables optimizing the entire model by including explicit constraints during training.

Pinned AUC  Pinned AUC is a fairness metric introduced by Dixon et al. [2018]. With two protected groups, pinned AUC works by resampling the data such that each of the two groups makes up 50% of the data, and then calculating the ROC AUC on the resampled dataset. Based on the well-known equivalence between ROC AUC and average pairwise accuracy, Borkan et al. [2019] demonstrate that pinned AUC, as well as their proposed weighted pinned AUC metric, can be decomposed as a linear combination of within-group and cross-group pairwise accuracies. In other words, both pinned AUC and weighted pinned AUC can be written as linear combinations of the different pairwise accuracies A_{G_i>G_j} in (1). In our experiments, we compare against (a version of) the sampling-based approach of Dixon et al. [2018].


Fair Regression  Defining fairness metrics in a regression setting is a challenging task, and has been studied for many years in the context of standardized testing [e.g. Hunter and Schmidt, 1976]. Komiyama et al. [2018] consider the unfairness of a regressor in terms of the correlation between the output and a protected attribute. Pérez-Suay et al. [2017] regularize to minimize the Hilbert-Schmidt independence between the protected features and model output. These definitions have the "flavor" of statistical parity, in that they attempt to remove information about the protected feature from the model's predictions. Here, we focus more on supervised fairness notions.

Berk et al. [2017] propose regularizing linear regression models for the notion of fairness corresponding to the principle that similar individuals receive similar outcomes [Dwork et al., 2012]. Their definitions focus on enforcing similar squared error, which fundamentally differs from our definitions in that we assume each group would prefer higher scores, not necessarily more accurate scores.

Agarwal et al. [2019] propose a bounded group loss definition which requires that the regression error be within an allowable limit for each group. In contrast, our pairwise equal opportunity definitions for regression do not rely on a specific regression loss, but instead are based on the ordering induced by the regression model within and across groups.

Continuous Protected Features  Most prior work in machine learning fairness has assumed categorical protected groups, in some cases extending those tools to continuous features by bucketing [Kearns et al., 2018]. Fine-grained buckets raise statistical significance challenges, and coarse-grained buckets may raise unfairness issues due to how the lines between bins are drawn, and the lack of distinctions made between elements within each bin. Raff et al. [2018] considered continuous protected features in their tree-growing criterion that addresses fairness. Kearns et al. [2018] focused on statistical parity-type constraints for continuous protected features for classification. Komiyama et al. [2018] controlled the correlation of the model output with protected variables (which may be continuous). Mary et al. [2019] propose a fairness criterion for continuous attributes based on the Rényi maximum correlation coefficient. Counterfactual fairness [Kusner et al., 2017, Pearl et al., 2016] requires that changing a protected attribute, while holding causally unrelated attributes constant, should not change the model output distribution, but this does not directly address issues with ranking fairness.

Experiments

We illustrate our proposals on five ranking problems and two regression problems. We implement the constrained and robust optimization methods using the open-source TensorFlow constrained optimization toolbox of Cotter et al. [2019a,b]. The datasets used are split randomly into training, validation and test sets in the ratio 1/2:1/4:1/4, with the validation set used to tune the relevant hyperparameters. For datasets with queries, we evaluate all metrics for individual queries and report the average across queries. For stochastic models, we report expectations over random draws of the scoring function f from the stochastic model.[2]

Pairwise Fairness for Ranking

We detail the comparisons and ranking problems.

Comparisons  We compare against: (1) an adaptation of the debiasing scheme of Dixon et al. [2018] that optimizes a weighted pairwise accuracy, with the weights chosen to balance the relative label proportions within each group; (2) the recent non-pairwise ranking fairness approach by Singh and Joachims [2018] that re-ranks the scores of an unconstrained ranking model to satisfy a disparate impact constraint; (3) the post-processing pairwise fairness method of Kallus and Zhou [2019] that fits a monotone transform to an unconstrained model; and (4) the fixed regularization pairwise approach of Beutel et al. [2019] that, like us, incorporates the fairness goal into the model training. See the Appendix for more details.

Simulated Ranking Data  For this toy ranking task with two features, there are 5,000 queries, and each query has 11 candidates. For each query, we uniformly randomly pick one of the 11 candidates to have a positive label y = +1 and the other 10 candidates receive a negative label y = −1, and we randomly assign each candidate's protected attribute z i.i.d. from a Bernoulli(0.1) distribution. Then we generate two features, simulated to score how well the candidate matches the query, from a Gaussian distribution N(µ_{y,z}, Σ_{y,z}), where µ_{−1,0} = [−1, 1], µ_{−1,1} = [−2, −1], µ_{+1,0} = [1, 0], µ_{+1,1} = [−1.5, 0.75], Σ_{−1,0} = Σ_{−1,1} = Σ_{+1,0} = I_2 and Σ_{+1,1} = 0.5 I_2.
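The following sketch (our reading of the setup above; the seed and names are arbitrary) generates this data:

```python
import numpy as np

rng = np.random.default_rng(0)
MU = {(-1, 0): [-1.0, 1.0], (-1, 1): [-2.0, -1.0],
      (+1, 0): [1.0, 0.0],  (+1, 1): [-1.5, 0.75]}
SIGMA = {(-1, 0): np.eye(2), (-1, 1): np.eye(2),
         (+1, 0): np.eye(2), (+1, 1): 0.5 * np.eye(2)}

def make_query(num_candidates=11):
    labels = -np.ones(num_candidates)
    labels[rng.integers(num_candidates)] = +1        # one positive per query
    z = rng.binomial(1, 0.1, size=num_candidates)    # protected attribute
    feats = np.stack([rng.multivariate_normal(MU[(y, g)], SIGMA[(y, g)])
                      for y, g in zip(labels.astype(int), z)])
    return feats, labels, z

queries = [make_query() for _ in range(5000)]
```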

We train linear ranking functions f : R² → R and impose a cross-group equal opportunity goal with constrained optimization by constraining |A_{0>1} − A_{1>0}| ≤ 0.01. For the robust optimization, we implement this goal by maximizing min{A_{0>1}, A_{1>0}, AUC}. We also train an unconstrained model that optimizes AUC. Table 1 gives the test ranking accuracy and the test pairwise fairness violations, measured as |A_{0>1} − A_{1>0}|. Only constrained optimization achieves the fairness goal, with robust optimization coming a close second. Figure 1(a)–(b) shows the 2 × 2 pairwise accuracy matrices. Constrained optimization satisfies the fairness constraint by lowering A_{0>1} and improving A_{1>0}.

We also generate a second dataset with 3 groups, where the first two groups follow the same distribution as groups 0 and 1 above, and the third group examples are drawn from a Gaussian distribution N(µ_{y,2}, Σ_{y,2}) where µ_{−1,2} = [−1, 1], µ_{+1,2} = [1.5, 0.5], and Σ_{−1,2} = Σ_{+1,2} = I_2. We use the same number of queries and candidates as above, and assign the protected attribute z to 0, 1, or 2 with probabilities 0.45, 0.1, and 0.45 respectively. We impose the marginal equal opportunity fairness goal on this dataset in two different ways: (i) constraining max_{i≠j} |A_{i>:} − A_{j>:}| ≤ 0.01 with constrained optimization, and (ii) optimizing min{AUC, A_{0>:}, A_{1>:}, A_{2>:}} with robust optimization. We show each group's row-marginal test accuracies in Figure 1(c)–(d). While robust optimization maximizes the minimum of the three marginals, constrained optimization yields a lower difference between the marginals (and does so at the cost of lower accuracies for the three groups). This is consistent with the two optimization problem set-ups: you get what you ask for.

[2] Code available at: https://github.com/google-research/google-research/tree/master/pairwise_fairness

Figure 1: Test pairwise accuracy matrices for Simulated ranking with 2 groups (a)–(b), row-based matrix averages A_{0>:}, A_{1>:} and A_{2>:} for Simulated ranking with 3 groups (c)–(d), and pairwise accuracy matrices for Wiki Talk Pages ranking (e)–(g). In each matrix, rows index the group of the positively-labeled example and columns the group of the negatively-labeled example.

(a) Sim./Unconstrained
      G0     G1
G0    0.941  0.980
G1    0.705  0.894

(b) Sim./Constrained
      G0     G1
G0    0.854  0.900
G1    0.907  0.927

(c) Sim./Constrained (3 groups, marginals)
G0: 0.713   G1: 0.703   G2: 0.729

(d) Sim./Robust (3 groups, marginals)
G0: 0.882   G1: 0.876   G2: 0.949

(e) Wiki/Unconstrained
        Other  Gay
Other   0.973  0.882
Gay     0.986  0.937

(f) Wiki/Debiased
        Other  Gay
Other   0.971  0.953
Gay     0.967  0.945

(g) Wiki/Constrained
        Other  Gay
Other   0.962  0.943
Gay     0.953  0.922

Table 1: Test AUC (higher is better) with test pairwise fairness violations (in parentheses). For fairness violations, we report |A_{G0>G1} − A_{G1>G0}| when imposing cross-group constraints, |A_{G0>:} − A_{G1>:}| for marginal constraints, and |A> − A<| for continuous protected attributes. Blue indicates strictly best between Beutel et al. [2019] and constrained optimization.

Data    Groups      Uncons.      Debiased     S & J        K & Z        B et al.     Constr.      Robust
Sim.    0/1         0.92 (0.28)  0.92 (0.28)  0.88 (0.14)  0.91 (0.12)  0.84 (0.04)  0.86 (0.01)  0.86 (0.02)
Busns.  C/NC        0.70 (0.06)  0.70 (0.06)  –            0.66 (0.00)  0.69 (0.05)  0.68 (0.00)  0.68 (0.07)
Wiki    Term 'Gay'  0.97 (0.10)  0.97 (0.01)  –            0.97 (0.04)  0.95 (0.01)  0.96 (0.01)  0.94 (0.02)
W3C     Gender      0.53 (0.96)  0.54 (0.90)  0.37 (0.85)  0.45 (0.65)  0.55 (0.09)  0.54 (0.10)  0.54 (0.14)
Crime   Race %      0.93 (0.18)  –            –            –            0.91 (0.10)  0.81 (0.04)  0.86 (0.04)

We provide further results and an additional experiment with an in-group equal opportunity criterion in the appendix.

Business Matching  This is a proprietary dataset from a large internet services company of ranked pairs of relevant and irrelevant businesses for different queries, for a total of 17,069 pairs. How well a query matches a candidate is represented by 41 features. We consider two protected groups, chain (C) businesses and non-chain (NC) businesses. We define a candidate as a member of the chain group if its query is seeking a chain business and the candidate is a chain business. We define a candidate as a member of the non-chain group if its query is not seeking a chain business and the candidate is a non-chain business. A candidate does not belong to either group if it is a chain and the query is non-chain-seeking, or vice-versa.

We experiment with imposing a marginal equal opportunity constraint: |A_{chain>:} − A_{non-chain>:}| ≤ 0.01. This requires the model to be roughly as accurate at correctly matching chains as it is at matching non-chains. With robust optimization, we maximize min{A_{chain>:}, A_{non-chain>:}, AUC}. All methods trained a two-layer neural network model with 10 hidden nodes. As seen in Table 1, compared to the unconstrained approach, constrained optimization yields very low fairness violation, while only being marginally worse on the test AUC. The post-processing approach of Kallus and Zhou [2019] also achieves a similar fairness metric, but with a lower AUC. Singh and Joachims [2018] failed to produce feasible solutions for this dataset, we believe because there were very few pairs per query.

Wiki Talk Page Comments  This public dataset contains 127,820 comments from Wikipedia Talk Pages labeled with whether or not they are toxic (i.e. contain "rude, disrespectful or unreasonable" content [Dixon et al., 2018]). This is a dataset where debiased weighting has been effective in learning fair, unbiased classification models [Dixon et al., 2018]. We consider the task of learning a ranking function that ranks comments that are labeled toxic higher than the comments that are labeled non-toxic, in order to help the model's users identify toxic comments. We consider the protected attribute defined by whether the term 'gay' appears in the comment. This is one of the many identity terms that Dixon et al. [2018] consider in their work. Among comments that have the term 'gay', 55% are labeled toxic, whereas among comments that do not have the term 'gay', only 9% are labeled toxic. We learn a convolutional neural network model with the same architecture used in Dixon et al. [2018].

We consider a cross-group equal opportunity criterion. We impose |A_{Other>Gay} − A_{Gay>Other}| ≤ 0.01 with constrained optimization and maximize min{A_{Other>Gay}, A_{Gay>Other}, AUC} with robust optimization. The results are shown in Table 1 and Figure 1(e)–(g). Among the cross-group errors, the unconstrained model is more likely to incorrectly rank a non-toxic comment with the term 'gay' over a toxic comment without the term. By balancing the label proportions, debiased weighting reduces the fairness violation considerably. Constrained optimization yields even lower fairness violation (0.010 vs. 0.014), but at the cost of a slightly lower test AUC. Singh and Joachims [2018] could not be applied to this dataset as it did not have the required query-candidate structure.

W3C Experts Search  We also evaluate our methods on the W3C Experts dataset, previously used to study disparate exposure in ranking [Zehlike and Castillo, 2018]. This is a subset of the TREC 2005 enterprise track data, and consists of 48 topics and 200 candidates per topic, with each candidate labeled as an expert or non-expert for the topic. The task is to rank the candidates based on their expertise on a topic, using a corpus of mailing lists from the World Wide Web Consortium (W3C). This is an application where the unconstrained algorithm does better for the minority protected group. We use the same features as Zehlike and Castillo [2018] to represent how well each topic matches each candidate; this includes a set of five aggregate features derived from word counts and tf-idf scores, and the gender protected attribute.

Table 2: Left: Regression test MSE (lower is better) and pairwise fairness violation (in parentheses), with blue indicating strictly best between the last two columns. Right: Test pairwise accuracy matrix for constrained optimization on the Law School dataset (rows: High, columns: Low).

Dataset  Prot. Group  Unconstrained  Beutel et al.  Constrained
Law      Gender       0.142 (0.30)   0.167 (0.06)   0.143 (0.02)
Crime    Race %       0.021 (0.33)   0.033 (0.02)   0.028 (0.03)

         Male   Female
Male     0.652  0.647
Female   0.666  0.655

For this task, we learn a linear model and impose a cross-group equal opportunity constraint: |A_{Female>Male} − A_{Male>Female}| ≤ 0.01. For robust optimization, we maximize min{A_{Female>Male}, A_{Male>Female}, AUC}. As seen in Table 1, the unconstrained ranking model incurs a huge fairness violation. This is because the unconstrained model treats gender as a strong signal of expertise, and often ranks female candidates over male candidates. Not only do the constrained and robust optimization methods achieve significantly lower fairness violations, they also happen to produce higher test metrics due to the constraints acting as regularizers and reducing overfitting. On this task, Beutel et al. [2019] achieves the lowest fairness violation and the highest AUC.

The method of Singh and Joachims [2018] performs poorly because the LP that it solves per query turns out to be infeasible for most queries in this dataset. Thus, to run this baseline, we extended their approach to have a per-query slack in their disparate impact constraints. This required a large slack for some queries, hurting the overall performance.

Communities and Crime (Continuous Groups)  We next handle a continuous protected attribute in a ranking problem. We use the Communities and Crime dataset from UCI [Dua and Graff, 2017], which contains 1,994 communities in the United States described by 140 features, and the per capita crime rate for each community. As in prior work [Cotter et al., 2019a], we label the communities with a crime rate above the 70th percentile as 'high crime' and the others as 'low crime', and consider the task of learning a ranking function that ranks high crime communities above the low crime communities. We treat the percentage of black population in a community as a continuous protected attribute.

We learn a linear ranking function, with the protected attribute included as a feature. We do not compare to debiasing, or to the methods of Singh and Joachims [2018] and Kallus and Zhou [2019], as they do not apply to continuous protected attributes. Adopting the continuous attribute equal opportunity criterion, we impose the constraint |A< − A>| ≤ 0.01. We extend Beutel et al. [2019] to optimize this pairwise metric. Table 1 shows the constrained and robust optimization methods reduce the fairness violation by more than half, at the cost of a lower test AUC.

Pairwise Fairness for Regression

We next present experiments on two regression problems. We extend the set-up of Beutel et al. [2019] to also handle our proposed regression pairwise metrics, and compare to that. We do not use robust optimization, as the squared error is not necessarily comparable with the regression pairwise metrics. The results are shown in Table 2.

Law School:  This dataset [Wightman, 1998] contains details of 27,234 law school students, and we predict the undergraduate GPA for a student from the student's LSAT score, family income, full-time status, race, gender and the law school cluster the student belongs to, with gender as the protected attribute. We impose a cross-group equal opportunity constraint: |A_{Female>Male} − A_{Male>Female}| ≤ 0.01. The constrained optimization approach substantially reduces the fairness violation compared to the unconstrained MSE-optimizing model, at only a small increase in MSE. It also performs strictly better than Beutel et al. [2019].

Communities and Crime:  This dataset has continuous labels for the per capita crime rate for a community. Once again, we treat the percentage of black population in a community as a continuous protected attribute and impose a continuous attribute equal opportunity constraint: |A> − A<| ≤ 0.01. The constrained approach yields a huge reduction in fairness violation, though at the cost of an increase in MSE.

Conclusions

We showed that pairwise fairness metrics can be intuitively defined to handle supervised and unsupervised notions of fairness, for ranking and regression, and for discrete and continuous protected attributes. We also showed how pairwise fairness metrics can be incorporated into training using state-of-the-art constrained optimization solvers.

Experimentally, the different methods compared often produced different trade-offs between AUC (or MSE) and fairness, making it hard to judge one as strictly better than others. However, we showed that the proposed constrained optimization approach is the most flexible and direct of the strategies considered, and is very effective for achieving low pairwise fairness violations. The closest competitor to our proposals is the approach of Beutel et al. [2019], but out of the 4 cases in which one of these two methods strictly performed better (indicated in blue), ours was the best in 3 of the 4. Lastly, Kallus and Zhou [2019], being a post-processing method, is more restricted, and did not perform as well as the proposed approach that directly trains a model from scratch.


The key way one specifies pairwise fairness metrics is by the selection of which pairs to consider. Here, we focused on within-group and cross-group candidates. One could also bring in side information or condition on other features. For example, in the ranking setting, we might have side information about the presentation order in which candidates for each query were shown to users when labeled, and this position information could be used to either select or weight candidate pairs. In the regression setting, we could assign weights to example pairs based on their label differences.

References

A. Agarwal, A. Beygelzimer, M. Dudík, J. Langford, and H. M. Wallach. A reductions approach to fair classification. In ICML, 2018.

A. Agarwal, M. Dudik, and Z. S. Wu. Fair regression: Quantitative definitions and reduction-based algorithms. In ICML, 2019.

R. Berk, H. Heidari, S. Jabbari, M. Joseph, M. Kearns, J. Morgenstern, S. Neel, and A. Roth. A convex framework for fair regression. FAT/ML, 2017.

A. Beutel, J. Chen, T. Doshi, H. Qian, L. Wei, Y. Wu, L. Heldt, Z. Zhao, L. Hong, E. H. Chi, and C. Goodrow. Fairness in recommendation through pairwise experiments. KDD, 2019.

A. J. Biega, K. P. Gummadi, and G. Weikum. Equity of attention: Amortizing individual fairness in rankings. In ACM SIGIR, 2018.

D. Borkan, L. Dixon, J. Sorensen, N. Thain, and L. Vasserman. Nuanced metrics for measuring unintended bias with real data for text classification. In WWW, 2019.

L. E. Celis, D. Straszak, and N. K. Vishnoi. Ranking with fairness constraints. In ICALP, 2018.

R. S. Chen, B. Lucier, Y. Singer, and V. Syrgkanis. Robust optimization for non-convex objectives. In NIPS, 2017.

A. Cotter, H. Jiang, and K. Sridharan. Two-player games for efficient non-convex constrained optimization. In ALT, 2019a.

A. Cotter, H. Jiang, S. Wang, T. Narayan, M. R. Gupta, S. You, and K. Sridharan. Optimization with non-differentiable constraints with applications to fairness, recall, churn, and other goals. JMLR, 2019b. [To appear].

L. Dixon, J. Li, J. Sorensen, N. Thain, and L. Vasserman. Measuring and mitigating unintended bias in text classification. In AIES, 2018.

M. Donini, L. Oneto, S. Ben-David, J. Shawe-Taylor, and M. Pontil. Empirical risk minimization under fairness constraints. NeurIPS, 2018.

D. Dua and C. Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.

C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel. Fairness through awareness. In ITCS, 2012.

G. Goh, A. Cotter, M. R. Gupta, and M. P. Friedlander. Satisfying real-world goals with dataset constraints. In NIPS, 2016.

M. Hardt, E. Price, and N. Srebro. Equality of opportunity in supervised learning. In NIPS, 2016.

J. E. Hunter and F. L. Schmidt. Critical analysis of the statistical and ethical implications of various definitions of test bias. Psychological Bulletin, 1976.

N. Kallus and A. Zhou. The fairness of risk scores beyond classification: Bipartite ranking and the xAUC metric. NeurIPS, 2019.

M. Kearns, S. Neel, A. Roth, and S. Wu. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. ICML, 2018.

J. Komiyama, A. Takeda, J. Honda, and H. Shimao. Nonconvex optimization for regression with fairness constraints. In ICML, 2018.

C. Kuhlman, M. VanValkenburg, and E. Rundensteiner. FARE: Diagnostics for fair ranking using pairwise error metrics. WWW, 2019.

M. Kusner, J. Loftus, C. Russell, and R. Silva. Counterfactual fairness. NIPS, 2017.

T.-Y. Liu. Learning to Rank for Information Retrieval. Springer, 2011.

J. Mary, C. Calauzènes, and N. E. Karoui. Fairness-aware learning for continuous attributes and treatments. In ICML, 2019.

H. Narasimhan. Learning with complex loss functions and constraints. In AISTATS, 2018.

J. Pearl, M. Glymour, and N. Jewell. Causal Inference in Statistics: A Primer. Wiley, 2016.

A. Pérez-Suay, V. Laparra, G. Mateo-García, J. Muñoz-Marí, L. Gómez-Chova, and G. Camps-Valls. Fair kernel learning. In ECML PKDD, 2017.

E. Raff, J. Sylvester, and S. Mills. Fair forests: Regularized tree induction to minimize model bias. AIES, 2018.

A. Singh and T. Joachims. Fairness of exposure in rankings. In KDD, 2018.

A. Singh and T. Joachims. Policy learning for fairness in ranking. NeurIPS, 2019.

L. Wightman. LSAC national longitudinal bar passage study. Law School Admission Council, 1998.

M. B. Zafar, I. Valera, M. G. Rodriguez, and K. P. Gummadi. Fairness constraints: Mechanisms for fair classification. In AISTATS, 2017.

M. Zehlike and C. Castillo. Reducing disparate exposure in ranking: A learning to rank approach. arXiv preprint arXiv:1805.08716, 2018.

M. Zehlike, F. Bonchi, C. Castillo, S. Hajian, M. Megahed, and R. A. Baeza-Yates. FA*IR: A fair top-k ranking algorithm. In CIKM, 2017.


Appendix

Table 3: Example group-dependent pairwise accuracy matrix for K = 3 groups, and the frequency-weighted row mean and column mean, for the restaurant ranking example in the introduction (rows: group of the cheap, labeled-better example; columns: group of the expensive, labeled-worse example).

Cheap \ Expensive   French   Mexican   Chinese   Row Mean
French              0.76     0.68      0.56      0.70
Mexican             0.61     0.62      0.55      0.60
Chinese             0.62     0.61      0.56      0.60
Col. Mean           0.69     0.65      0.56      0.65

Proxy-Lagrangian Optimization

For completeness, we provide details of the proxy-Lagrangian optimization approach for solving the constrained optimization in (12). For ease of exposition, we consider a setting with two protected groups, i.e. K = 2. Denoting ∆_{ij}(f) = A_{G_i>G_j}(f) − A_{G_j>G_i}(f), we can restate (12) for two protected groups as:

max_{θ∈Θ} AUC(f_θ)  s.t.  ∆_{01}(f_θ) ≤ ε, ∆_{10}(f_θ) ≤ ε,   (14)

where the ranking model f_θ is parameterized by θ and Θ is a set of model parameters we wish to optimize over.

Since the pairwise metrics are non-continuous in the model parameters (due to the presence of indicator functions I), the approach of Cotter et al. [2019a] requires us to choose a surrogate AUC~ for the AUC objective that is concave in θ and satisfies

AUC~(f_θ) ≤ AUC(f_θ),

and surrogates ∆~_{ij} for the constraint differences ∆_{ij} that are convex in θ and satisfy

∆~_{ij}(f_θ) ≥ ∆_{ij}(f_θ).

In our experiments, we construct these surrogates by replacing the indicator functions I in the pairwise metrics with hinge-based upper/lower bounds. Cotter et al. [2019a] then introduce Lagrange multipliers, or weighting parameters, λ for the objective and constraints, and define two proxy-Lagrangian functions:

L_θ(f_θ, λ) := λ_1 AUC~(f_θ) + λ_2 (∆~_{01}(f_θ) − ε) + λ_3 (∆~_{10}(f_θ) − ε);
L_λ(f_θ, λ) := λ_2 (∆_{01}(f_θ) − ε) + λ_3 (∆_{10}(f_θ) − ε),

where λ ∈ ∆_3 is chosen from the 3-dimensional simplex. The goal is to maximize L_θ over the model parameters and minimize L_λ over λ, resulting in a two-player game between a θ-player and a λ-player.

Note that the θ-player's objective L_θ is defined using the surrogates AUC~ and ∆~_{ij} (this is needed for optimization over model parameters), while the second player's objective L_λ is defined using the original pairwise metrics (as the optimization over λ does not need the indicators to be relaxed).

The proxy-Lagrangian approach uses a no-regret optimization strategy with step-size η_θ to optimize L_θ over model parameters θ, and a swap-regret optimization strategy with step-size η_λ to optimize L_λ over λ. See Algorithm 2 of Cotter et al. [2019a] for specific details of this iterative procedure. The result of T steps of this optimization is a stochastic model µ supported on T models, which, under certain assumptions, is guaranteed to satisfy the specified cross-group constraints. We adapt and re-state the convergence guarantee of Cotter et al. [2019a] for our setting:

Theorem 1 (Cotter et al. [2019a]). Let Θ be a compact convex set. For a given choice of step-sizes η_θ and η_λ, let θ_1, . . . , θ_T and λ_1, . . . , λ_T be the iterates returned by the proxy-Lagrangian optimization algorithm (Algorithm 2 of Cotter et al. [2019a]) after T iterations. Let µ be a stochastic model with equal probability on each of f_{θ_1}, . . . , f_{θ_T}, and let λ̄ = (1/T) Σ_{t=1}^T λ_t. Then there exist choices of step sizes η_θ and η_λ for which the µ thus obtained satisfies, with probability ≥ 1 − δ (over draws of stochastic gradients by the algorithm):

E_{f∼µ}[AUC(f)] ≥ max_{f : ∆_{01}(f) ≤ ε, ∆_{10}(f) ≤ ε} AUC(f) − O(1/√T)

and

E_{f∼µ}[∆_{01}(f)] ≤ ε + O(1/(λ̄_1 √T));   E_{f∼µ}[∆_{10}(f)] ≤ ε + O(1/(λ̄_1 √T)),

where O only hides logarithmic factors.

Under some additional assumptions (that there exist model parameters that strictly satisfy the specified constraints with a margin), Cotter et al. [2019a] show that for a large T, the term λ̄_1 in the above guarantee is essentially lower-bounded by a constant, and the returned stochastic model is guaranteed to closely satisfy the desired pairwise fairness constraints in expectation.


Figure 2: Plot of learned hyperplanes on simulated ranking data with 2 groups. For constrained and robust optimization, we plot the hyperplane that is assigned the highest probability in the support of the learned stochastic model.

Table 4: Test AUC (higher is better) with test pairwise fairness violations (in parentheses) for a simulated ranking task with two groups and two sets of fairness goals: a cross-group equal opportunity criterion and an in-group equal opportunity criterion. For fairness violations, we report max{|A_{G0>G1} − A_{G1>G0}|, |A_{G0>G0} − A_{G1>G1}|}.

Dataset        Prot. Group  Unconstrained  Debiased     Constrained  Robust
Sim. CG & IG   0/1          0.92 (0.28)    0.92 (0.28)  0.75 (0.05)  0.89 (0.05)

Figure 3: Test pairwise accuracy matrices for the simulated ranking data with two groups, with both cross-group and in-group criteria (rows: group of the positively-labeled example; columns: group of the negatively-labeled example).

(a) Unconstrained
      G0     G1
G0    0.941  0.980
G1    0.705  0.894

(b) Constrained
      G0     G1
G0    0.743  0.769
G1    0.779  0.797

(c) Robust
      G0     G1
G0    0.881  0.930
G1    0.897  0.935

Connection to Weighted Pairwise Accuracy

As discussed in the related work, pinned AUC [Dixon et al., 2018] is a previous fairness metric that is a weighted sum of the entries of the pairwise accuracy matrix: ∑_{i,j} βi,j AGi>Gj(f). The constrained and robust optimization formulations that we propose are equivalent to maximizing this weighted pairwise accuracy for a specific, optimal set of weights {βi,j} (see Kearns et al. [2018] and Agarwal et al. [2018] for the relationship between constrained optimization and cost-weighted learning). The constrained optimization solver essentially finds the optimal set of weights {βi,j} for the two formulations. This is convenient because it enables us to specify fairness criteria directly in terms of pairwise accuracies, rather than having to manually tune the {βi,j} needed to produce the pairwise accuracies we want.
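For reference, the pairwise accuracy matrix entries and a weighted sum of them can be computed directly from scores, labels, and group memberships. The following sketch is our illustration (not the evaluation code used in the experiments); it treats a single query's candidates, and in the ranking experiments these counts would be accumulated over queries.

```python
import numpy as np

def pairwise_accuracy_matrix(scores, labels, groups, num_groups):
    """Empirical A_{G_i > G_j}: among pairs whose higher-labeled member is in
    group i and whose lower-labeled member is in group j, the fraction that
    the model orders correctly."""
    correct = np.zeros((num_groups, num_groups))
    total = np.zeros((num_groups, num_groups))
    n = len(scores)
    for a in range(n):
        for b in range(n):
            if labels[a] > labels[b]:
                i, j = groups[a], groups[b]
                total[i, j] += 1
                correct[i, j] += float(scores[a] > scores[b])
    return correct / np.maximum(total, 1)

def weighted_pairwise_accuracy(acc_matrix, beta):
    """sum_{i,j} beta_{i,j} * A_{G_i > G_j}(f), as in pinned-AUC-style metrics."""
    return float(np.sum(beta * acc_matrix))
```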

Additional Experimental Details/Results

We provide further details about the experimental set-up and present additional results.

Setup. We use Adam for gradient updates. For the Wiki Talk Pages dataset and the Law School dataset, we compute stochastic gradients using minibatches of size 100 to better handle the large number of pairs to be enumerated. For all other datasets, we enumerate all pairs and compute full gradients. The datasets are split into training, validation and test sets in the ratio 1/2 : 1/4 : 1/4, with the validation set used to tune the learning rate, the number of training epochs for the unconstrained optimization methods, and the post-processing shrinking step in the proxy-Lagrangian solver. We have made the code available at: https://github.com/google-research/google-research/tree/master/pairwise_fairness.
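As an illustration of the pair enumeration behind these gradient computations, the sketch below (our illustration, with a hinge relaxation standing in for the pairwise indicator) forms same-query pairs within a minibatch and evaluates the relaxed pairwise loss:

```python
import numpy as np

def enumerate_pairs(labels, queries=None):
    """Ordered pairs (a, b) with labels[a] > labels[b]; if query ids are
    given, only candidates from the same query are paired."""
    pairs = []
    n = len(labels)
    for a in range(n):
        for b in range(n):
            if labels[a] > labels[b] and (queries is None or queries[a] == queries[b]):
                pairs.append((a, b))
    return pairs

def pairwise_hinge_loss(scores, pairs):
    """Hinge relaxation of the pairwise mis-ranking rate over the given pairs."""
    if not pairs:
        return 0.0
    return float(np.mean([max(0.0, 1.0 - (scores[a] - scores[b]))
                          for a, b in pairs]))
```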

Comparisons. We give more details about the debiased-weighting baseline. This is an imitation of the debiasing scheme of Dixon et al. [2018], obtained by optimizing a weighted pairwise accuracy (without any explicit constraints):

(1 / (n+ n−)) ∑_{i,j : yi > yj} α_{zi,yi} α_{zj,yj} 1(f(xi) > f(xj)),

where n+ and n− are the numbers of positively and negatively labeled training examples, and the non-negative weights α_{z,y} > 0 on each (protected group, label) pair are set so that the relative proportions of positive and negative examples within each protected group are balanced. Specifically, we fix α_{0,−1} = α_{0,+1} = α_{1,+1} = 1 and set α_{1,−1} so that

|{xi | zi = 0, yi = −1}| / |{xi | zi = 0, yi = +1}| = α_{1,−1} · |{xi | zi = 1, yi = −1}| / |{xi | zi = 1, yi = +1}|.

This mimics Dixon et al. [2018], who sample additional negative documents belonging to group 1 so that the relative label proportions within each group are similar.
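A small sketch of this weighting scheme (our illustration; group labels z ∈ {0, 1} and labels y ∈ {−1, +1} as above), computing α_{1,−1} from the group-wise label counts and returning one weight per example:

```python
import numpy as np

def debiased_example_weights(labels, groups):
    """alpha_{z,y} per example: 1 everywhere except for (z=1, y=-1), which is
    set so that the weighted negative-to-positive ratio in group 1 matches
    the ratio in group 0, as in the balancing condition above."""
    labels = np.asarray(labels)
    groups = np.asarray(groups)
    ratio_g0 = (np.sum((groups == 0) & (labels == -1))
                / np.sum((groups == 0) & (labels == +1)))
    ratio_g1 = (np.sum((groups == 1) & (labels == -1))
                / np.sum((groups == 1) & (labels == +1)))
    alpha_1_neg = ratio_g0 / ratio_g1
    weights = np.ones(len(labels))
    weights[(groups == 1) & (labels == -1)] = alpha_1_neg
    return weights
```

The weight on a pair (i, j) in the objective above is then the product weights[i] * weights[j].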


Table 5: Test AUC (higher is better) with test pairwise fairness violations (in parentheses) for a simulated ranking task with 3 groups and with the marginal equal opportunity fairness goal. For fairness violations, we report max_{i≠j} |AGi>: − AGj>:|.

Dataset         Prot. Group   Unconstrained   Constrained   Robust
Sim. Marginal   {0, 1, 2}     0.95 (0.28)     0.72 (0.03)   0.91 (0.07)
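For completeness, the marginal accuracies AGi>: and the reported violation max_{i≠j} |AGi>: − AGj>:| can be computed as in the sketch below (our illustration for a single query's candidates; not the experiment code):

```python
import numpy as np

def marginal_equal_opportunity_violation(scores, labels, groups, num_groups):
    """A_{G_i>:} is the pairwise accuracy on pairs whose higher-labeled member
    belongs to group i (the lower-labeled member may come from any group);
    the violation is the largest gap between any two of these marginals."""
    marginals = np.zeros(num_groups)
    n = len(scores)
    for i in range(num_groups):
        correct, total = 0, 0
        for a in range(n):
            if groups[a] != i:
                continue
            for b in range(n):
                if labels[a] > labels[b]:
                    total += 1
                    correct += int(scores[a] > scores[b])
        marginals[i] = correct / max(total, 1)
    return float(np.max(marginals) - np.min(marginals)), marginals
```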

Hyperparameter Choices. The unconstrained approach for optimizing AUC (MSE) is run for 2500 iterations, with the step-size chosen from the range {10−3, 10−2, ..., 10} to maximize AUC (or minimize MSE) on the validation set. The proxy-Lagrangian algorithm that we use for constrained and robust optimization is also run for 2500 iterations. We fix the step-sizes ηθ and ηλ for this solver to the same value, and choose this value from the range {10−3, 10−2, ..., 10}. We use a heuristic provided in Cotter et al. [2019b] to pick the step-size that best trades off between the objective and the constraint violations on the validation set. The final stochastic classifier for the constrained and robust optimization methods is constructed as follows: we record 100 snapshots of the iterates at regular intervals and apply the "shrinking" procedure of Cotter et al. [2019b] to construct a sparse stochastic model supported on at most J + 1 iterates, where J is the number of constraints in the optimization problem.
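The grid search itself is straightforward; the snippet below is only a simplified stand-in for the selection step (the actual heuristic is the one described in Cotter et al. [2019b], which we do not reproduce here): among candidate step-sizes, prefer runs whose validation constraint violation is within the allowed slack, breaking ties by validation objective.

```python
STEP_SIZES = [1e-3, 1e-2, 1e-1, 1.0, 10.0]   # the grid {1e-3, ..., 10}

def pick_step_size(results, eps):
    """results: step-size -> (validation_objective, validation_violation).
    A simplified stand-in for the objective/violation trade-off heuristic."""
    feasible = {s: r for s, r in results.items() if r[1] <= eps}
    pool = feasible if feasible else results
    # Pick the step-size with the best objective, penalizing any remaining
    # violation beyond eps.
    return max(pool, key=lambda s: pool[s][0] - max(0.0, pool[s][1] - eps))
```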

For ranking settings, the fixed-regularization approach of Beutel et al. [2019] optimizes the sum of a hinge relaxation of the AUC and their proposed correlation-based regularizer; for the regression setting, it optimizes the sum of the squared error and their proposed regularizer. We run this method for 2500 iterations and choose its step-size from the range {10−3, 10−2, ..., 10}, picking the value that yields the lowest regularized objective value on the validation set. For the post-processing approach of Kallus and Zhou [2019], as prescribed, we fit a monotone transform φ(z) = 1 / (1 + exp(−(αz + β))) to the unconstrained AUC-optimizing model, and tune α from the range {0, 0.05, ..., 5} and β from the range {−1, −2, −5, −10}, choosing values for which the transformed scores yield the lowest fairness violation on the validation set. We implement the approach of Singh and Joachims [2018] by solving an LP per query to post-process and re-rank the scores from the unconstrained AUC-optimizing model to satisfy their proposed disparate impact criterion. We include a per-query slack to make the LPs feasible.
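A sketch of the (α, β) grid search for this post-processing step (our illustration; how the fitted transform is applied to the scores follows Kallus and Zhou [2019] and is abstracted into the hypothetical `apply_transform` callable, with `fairness_violation` evaluating the resulting pairwise violation on the validation set):

```python
import numpy as np
from itertools import product

def tune_monotone_transform(apply_transform, fairness_violation):
    """Grid search over phi(z) = 1 / (1 + exp(-(alpha * z + beta)))."""
    alphas = np.arange(0.0, 5.0 + 1e-9, 0.05)     # {0, 0.05, ..., 5}
    betas = [-1.0, -2.0, -5.0, -10.0]
    best_ab, best_violation = None, np.inf
    for alpha, beta in product(alphas, betas):
        transformed_scores = apply_transform(alpha, beta)
        violation = fairness_violation(transformed_scores)
        if violation < best_violation:
            best_ab, best_violation = (alpha, beta), violation
    return best_ab, best_violation
```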

Experiment Constraining All Matrix Entries. For the simulated ranking task with two groups, we present results for a second experiment, where we seek to enforce both cross-group equal opportunity and in-group equal accuracy criteria by constraining both |A0>1 − A1>0| ≤ 0.01 and |A0>0 − A1>1| ≤ 0.01. For the robust optimization, we implement this goal by maximizing min{A0>1, A1>0} + min{A0>0, A1>1}. Table 4 gives the test ranking accuracy and pairwise fairness goal violations. The pairwise fairness violations are measured as max{|A0>1 − A1>0|, |A0>0 − A1>1|} for this experiment.
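Both quantities in this experiment reduce to simple functions of the 2×2 accuracy matrix; as a sanity check, the unconstrained matrix of Figure 3(a) gives a violation of about 0.28, matching Table 4 (a minimal sketch):

```python
def robust_objective(acc):
    """min{A_{0>1}, A_{1>0}} + min{A_{0>0}, A_{1>1}} for acc[i][j] = A_{G_i > G_j}."""
    return min(acc[0][1], acc[1][0]) + min(acc[0][0], acc[1][1])

def all_entries_violation(acc):
    """max{|A_{0>1} - A_{1>0}|, |A_{0>0} - A_{1>1}|}, the metric reported in Table 4."""
    return max(abs(acc[0][1] - acc[1][0]), abs(acc[0][0] - acc[1][1]))

# Example with the unconstrained accuracies from Figure 3(a):
acc_unconstrained = [[0.941, 0.980],
                     [0.705, 0.894]]
print(all_entries_violation(acc_unconstrained))   # ~0.275, i.e. the 0.28 in Table 4
```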

As expected, the unconstrained algorithm yields the highest overall ranking objective, but incurs very high fairness violations. The debiased weighting approach gives similar performance (as the relative proportions of positives and negatives are the same in expectation for the two protected groups, since the protected group was independent of the label). Both the constrained and robust optimization approaches significantly reduce the fairness violations. In terms of the ranking objective, constrained optimization suffers a considerable decrease in the overall objective when constraining all entries of the accuracy matrix, whereas robust optimization incurs only a marginal decrease.

Figure 3 shows the 2×2 pairwise accuracy matrix for each method. From Figure 3(b), one sees that constrained optimization satisfies the fairness constraints by lowering the accuracies for all four group-pairs. In contrast, Figure 3(c) shows that robust optimization maximizes the minimum entry of the pairwise accuracy matrix. These results are consistent with the two different optimization problems: you get what you ask for.

Figure 2 shows the hyperplanes (dashed lines) for the ranking functions learned by the different methods. Note that the quality of the learned ranking function depends on the slope of the hyperplane and is unaffected by its intercept. The hyperplane learned by the unconstrained approach ranks the majority examples (the "+" group) well, but is not accurate at ranking the minority examples (the "o" group). The hyperplanes learned by the constrained and robust optimization methods work more equally well for both groups.

Additional Results for Simulated Ranking Task with 3 Groups. In Table 5, we provide the test ranking accuracy and pairwise fairness goal violations for the simulated ranking experiment with three groups, with a marginal equal opportunity fairness criterion.

Additional Singh and Joachims [2018] Comparison. We also ran comparisons with the post-processing LP-based approach of Singh and Joachims [2018], with their proposed disparate treatment criterion as the constraint. The results are shown in Table 6. This approach again failed to produce feasible solutions for the Business matching dataset, as it had a very small number of pairs per query, and it could not be applied to the Wiki Talk Pages and Communities & Crime datasets as they did not have the required query-candidate structure. Because this approach seeks to satisfy a non-pairwise, exposure-style fairness constraint, it fails to perform well on our proposed pairwise metrics.


Table 6: Test AUC (higher is better) with test pairwise fairness violations (in parentheses) for comparisons with the approach of Singh and Joachims [2018] that enforces disparate treatment. For fairness violations, we report |AG0>G1 − AG1>G0|.

Dataset       Prot. Group   Unconstrained   Constrained   S & J (DT)
Sim. CG       0/1           0.92 (0.28)     0.86 (0.01)   0.84 (0.05)
W3C Experts   Gender        0.53 (0.96)     0.54 (0.10)   0.38 (0.92)

Row-avg             Chain   Non-chain
(a) Unconstrained   0.767   0.706
(b) Debiased        0.780   0.719
(c) Constrained     0.715   0.712
(d) Robust          0.804   0.733

Figure 4: Test row-based matrix averages on business match (ranking) data.

(a) Unconstrained
                  Non-expert Male   Non-expert Female
  Expert Male          0.534              0.028
  Expert Female        0.991              0.573

(b) Debiased
                  Non-expert Male   Non-expert Female
  Expert Male          0.541              0.081
  Expert Female        0.977              0.571

(c) Constrained/Cross-groups
                  Non-expert Male   Non-expert Female
  Expert Male          0.540              0.501
  Expert Female        0.601              0.571

(d) Robust/Cross-groups
                  Non-expert Male   Non-expert Female
  Expert Male          0.540              0.471
  Expert Female        0.614              0.553

Figure 5: Test pairwise accuracy matrix for W3C experts (ranking) data. Rows index the gender of the expert in a pair; columns index the gender of the non-expert.

(a) Unconstrained
                  Low: Male   Low: Female
  High: Male        0.653       0.490
  High: Female      0.795       0.654

(b) Constrained/Cross-groups
                  Low: Male   Low: Female
  High: Male        0.652       0.647
  High: Female      0.666       0.655

Figure 6: Pairwise accuracy matrix for law school (regression). Rows index the group of the higher-labeled example; columns index the group of the lower-labeled example.

                                    A>      A<
(a) Unconstrained                  0.904   0.577
(b) Constrained/Continuous Attr.   0.777   0.745

Figure 7: Continuous-attribute pairwise accuracies for crime (regression), with % of black population as the protected attribute.

Pairwise Accuracy Matrices. We also present the pairwise accuracy matrices for the different methods applied to the Business matching task (see Figure 4), to the W3C experts ranking task (see Figure 5), to the Law School regression task (see Figure 6), and to the Crime regression task with a continuous protected attribute (see Figure 7). In the case of the Business matching results, notice that robust optimization maximizes the minimum of the two marginals. Despite good marginal accuracies, robust optimization's overall accuracy is not as good because of poorer behavior on the examples that were not covered by the two groups. Debiasing produces a negligible reduction in fairness violation, but yields better row marginals than the unconstrained approach.