Independent Ethical Assessment of Text Classification Models: A Hate Speech Detection Case Study

Amitoj Singh*
Emerging Technology, PwC
PwC Acceleration Center, Mumbai, India

Jingshu Chen*
PwC Acceleration Center Shanghai
Pudong New District, Shanghai, China

Lihao Zhang*
PwC Acceleration Center Shanghai
Pudong New District, Shanghai, China

Amin Rasekh†
Product Innovation, PwC
Washington, DC, USA

Ilana Golbin
Emerging Technology, PwC
Los Angeles, CA, USA

Anand S. Rao
Global AI Leader, PwC
Boston, MA, USA

ABSTRACT

An independent ethical assessment of an artificial intelligence system is an impartial examination of the system's development, deployment, and use in alignment with ethical values. System-level qualitative frameworks that describe high-level requirements and component-level quantitative metrics that measure individual ethical dimensions have been developed over the past few years. However, there exists a gap between the two, which hinders the execution of independent ethical assessments in practice. This study bridges this gap and designs a holistic independent ethical assessment process for a text classification model with a special focus on the task of hate speech detection. The assessment is further augmented with protected attributes mining and counterfactual-based analysis to enhance bias assessment. It covers assessments of technical performance, data bias, embedding bias, classification bias, and interpretability. The proposed process is demonstrated through an assessment of a deep hate speech detection model.

CCS CONCEPTS

• Computing methodologies → Supervised learning by classification; Natural language processing.

KEYWORDS

natural language processing, classification, responsible AI, ethical assessment, fairness

*Authors contributed equally to this research.
†Corresponding author: [email protected]

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
KDD '21, August 14–18, 2021, Singapore
© 2021 Association for Computing Machinery.
ACM ISBN 978-1-4503-XXXX-X/18/06…$15.00
https://doi.org/10.1145/1122445.1122456

ACM Reference Format:
Amitoj Singh, Jingshu Chen, Lihao Zhang, Amin Rasekh, Ilana Golbin, and Anand S. Rao. 2021. Independent Ethical Assessment of Text Classification Models: A Hate Speech Detection Case Study. In KDD '21: SIGKDD International Conference on Knowledge Discovery and Data Mining, August 14–18, 2021, Singapore. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/1122445.1122456

1 INTRODUCTION

Internal and external independent assessments of artificial intelligence (AI) systems are conducted to examine quality, responsibility, and accountability for their development, deployment, and use [14, 18]. Major ethical concerns involving language models have recently drawn attention to the need for accountability in this domain; for example, Microsoft's AI chatbot Tay tweeted racist comments after being trained on Twitter data, and a GPT-3-generated blog post that was entirely fake reached the no. 1 spot on Hacker News, hinting at a mass misinformation threat. Independence in this context is defined as the freedom from conditions that may jeopardize the ability of the assessment activity to fulfill assessment responsibilities in an impartial manner [15]. A three-level classification of independent assessments is provided in Appendix A.

An independent ethical assessment requires both system-level qualitative guidelines that describe the process and requirements and component-level quantitative tools that enable the calculation of assessment metrics related to individual ethical dimensions. The Assessment List for Trustworthy Artificial Intelligence [18] and the Artificial Intelligence Auditing Framework [16] are examples of such qualitative guidelines, amongst others [1, 17, 22]. Quantitative models that focus on different ethical components have also been developed, such as the techniques for bias detection in embedding models by [7, 19, 25] and the metrics for evaluating bias in text classification datasets and models by [2, 5, 9, 20, 27]. Quantitative measures require desired metrics, which must be defined by the context, not by the framework.

There exists a gap between system-level qualitative guidelines and component-level quantitative models that hinders execution of independent ethical assessments in practice [3]. This study bridges this gap for text classification systems in the context of hate speech detection models. While hate speech detection aims to serve the positive cause of reducing hateful content, it also poses the risk of silencing harmless content, or even worse, doing so with a false positive bias and without an ability to provide an explanation of its reasoning.

We summarize our main contributions as follows:


• We develop a process for independent ethical assessments of text classification systems by bridging high-level qualitative guidelines and low-level technical models.

• We propose methods to mine protected attributes from unstructured data.

• We demonstrate approaches and metrics for quantifying bias in data, word embeddings, and classification models.

The paper is organized as follows. Section 2 presents the proposed assessment process. Section 3 describes the text classification system under consideration and its technical assessment. Section 4 presents the data bias assessment using two different approaches. Section 5 describes the embedding bias metrics and assessment. Section 6 presents the classification bias assessment using the original, swapped, and synthetic datasets. Section 7 describes the local and global interpretability assessment. Section 8 concludes the study.

2 PROCESS DESCRIPTION

This process performs quantitative assessments to address the qualitative requirements set by the Assessment List for Trustworthy Artificial Intelligence (ALTAI) by the European Commission [18]. The classification accuracy assessment addresses Requirement #2 Technical Robustness and Safety. Bias assessments are conducted for the data, word embeddings, and classification model, which correspond to Requirement #5 Diversity, Non-discrimination and Fairness. ALTAI defines bias as "systematic and repeatable errors in a computer system that create unfair outcomes, such as favoring one arbitrary group of users over others." Data and embedding bias are possible causes of unfairness, and may influence classification bias, which is an actual measure of fairness. Protected attributes are defined as those qualities, traits, or characteristics, such as gender, that cannot be discriminated against. Model interpretability assessments are performed to address Requirement #4 Transparency. Since the AI system under consideration is not computationally intensive and was trained on a laptop, its environmental impact is deemed to be negligible. Nevertheless, Requirement #6 Societal and Environmental Well-being is briefly addressed in Appendix I. This process does not cover all requirements, and future work is necessary to address the remainder.

3 THE AI SYSTEM AND TECHNICAL ASSESSMENT

The system under assessment is a pre-trained natural language processing (NLP) system that identifies hateful comments on a social media platform. The system has been trained on a dataset of Twitter posts in English and classifies comments/posts as hateful or non-hateful. It is a proof-of-concept model that was developed by a separate team within the same organization. Information about the dataset and model is provided in Appendix B.

Requirement #2 in ALTAI states that reliable technical performance of the system is one of the critical requirements for trustworthy AI. We assessed the performance of the model on the test dataset using various metrics. The results are presented in Table 1; the accuracy of the model is 88%.

Table 1: Technical performance assessment of the AI system. Macro and weighted averages are calculated as the unweighted mean and the support-weighted mean of the per-label scores, respectively.

               Precision  Recall  F1-Score  Support
Not-hateful    0.78       0.92    0.84      1,031
Hateful        0.95       0.87    0.91      2,502
Macro Avg.     0.87       0.89    0.87      3,803
Weighted Avg.  0.89       0.88    0.88      3,803
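For reference, per-label metrics like those in Table 1 can be reproduced with a standard classification report; the sketch below assumes arrays of test-set labels and thresholded predictions are available (the toy arrays and variable names are illustrative, not taken from the assessed system's code).

```python
# Minimal sketch of reproducing Table 1-style metrics with scikit-learn, assuming
# ground-truth labels and thresholded predictions for the test set are available.
from sklearn.metrics import classification_report, accuracy_score

y_true = [0, 1, 1, 0, 1]   # 0 = not-hateful, 1 = hateful (toy placeholder data)
y_pred = [0, 1, 0, 0, 1]   # model predictions thresholded at 0.5

print(classification_report(y_true, y_pred, target_names=["Not-hateful", "Hateful"]))
print("Accuracy:", accuracy_score(y_true, y_pred))
```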

4 DATA BIAS ASSESSMENT

Requirement #5 in ALTAI outlines the need for an assessment of the input data for bias. This data bias assessment step is motivated by the possible impact that data bias may have on the fairness of the classification model. In the case of structured datasets, correlations between protected attribute features and decision variables can be analyzed to assess data bias. For unstructured data like text, however, this approach may not be possible because protected attributes are not represented as separate, standalone features. This study uses two approaches to tackle this problem and assess bias in the training data.

In the first approach, which is inspired by [5], a curated set of identity terms is first identified, and the terms' frequencies across hateful comments and across all comments in the training dataset are then found and compared. The results for all the identity terms are provided in Appendix C. They indicate that some of the identity terms, such as "gay" and "white", are disproportionately used in hateful comments, which could be indicative of false positive bias.

The second approach is based on the frequency of protected attribute references across hateful comments and overall. Protected attribute mining is first conducted to identify references made to each subgroup of each attribute in the comments (as elaborated upon in Section 6.1 below). Their frequencies across the hateful comments and across all the comments are then calculated and compared. The results are shown in Table 2 for the protected attributes of gender and religion. They show that some bias is present in the data, such as for the "Islam" and "female" subgroups: in our dataset, references to Islam have a larger presence in hateful comments than references to Christianity, and the same holds true for female references compared to male references.

Table 2: Frequency of protected attribute references in hateful and not-hateful comments and overall for the protected attributes of gender and religion

Protected Attribute  Subgroup      Hateful  Not-hateful  Overall
Gender               Female        12.28%   11.53%       12.02%
                     Male          11.51%   17.37%       13.52%
Religion             Islam          0.10%    0.15%        0.12%
                     Christianity   0.07%    0.24%        0.13%
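As a rough illustration of the frequency comparisons above, the sketch below counts identity-term occurrences in hateful versus all comments, assuming the training data is available as a pandas DataFrame with "text" and "label" columns; the column names and the abbreviated term list are assumptions, not taken from the assessed system.

```python
# Illustrative sketch of the identity-term frequency comparison described in Section 4.
# Assumes a DataFrame `df` with columns "text" and "label" (1 = hateful); the term list
# here is abbreviated and purely illustrative.
import pandas as pd

identity_terms = ["gay", "white", "muslim", "feminist"]

def term_frequencies(df: pd.DataFrame, term: str) -> dict:
    contains = df["text"].str.contains(rf"\b{term}\b", case=False, regex=True)
    return {
        "hateful_%": 100 * contains[df["label"] == 1].mean(),
        "not_hateful_%": 100 * contains[df["label"] == 0].mean(),
        "overall_%": 100 * contains.mean(),
    }

df = pd.DataFrame({"text": ["you are gay and terrible", "white snow is nice"],
                   "label": [1, 0]})
report = pd.DataFrame({t: term_frequencies(df, t) for t in identity_terms}).T
print(report)
```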


5 EMBEDDING BIAS ASSESSMENT

The embedding model should be assessed for bias as it influences the classification model's decisions, which relates to Requirement #5 in ALTAI. This embedding bias assessment is motivated by the potential influence that embedding bias has on the fairness of the classification model. To assess embedding bias, we analyze the similarity between a diverse set of 501 neutral words (e.g., "admirable" and "miserable") [7] and word groups associated with the three protected attributes of gender, religion, and ethnicity (Appendix D). Neutral here means that the words are expected to be similarly correlated across protected attribute words [7]. The similarity metric used is cosine similarity between the individual terms' embeddings, and it is aggregated using average mean absolute error (AMAE) and average root mean squared error (ARMSE).

Given a protected attribute (e.g., gender) and a subgroup within it (e.g., female), the similarity between an individual neutral word and that subgroup is calculated by averaging the distances between the neutral word and each subgroup word, such as, for example, "nice" and "she". This is done for all the neutral words and subgroups. Then the AMAE and ARMSE of the average cosine similarity across the subgroups are calculated (the equations are provided in Appendix E). This is performed for all the protected attributes, and the results are illustrated in Figure 1. The magnitudes of the AMAE and ARMSE show how strong the biases are for that particular category. In principle, the two measures are similar to the statistical variance of the cosine distances across the various subgroups in a category. Therefore, the higher the measure, the greater the disparity in association and the more biased the embeddings are for that category.

Figure 1: Embedding bias magnitude across different protected attributes

The AMAE and ARMSE scores indicate that the embedding model is more biased as it relates to religion terms but less so for gender terms. Appendix E provides a comparison of the bias magnitude for this system's embedding model against other mainstream embedding models.

6 CLASSIFICATION BIAS ASSESSMENT

While classification accuracy at an aggregate level is related to ALTAI Requirement #2 Technical Robustness and Safety, an accuracy report broken down by various protected attributes would address Requirement #5 Diversity, Non-discrimination and Fairness. In contrast to data and embedding bias, classification bias may directly impact the system's stakeholders.

To assess this bias, subgroups based on gender, ethnicity, religion, etc. are first identified in different comments. A bias assessment is then performed to understand how the model's prediction accuracy differs across these protected attributes. This assessment can be enhanced by measuring changes in the model's accuracy as the subgroups in the comments are swapped (for example, does the model's true positive prediction for "men are universally terrible" change to a false negative if "men" is swapped with "women"?). Synthetic sentences containing protected attribute references are also generated and analyzed for a better understanding of how the model learns biases.

6.1 Protected Attributes Mining

To extract protected attributes from text, two approaches are applied in this study: 1) a look-up based approach and 2) a named-entity recognition (NER) based approach. These approaches have been validated using datasets with human-annotated comments [5]. The look-up approach is based on the word counts extracted from the text examples, where the words are looked up in a list of common identity terms such as pronouns and professions. The NER based approach uses spaCy's pre-trained NER model to extract entities (i.e., protected attributes). A named entity is a "real-world object" with a type associated with it, which might be a country, a product, a religion, etc. In this study, entities tagged with the "NORP" type are extracted; "NORP" refers to "nationalities or religious or political groups". spaCy's NER model takes into account the context and part of speech to determine the tag for a given word. Gender and religion are the protected attributes addressed in this subsection. Several examples of protected attribute extractions from comments are included in Appendix G.
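A minimal sketch of the two mining approaches is shown below, assuming spaCy and its en_core_web_sm pipeline are installed; the look-up lists are abbreviated placeholders rather than the full lexicons of Appendix D.

```python
# Minimal sketch of the look-up and NER based protected-attribute mining approaches.
# Assumes spaCy and "en_core_web_sm" are installed; the look-up lists are abbreviated
# placeholders, not the full lexicons in Appendix D.
import spacy

nlp = spacy.load("en_core_web_sm")

GENDER_LOOKUP = {"male": {"he", "him", "his", "man", "men", "guy"},
                 "female": {"she", "her", "hers", "woman", "women", "gal"}}

def lookup_gender(text: str) -> set:
    tokens = {t.lower() for t in text.split()}
    return {group for group, words in GENDER_LOOKUP.items() if tokens & words}

def ner_norp(text: str) -> list:
    # Entities tagged NORP cover nationalities, religious and political groups.
    return [ent.text for ent in nlp(text).ents if ent.label_ == "NORP"]

comment = "She said the Muslims in her neighborhood organised the food drive."
print(lookup_gender(comment))   # {'female'}
print(ner_norp(comment))        # e.g. ['Muslims'] (exact output depends on the spaCy model)
```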

6.2 Bias Assessment based on Protected Attributes Mining

The PwC Responsible AI (RAI) Toolkit performs a bias assessment using protected attribute references, ground-truth labels, and prediction probabilities [23]. It checks whether the model discriminates against various groups or individuals using a variety of bias metrics. Table 3 and Figure 2 show the results for gender. There were 501 and 442 references to male and female subgroups, respectively. Appendix F provides a description of the seven bias metrics used.

Table 3: Average prediction probabilities for comments with gender references

Actual Comment Type  Subgroup  Avg. Predicted Probability
Not-hateful          Female    14.0% hateful
Not-hateful          Male      15.4% hateful
Hateful              Female    85.6% hateful
Hateful              Male      80.4% hateful

6.3 Assessment Based on Swapped Protected Attributes

Versions of the comments where identity terms are swapped help address the limitations of relying solely on the ground-truth data, such as the disproportionate representation of one subgroup over another. Examples are provided in Appendix G.


Figure 2: Bias assessment based on protected attribute extraction for gender

A bias assessment can be performed using the actual label and the predicted probabilities before and after identity swapping. We define the term 'favor' as the model's confidence in its prediction one way or the other. For not-hateful comments, a decrease in the prediction probability after swapping identities (e.g., male to female) means the model favors the new identity. For hateful comments, an increase in the prediction probability after swapping identities (e.g., male to female) means the model favors the new identity. This process of swapping is repeated for all the comments with protected attribute references. This bias assessment indicated that for 55.9% of the comments the model favored the female subgroup, while in 43.4% of the comments it favored the male subgroup. The model prediction did not change for 0.7% of the comments, with probabilities rounded to four decimal places.
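The swap-and-compare logic described above can be sketched as follows, assuming a predict_proba wrapper around the classifier that returns the hateful probability for a single comment; the wrapper and the small swap dictionary are assumptions made for illustration.

```python
# Sketch of the swap-based check: compare the model's predicted hateful probability
# before and after swapping identity terms in a comment. `predict_proba` stands in for
# the assessed classifier and is an assumption here; the swap map is abbreviated.
SWAP = {"he": "she", "him": "her", "his": "her", "man": "woman", "men": "women",
        "male": "female", "guy": "gal"}

def swap_identities(text: str) -> str:
    return " ".join(SWAP.get(tok.lower(), tok) for tok in text.split())

def favored_group(text: str, label: int, predict_proba) -> str:
    """label: 1 = hateful, 0 = not-hateful. Returns which identity the model favors."""
    p_orig, p_swap = predict_proba(text), predict_proba(swap_identities(text))
    if p_swap == p_orig:
        return "no change"
    # Per the rule above: for not-hateful comments a lower hateful probability after
    # swapping favors the new identity; for hateful comments a higher one does.
    favors_new = (p_swap < p_orig) if label == 0 else (p_swap > p_orig)
    return "swapped identity" if favors_new else "original identity"
```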

6.4 Bias Assessment Based on Synthetic Counterfactual Dataset

Bias assessment can also be performed using a synthetic counterfactual dataset. This method offers greater flexibility but is labor-intensive, and the resulting dataset may not be representative of the actual texts the model encounters in practice. This idea is related to counterfactual fairness, which, as stated by [11], "captures the intuition that a decision is fair towards an individual if it is the same in (a) the actual world and (b) a counterfactual world where the individual belonged to a different demographic group". Examples of such counterfactuals for religion are provided in Appendix G.

Once the synthetic counterfactual dataset is generated, a bias assessment, similar to the swapped dataset assessment above, can be conducted. We used several templates (some inspired by [5]) for gender and religion. The average prediction probabilities (Appendix G) indicate that there appears to be a bias towards treating all talk of Islam as more hateful than talk of Christianity.

We use the following threshold-insensitive counterfactual bias metric for each reference subgroup:

$$CB(\mathbf{X}, O, \mathbf{y}) = \sum_{i=1}^{|\mathbf{X}|} \left( p_{\mathbf{x}_i} - \frac{1}{|O_i|} \sum_{j=1}^{|O_i|} p_{O_{ij}} \right) \times g(y_i) \quad (1)$$

where $\mathbf{X}$ is a series of reference group (e.g., female) examples (e.g., ("she is kind", "women are universally terrible", ...)), $O$ is a series of sets of counterfactual examples of the form (("he is kind", ...), ("men are universally terrible", ...)), with $O_i$ denoting the set corresponding to $\mathbf{x}_i$, and $p_{\mathbf{x}_i} = f(\mathbf{x}_i)$ is the model's predicted probability. $\mathbf{y}$ is a series of labels for the reference group examples, where 0 corresponds to not-hateful comments and 1 corresponds to hateful comments. $g(y_i)$ is -1 when $y_i = 0$ and 1 when $y_i = 1$, and it corrects for the direction of the bias. For the model under consideration, the preferred label is not-hateful and therefore $g(y)$ is -1 for not-hateful comments. A positive value of $CB$ means the model favors the reference subgroup.
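A small sketch of how Equation (1) could be computed is given below, assuming the model's predicted hateful probabilities for the reference and counterfactual examples have already been collected; the function name and toy values are illustrative.

```python
# Sketch of the counterfactual bias metric CB in Equation (1), assuming predicted
# hateful probabilities are already available. Names and toy numbers are illustrative.
from typing import List

def counterfactual_bias(p_ref: List[float],
                        p_counter: List[List[float]],
                        labels: List[int]) -> float:
    """p_ref[i]: predicted probability for the i-th reference-group example.
    p_counter[i]: predicted probabilities for its counterfactual versions.
    labels[i]: 1 = hateful, 0 = not-hateful. Positive CB favors the reference group."""
    cb = 0.0
    for p_x, p_o, y in zip(p_ref, p_counter, labels):
        g = 1 if y == 1 else -1            # corrects the direction of the bias
        cb += (p_x - sum(p_o) / len(p_o)) * g
    return cb

# Toy usage: "Islam" as the reference subgroup, "Christianity" counterfactuals.
print(counterfactual_bias([0.68, 0.16], [[0.64], [0.11]], [1, 0]))
```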

Setting "Islam" as the reference subgroup returns a bias value of -0.0067, which indicates that the model slightly favors the "Christianity" subgroup. For gender, setting "Male" as the reference subgroup returns a bias value of 0.0737. A comprehensive assessment may also include other bias metrics such as equalized odds [8].

7 CLASSIFICATION INTERPRETABILITY ASSESSMENT

Requirement #4 in ALTAI focuses on transparency requirements and describes explainability as one of the elements to consider alongside traceability and open communication. The question is simple: why are certain tweets classified as hate speech? Understanding hate in speech is a function of the vocabulary and the context of the language used, and we as humans have an innate ability to recognize hate. We seek to understand how an AI model infers it. Does it consider the vocabulary? If yes, how does it do so, given that we have not explicitly fed the model a list of abusive words or profane phrases? Does it understand the context? If yes, how does it distinguish between cases where strong words are used as a form of free speech or self-expression (e.g., "I am Gay") and cases where the words are used disparagingly as hateful comments directed at someone? We explored two methods to address these questions. Overall, the two explainability packages explored offer end users a unique perspective that ties model decisions to the vocabulary used in a particular tweet. This may help automate, at least partially, some of the content moderation tasks social media platforms are undertaking and, at the same time, provide measurable evidence of what drives a hate prediction.

The first method focuses on local interpretability and uses LIME, which is a model-agnostic method that helps explain a given prediction through the features used in the model [24]. In the case of a text classification model, LIME assigns each token in a given sentence an importance score, which can be either positive or negative depending on the word's effect on the prediction. For example, in the tweet shown in Figure 3, LIME shows that words like "vote" and "liberals" contribute to the model predicting non-hate, while words like "filthy" and "sick" push the model to classify the tweet as a hate comment. LIME can be used to examine incidents of hate tweets, as shown in Figure 3, to understand how the AI model determines hate. This can be of use to the end user to better analyze the classification and to see what kind of vocabulary is too strong for the model to predict the comment as non-hateful. Here the use of strong language was the key for the model to predict hate.

Figure 3: Local interpretability of an example prediction using LIME
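A sketch of how such a LIME explanation could be produced is shown below, assuming a predict_proba(texts) function that returns an (n, 2) array of [not-hateful, hateful] probabilities for a list of strings; that wrapper around the assessed CNN is an assumption, not part of the original system.

```python
# Sketch of a LIME-based local explanation for a single comment. `predict_proba` is an
# assumed wrapper that maps a list of strings to an (n, 2) array of class probabilities.
from lime.lime_text import LimeTextExplainer

explainer = LimeTextExplainer(class_names=["not-hateful", "hateful"])

def explain_comment(comment: str, predict_proba):
    exp = explainer.explain_instance(comment, predict_proba, num_features=10)
    # Each (token, weight) pair: positive weights push the prediction towards "hateful".
    return exp.as_list()
```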

The second method concerns global interpretability and uses SHAP values. SHAP is a model-agnostic explainability tool that generates feature importances using the overall dataset. It evaluates various permutations over feature selection to estimate the importance of a particular feature on the overall model [13]. The global interpretability results are provided in Appendix H. They show, at an overall level, that strong language is what drives the model in predicting hate. The keywords listed there had the largest impact on the model's detection of hate speech.
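The sketch below illustrates how token-level SHAP values could be generated for such a view, again assuming a predict_proba(texts) wrapper and a list of comments to explain; the masker choice and sample size are assumptions.

```python
# Sketch of generating SHAP token attributions for the classifier, assuming a
# `predict_proba(texts)` wrapper around the assessed model and a list `comments`
# of tweets (both assumptions made for illustration).
import shap

masker = shap.maskers.Text(r"\W+")                 # simple word-level tokenization
explainer = shap.Explainer(predict_proba, masker)  # model-agnostic text explainer
shap_values = explainer(comments[:100])            # token attributions for a sample

# Token attributions can then be aggregated across comments (e.g., mean absolute
# value per token) to obtain the kind of global summary shown in Appendix H.
shap.plots.text(shap_values[0])                    # inspect one comment's attributions
```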

8 CONCLUSIONS

This study proposed a process for independent ethical assessments of text classification systems and demonstrated it through a first-party assessment of a hate speech detection model. While the ALTAI recommendations are largely applicable to text classification systems, they are underspecified, leaving a looming implementation gap. This study addresses that gap through a series of component-wise quantitative assessments. A discussion of the other requirements, with ideas for addressing them, is provided in Appendix I. Risk mitigation is also a logical next step to any risk assessment process. This work serves as a concrete blueprint for the design and implementation of independent ethical assessment processes for text classification systems. More work is needed to apply these ideas to other domains.

ACKNOWLEDGMENTS

We wish to thank Todd Morrill for thorough feedback on this article.

REFERENCES

[1] Information Systems Audit and Control Association (ISACA). 2018. Auditing Artificial Intelligence. https://ec.europa.eu/futurium/en/system/files/ged/auditing-artificial-intelligence.pdf
[2] Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna Wallach. 2020. Language (technology) is power: A critical survey of "bias" in NLP. arXiv preprint arXiv:2005.14050 (2020).
[3] Shea Brown, Jovana Davidovic, and Ali Hasan. 2021. The algorithm audit: Scoring the algorithms that score us. Big Data & Society 8, 1 (2021), 2053951720983865.
[4] Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber. 2017. Automated hate speech detection and the problem of offensive language. In Proceedings of the International AAAI Conference on Web and Social Media, Vol. 11.
[5] Lucas Dixon, John Li, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. 2018. Measuring and mitigating unintended bias in text classification. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society. 67–73.
[6] Eurostat Statistics Explained. 2017. Glossary: Carbon dioxide equivalent. Retrieved May 19, 2021 from https://ec.europa.eu/eurostat/statistics-explained/index.php?title=Glossary:Carbon_dioxide_equivalent#:~:text=A%20carbon%20dioxide%20equivalent%20or,with%20the%20same%20global%20warming
[7] Nikhil Garg, Londa Schiebinger, Dan Jurafsky, and James Zou. 2018. Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences 115, 16 (2018), E3635–E3644.
[8] Moritz Hardt, Eric Price, and Nathan Srebro. 2016. Equality of opportunity in supervised learning. arXiv preprint arXiv:1610.02413 (2016).
[9] Ben Hutchinson, Vinodkumar Prabhakaran, Emily Denton, Kellie Webster, Yu Zhong, and Stephen Denuyl. 2020. Social biases in NLP models as barriers for persons with disabilities. arXiv preprint arXiv:2005.00813 (2020).
[10] Emre Kazim, Danielle Mendes Thame Denny, and Adriano Koshiyama. 2021. AI auditing and impact assessment: according to the UK information commissioner's office. AI and Ethics (2021), 1–10.
[11] Matt J Kusner, Joshua R Loftus, Chris Russell, and Ricardo Silva. 2017. Counterfactual fairness. arXiv preprint arXiv:1703.06856 (2017).
[12] Alexandre Lacoste, Alexandra Luccioni, Victor Schmidt, and Thomas Dandres. 2019. Quantifying the carbon emissions of machine learning. arXiv preprint arXiv:1910.09700 (2019).
[13] Scott Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions. arXiv preprint arXiv:1705.07874 (2017).
[14] T Madiega. 2019. EU guidelines on ethics in artificial intelligence: Context and implementation. European Parliamentary Research Service, PE 640 (2019).
[15] Institute of Internal Auditors. 2021. International Professional Practices Framework (IPPF). Institute of Internal Auditors.
[16] The Institute of Internal Auditors. 2017. Artificial Intelligence Auditing Framework. https://na.theiia.org/periodicals/Public%20Documents/GPI-Artificial-Intelligence-Part-II.pdf
[17] Information Commissioner's Office. 2020. Guidance on the AI auditing framework. https://ico.org.uk/media/about-the-ico/consultations/2617219/guidance-on-the-ai-auditing-framework-draft-for-consultation.pdf
[18] High-Level Expert Group on Artificial Intelligence. 2020. Assessment List for Trustworthy Artificial Intelligence (ALTAI). https://doi.org/10.2759/791819
[19] Orestis Papakyriakopoulos, Simon Hegelich, Juan Carlos Medina Serrano, and Fabienne Marco. 2020. Bias in word embeddings. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency. 446–457.
[20] Ji Ho Park, Jamin Shin, and Pascale Fung. 2018. Reducing gender bias in abusive language detection. arXiv preprint arXiv:1808.07231 (2018).
[21] Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 1532–1543.
[22] Inioluwa Deborah Raji, Andrew Smart, Rebecca N White, Margaret Mitchell, Timnit Gebru, Ben Hutchinson, Jamila Smith-Loud, Daniel Theron, and Parker Barnes. 2020. Closing the AI accountability gap: defining an end-to-end framework for internal algorithmic auditing. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency. 33–44.
[23] A Rao, F Palaci, and W Chow. 2019. A practical guide to Responsible Artificial Intelligence (AI). Technical Report. PwC.
[24] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1135–1144.
[25] David Rozado. 2020. Wide range screening of algorithmic bias in word embedding models using large sentiment lexicons reveals underreported bias types. PLoS ONE 15, 4 (2020), e0231189.
[26] JP Russell. 2019. The ASQ auditing handbook. ASQ Quality Press.
[27] Maarten Sap, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A Smith. 2019. The risk of racial bias in hate speech detection. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 1668–1678.
[28] Sandra Wachter, Brent Mittelstadt, and Chris Russell. 2021. Why fairness cannot be automated: Bridging the gap between EU non-discrimination law and AI. Computer Law & Security Review 41 (2021), 105567.
[29] Marcos Zampieri, Shervin Malmasi, Preslav Nakov, Sara Rosenthal, Noura Farra, and Ritesh Kumar. 2019. Predicting the type and target of offensive posts in social media. arXiv preprint arXiv:1902.09666 (2019).
[30] Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. 2018. Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 2.

A A THREE-LEVEL INDEPENDENT ASSESSMENT CLASSIFICATION

The three-level classification that is common in the audit and assurance landscape [26] can be adapted.

A first-party ethical assessment of an AI system is an internal assessment conducted by a team within the same organization that owns the AI system but that had no role in its development and has no vested interest in its assessment results.

A second-party assessment is an external assessment conducted by a contracted party outside the organization on behalf of a customer.

A third-party assessment is an external assessment by a party independent of the organization-customer relationship that assesses the level of conformity of the AI system to certain ethical criteria and standards [26].

The AI system development team and the AI system assessment team are the two primary groups involved in the process for a first-party ethical assessment. The two teams are independent of each other, although they are still within the same organization. The AI system development team is analogous to the first and second lines in the IIA's Three Lines Model [15], which are the creators, executors, operators, managers, and supervisors who are in charge of designing, building, and deploying the AI system as well as risk assessment and strategy. The AI system assessment team is related closely to the third line, which comprises the auditors and ethicists who assess the system against the ethical requirements.

B DATASET AND MODEL DESCRIPTION

The text dataset: The dataset has been created by combining two public datasets from [4, 29] and used for the purpose of training the hate speech classifier. The model was trained on 34,220 Tweets, and an unseen test dataset of 3,803 Tweets was also provided to us.

The word embedding model: The classifier uses the pretrained Stanford GloVe Common Crawl word embeddings with 300 dimensions [21] for training and predictions.

The classification model: The classifier is a convolutional neural network model that takes word embeddings as input. The network architecture is a 1D CNN layer (32 filters with a kernel size of 17), followed by a max pooling layer (pool size = 4) and then two fully connected dense layers (a 25-unit output followed by a 1-unit output). The assessment here uses the pretrained model as is (i.e., only to make predictions where needed), and the whole process is repeatable without knowledge of the architecture.
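A minimal Keras sketch of an architecture matching this description is given below; the padded sequence length, the use of pre-computed 300-dimensional GloVe vectors as input, and the intermediate Flatten layer are assumptions made for illustration, not details of the assessed model's training setup.

```python
# Minimal Keras sketch of an architecture matching the description above: a 1D CNN with
# 32 filters and kernel size 17, max pooling with pool size 4, then 25-unit and 1-unit
# dense layers. MAX_LEN and the Flatten layer are assumptions for illustration.
from tensorflow.keras import layers, models

MAX_LEN, EMB_DIM = 100, 300   # assumed padded sequence length; GloVe dimension

model = models.Sequential([
    layers.Input(shape=(MAX_LEN, EMB_DIM)),           # pre-computed GloVe embeddings
    layers.Conv1D(32, kernel_size=17, activation="relu"),
    layers.MaxPooling1D(pool_size=4),
    layers.Flatten(),
    layers.Dense(25, activation="relu"),
    layers.Dense(1, activation="sigmoid"),             # probability of "hateful"
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```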

C FREQUENCY OF IDENTITY TERMS IN DATA

The statistics are based on the training dataset and are provided in Table 4. It contains three columns: (1) the % of comments labelled as "Hateful" that contain an identity term, (2) the % of comments labelled as "Not-hateful" that contain an identity term, and (3) the % of comments overall ("Hateful" and "Not-hateful") that contain an identity term.

Table 4: Frequency of identity terms in hateful and not-hateful comments and overall

Term          Hateful %  Not-hateful %  Overall %
atheist       0.0044     0.0085         0.0058
queer         0.2220     0.0427         0.1607
gay           0.3864     0.1282         0.2981
transgender   0.0044     0.0085         0.0058
lesbian       0.0266     0.0256         0.0263
homosexual    0.0133     0.0085         0.0117
feminist      0.0222     0.0000         0.0146
black         0.6706     0.5042         0.6137
white         1.3367     0.5640         1.0725
heterosexual  0.0044     0.0000         0.0029
islam         0.0044     0.0000         0.0029
muslim        0.0178     0.0171         0.0175

D PROTECTED ATTRIBUTE WORD GROUPS

D.1 Word groups for three protected attributes (based on [7, 30])

Religion:

Islam: 'allah', 'ramadan', 'turban', 'emir', 'salaam', 'sunni', 'koran', 'imam', 'sultan', 'prophet', 'veil', 'ayatollah', 'shiite', 'mosque', 'islam', 'sheik', 'muslim', 'muhammad';

Christianity: 'baptism', 'messiah', 'catholicism', 'resurrection', 'christianity', 'salvation', 'protestant', 'gospel', 'trinity', 'jesus', 'christ', 'christian', 'cross', 'catholic', 'church', 'christians', 'catholics'

Gender:

Male: 'cowboy', 'cowboys', 'cameramen', 'cameraman', 'busboy', 'busboys', 'bellboy', 'bellboys', 'barman', 'barmen', 'tailor', 'tailors', 'prince', 'princes', 'governor', 'governors', 'adultor', 'adultors', 'god', 'gods', 'host', 'hosts', 'abbot', 'abbots', 'actor', 'actors', 'bachelor', 'bachelors', 'baron', 'barons', 'beau', 'beaus', 'bridegroom', 'bridegrooms', 'brother', 'brothers', 'duke', 'dukes', 'emperor', 'emperors', 'enchanter', 'father', 'fathers', 'fiance', 'fiances', 'priest', 'priests', 'gentleman', 'gentlemen', 'grandfather', 'grandfathers', 'headmaster', 'headmasters', 'hero', 'heros', 'lad', 'lads', 'landlord', 'landlords', 'male', 'males', 'man', 'men', 'manservant', 'manservants', 'marquis', 'masseur', 'masseurs', 'master', 'masters', 'monk', 'monks', 'nephew', 'nephews', 'priest', 'priests', 'sorcerer', 'sorcerers', 'stepfather', 'stepfathers', 'stepson', 'stepsons', 'steward', 'stewards', 'uncle', 'uncles', 'waiter', 'waiters', 'widower', 'widowers', 'wizard', 'wizards', 'airman', 'airmen', 'boy', 'boys', 'groom', 'grooms', 'businessman', 'businessmen', 'chairman', 'chairmen', 'dude', 'dudes', 'dad', 'dads', 'daddy', 'daddies', 'son', 'sons', 'guy', 'guys', 'grandson', 'grandsons', 'guy', 'guys', 'he', 'himself', 'him', 'his', 'husband', 'husbands', 'king', 'kings', 'lord', 'lords', 'sir', 'sir', 'mr.', 'mr.', 'policeman', 'spokesman', 'spokesmen';

Female: 'cowgirl', 'cowgirls', 'camerawomen', 'camerawoman', 'busgirl', 'busgirls', 'bellgirl', 'bellgirls', 'barwoman', 'barwomen', 'seamstress', 'seamstress', 'princess', 'princesses', 'governess', 'governesses', 'adultress', 'adultresses', 'godess', 'godesses', 'hostess', 'hostesses', 'abbess', 'abbesses', 'actress', 'actresses', 'spinster', 'spinsters', 'baroness', 'barnoesses', 'belle', 'belles', 'bride', 'brides', 'sister', 'sisters', 'duchess', 'duchesses', 'empress', 'empresses', 'enchantress', 'mother', 'mothers', 'fiancee', 'fiancees', 'nun', 'nuns', 'lady', 'ladies', 'grandmother', 'grandmothers', 'headmistress', 'headmistresses', 'heroine', 'heroines', 'lass', 'lasses', 'landlady', 'landladies', 'female', 'females', 'woman', 'women', 'maidservant', 'maidservants', 'marchioness', 'masseuse', 'masseuses', 'mistress', 'mistresses', 'nun', 'nuns', 'niece', 'nieces', 'priestess', 'priestesses', 'sorceress', 'sorceresses', 'stepmother', 'stepmothers', 'stepdaughter', 'stepdaughters', 'stewardess', 'stewardesses', 'aunt', 'aunts', 'waitress', 'waitresses', 'widow', 'widows', 'witch', 'witches', 'airwoman', 'airwomen', 'girl', 'girls', 'bride', 'brides', 'businesswoman', 'businesswomen', 'chairwoman', 'chairwomen', 'chick', 'chicks', 'mom', 'moms', 'mommy', 'mommies', 'daughter', 'daughters', 'gal', 'gals', 'granddaughter', 'granddaughters', 'girl', 'girls', 'she', 'herself', 'her', 'her', 'wife', 'wives', 'queen', 'queens', 'lady', 'ladies', "ma'am", 'miss', 'mrs.', 'ms.', 'policewoman', 'spokeswoman', 'spokeswomen'

Ethnicity:

Chinese: 'chung', 'liu', 'wong', 'huang', 'ng', 'hu', 'chu', 'chen', 'lin', 'liang', 'wang', 'wu', 'yang', 'tang', 'chang', 'hong', 'li';

Hispanic: 'ruiz', 'alvarez', 'vargas', 'castillo', 'gomez', 'soto', 'gonzalez', 'sanchez', 'rivera', 'mendoza', 'martinez', 'torres', 'rodriguez', 'perez', 'lopez', 'medina', 'diaz', 'garcia', 'castro', 'cruz';

White: 'harris', 'nelson', 'robinson', 'thompson', 'moore', 'wright', 'anderson', 'clark', 'jackson', 'taylor', 'scott', 'davis', 'allen', 'adams', 'lewis', 'williams', 'jones', 'wilson', 'martin', 'johnson'

E EMBEDDING BIAS METRICS AND VALUES

Let $\mathbf{M} \in \mathbb{R}^{m \times d}$ be a matrix representing $d$-dimensional word embeddings for the $m$ neutral terms (e.g., crazy, nice, etc.). Let $\mathbf{T}_i \in \mathbb{R}^{k_i \times d}$ be a matrix that corresponds to the $d$-dimensional word embeddings for the $k_i$ terms that represent subgroup $i$ (e.g., $k_1 = 2$, where 1 represents the male subgroup, which has terms {he, boy}). Let $\mathbf{x}_i \in \mathbb{R}^m$ be the vector representing the averaged cosine similarity scores between the $m$ neutral terms and the $k_i$ subgroup terms for subgroup $i$. We compute the $j$th entry of $\mathbf{x}_i$ as follows.

$$x_{ij} = \frac{1}{k_i} \sum_{\ell=1}^{k_i} \mathrm{COSINESIM}(\mathbf{m}_j, \mathbf{T}_{i\ell}) \quad (2)$$

We measure deviations between subgroups (e.g., male vs. female) as proxies for word embedding bias. These measures include mean absolute error (MAE), root mean squared error (RMSE), and several variations. When there are two subgroups being compared we compute

$$\mathrm{MAE} = \frac{1}{m} \sum_{j=1}^{m} |x_{1j} - x_{2j}| \quad (3)$$

In the event that there are more than two subgroups being compared, we compute the average MAE (AMAE) score across all pairwise comparisons between subgroups.

$$\mathrm{AMAE} = \frac{1}{\binom{s}{2}} \sum_{k=1}^{s} \sum_{n=k+1}^{s} \mathrm{MAE}_{kn} \quad (4)$$

where $s$ is the number of subgroups and $\mathrm{MAE}_{kn}$ is the MAE between subgroups $k$ and $n$. Similarly, we compute the RMSE and averaged RMSE (ARMSE) as follows.

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{j=1}^{m} (x_{1j} - x_{2j})^2}{m}} \quad (5)$$

$$\mathrm{ARMSE} = \frac{1}{\binom{s}{2}} \sum_{k=1}^{s} \sum_{n=k+1}^{s} \mathrm{RMSE}_{kn} \quad (6)$$
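A small sketch of how Equations (2)-(4) could be computed is shown below; the subgroup dictionary, the random toy vectors, and all names are illustrative rather than taken from the assessed system.

```python
# Sketch of the AMAE computation in Equations (2)-(4), assuming a dict mapping each
# subgroup to a matrix of its word vectors and a matrix of neutral-word vectors.
from itertools import combinations
import numpy as np

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def subgroup_scores(neutral_vecs, subgroup_vecs):
    # x_i: average cosine similarity of each neutral word to the subgroup's terms (Eq. 2)
    return np.array([np.mean([cosine_sim(n, t) for t in subgroup_vecs])
                     for n in neutral_vecs])

def amae(neutral_vecs, subgroups):
    scores = {name: subgroup_scores(neutral_vecs, vecs) for name, vecs in subgroups.items()}
    # average MAE over all pairwise subgroup comparisons (Eqs. 3-4)
    return np.mean([np.mean(np.abs(scores[a] - scores[b]))
                    for a, b in combinations(scores, 2)])

rng = np.random.default_rng(0)
neutral = rng.normal(size=(5, 300))                        # toy "neutral" word vectors
groups = {"male": rng.normal(size=(3, 300)), "female": rng.normal(size=(3, 300))}
print(amae(neutral, groups))
```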

Figure 4: AMAE magnitude across different protected attributes and word embedding models

Figure 5: ARMSE magnitude across different protected attributes and word embedding models


F CLASSIFICATION BIAS METRIC DEFINITIONS

Equal Opportunity: It measures the difference between true positive rates (proportion of samples correctly classified into the positive class) for reference and protected groups.

Gini Equality: It measures the difference between the Gini inequality of benefits, which are defined as: Benefits = Prediction - Target + 1.

Normalized Treatment Equality: It measures the difference between the false negative to false positive ratio in reference and protected groups.

Overall Accuracy Equality: It measures the difference between accuracy in reference and protected groups.

Positive Predictive Value: It measures the difference between positive predictive values (proportion of positive samples among all samples classified as positive) for reference and protected groups.

Positive Class Balance: It measures the difference between average predicted probability, given the predicted class is positive, for reference and protected groups.

Statistical Parity: It measures the difference between positive rates (proportion of samples classified into the positive class) for reference and protected groups.

The choice of metrics depends on which type of error the user assigns more penalty to, and therefore wants to equalize across different groups. For example, Equal Opportunity means equalizing the true positive rate (TPR) across the different groups. If TPR is a criterion the user really cares about, then Equal Opportunity is a suitable fairness metric to consider. If the user cares more about the probability of getting the favourable result, the user should consider Statistical Parity. Therefore, the selection of the metric depends on the user's needs and preferences. Automation of this process remains a challenge [28].
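As an illustration, two of these metrics (statistical parity and equal opportunity) can be computed as simple group differences; the sketch below uses toy arrays and is not taken from the RAI Toolkit.

```python
# Illustrative computation of statistical parity and equal opportunity as differences
# between a reference group (0) and a protected group (1). Toy data only.
import numpy as np

def statistical_parity_diff(y_pred, group):
    # difference in positive prediction rates between the two groups
    return y_pred[group == 0].mean() - y_pred[group == 1].mean()

def equal_opportunity_diff(y_true, y_pred, group):
    # difference in true positive rates between the two groups
    def tpr(g):
        return y_pred[(group == g) & (y_true == 1)].mean()
    return tpr(0) - tpr(1)

y_true = np.array([1, 1, 0, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 1, 1])
group = np.array([0, 0, 0, 1, 1, 1])   # 0 = reference, 1 = protected
print(statistical_parity_diff(y_pred, group),
      equal_opportunity_diff(y_true, y_pred, group))
```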

G CLASSIFICATION BIAS SUPPLEMENTARY DATA AND RESULTS

Tables 5–9 provide supplementary results for the classification bias assessment.


H GLOBAL INTERPRETABILITY RESULTS

The global interpretability of predictions using SHAP provides a holistic view of how a given token impacts the overall working of the model. The results are illustrated in Figure 6. The various dots in this plot signify the presence of a token across different sets of comments. The vertical location shows which token is being depicted, the color shows whether that token's importance was high or low for that observation, and the horizontal location shows whether the effect of that value caused a higher or lower prediction (the model's output probability).

Table 5: Examples of the structured dataset entries generated through protected attribute mining. Gender and religion are extracted using the look-up and NER based methods, respectively.

Comment: "A visit to the DC Holocaust Museum revealed Hitler won by 43% of the popular vote and 32% of the seats. He also used the Schutzstaffel (SS) to intimidate his opponents, reminding one of the Antifa thugs we have today. Hitler also got the Muslims on his side. Uncanny."
Gender Extracted: Male; Religion Extracted: Islam

Comment: "Seton Catholic where their own students talk trash about how low of a division there football team is in."
Gender Extracted: -; Religion Extracted: Christianity

Comment: "How to develop their competitive skills and social skills in Home Schooling...as my child too complain abt bullying and unjust behaviour of teachers since she is super active..."
Gender Extracted: Female; Religion Extracted: -

Table 6: Swapped comments examples

Original: "This Jerry Lewis ripoff needs to just go away already. The guy is so good at acting like a fool because he is a fool."
Gender: Male; Label: Hateful; Prediction: 0.9811

Swapped: "This Jerry Lewis ripoff needs to just go away already. The gal is so good at acting like a fool because she is a fool."
Gender: Female; Label: Hateful; Prediction: 0.9891

We generated this view for the hate comments in particular and discovered that profane language or cuss words turned out to be significant and to positively impact the model, thereby driving it to classify these comments as "hateful".

I OTHER ALTAI REQUIREMENTS

ALTAI Requirement #6 focuses on societal and environmental well-being and considers the environmental impact of AI systems.


Table 7: Synthetic counterfactual dataset examples for religion

Template: "I am a [Identity] person, ask me anything"
  Subgroup Christianity: "I am a Christian person, ask me anything"
  Subgroup Islam: "I am a Muslim person, ask me anything"
  Label: Not-hateful

Template: "[Identity] people are gross and universally terrible"
  Subgroup Christianity: "Christian people are gross and universally terrible"
  Subgroup Islam: "Muslim people are gross and universally terrible"
  Label: Hateful

Table 8: Prediction probability statistics for counterfactual comments with religion references

Actual Comment Type  Subgroup      Avg. Predicted Probability
Not-hateful          Islam         15.6% hateful
Not-hateful          Christianity  11.2% hateful
Hateful              Islam         67.9% hateful
Hateful              Christianity  63.5% hateful

Table 9: Prediction probability statistics for counterfactual comments with gender references

Actual Comment Type  Subgroup  Avg. Predicted Probability
Not-hateful          Male      12.9% hateful
Not-hateful          Female    13.4% hateful
Hateful              Male      57.4% hateful
Hateful              Female    57.2% hateful

There exist different methods and mechanisms that can be used to evaluate the associated environmental impacts. Our process uses the Machine Learning Emissions Calculator tool [12] to estimate the amount of carbon emissions produced by training the classifier under assessment here. This can be measured in carbon dioxide equivalent (CO2eq) units, which is "a metric measure used to compare the emissions from various greenhouse gases on the basis of their global-warming potential, by converting amounts of other gases to the equivalent amount of carbon dioxide with the same global warming potential" [6]. The model assessed in this study has been trained on CPUs locally and its carbon footprint was considered to be negligible.

There are several additional ethical AI requirements that have not been applicable to or covered in this paper. The model under assessment here has been trained offline on reliable datasets. But resilience of the model to cyber-attacks, such as data poisoning and model evasion, is a critical dimension if the model is trained online and using comment labels obtained through crowdsourcing (ALTAI Requirement #2). The model here used public datasets.

Figure 6: Global interpretability of predictions using SHAP

But in the event of use of or application to private data, the assessment should also evaluate privacy dimensions (ALTAI Requirement #3). This also includes data governance on both a macro and a micro level, with the focus areas of data availability, sharing, usability, consistency, and integrity.

The assessment may also account for the interests of stakeholders and how important each requirement is to each of them. The idea of a relevance matrix proposed by [3] can be used to connect them with the model's ethical performance. The assessment output can also inform AI impact assessments [10], which are usually created by the first party and are not necessarily covered in second- or third-party reviews.