Evaluation Metrics

Each participating team will initially have access only to the training data (both raw and synthetic). Later, we will release the unlabelled test data (again, both raw and synthetic). After the evaluation period, the labels for the test data will be released.

The evaluation will be performed according to the following metrics:

  • SubTask A: The ranking will be computed by averaging the $F_1$ measures estimated for the Misogynous and Aggressiveness classes:
$ score_A = \frac{F_1(\text{Misogynous}) + F_1(\text{Aggressiveness})}{2} $
  • SubTask B: The ranking will be computed as a weighted combination of the AUC computed on the raw test dataset ($\text{AUC}_{\text{raw}}$) and three per-term AUC-based bias scores computed on the synthetic dataset ($\text{AUC}_{\text{Subgroup}}$, $\text{AUC}_{\text{BPSN}}$, $\text{AUC}_{\text{BNSP}}$). Let $s$ be an identity-term and $N$ the number of identity-terms; the score is then computed as follows (see the sketch after this list):
$ score_B = \frac{1}{2} \text{AUC}_{\text{raw}} + \frac{1}{2} \frac{\sum_s\text{AUC}_{\text{Subgroup}}(s) + \sum_s\text{AUC}_{\text{BPSN}}(s) + \sum_s\text{AUC}_{\text{BNSP}}(s) }{N} $
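
Purely as an illustration, the sketch below shows how the two official scores could be computed with scikit-learn, given binary ground-truth arrays, predicted labels, and predicted scores; all function and variable names are hypothetical and not part of the task definition.

  # Hypothetical sketch of score_A and score_B, assuming scikit-learn.
  from sklearn.metrics import f1_score, roc_auc_score

  def score_a(y_miso, pred_miso, y_aggr, pred_aggr):
      # Average of the F1 measures of the Misogynous and Aggressiveness classes.
      return (f1_score(y_miso, pred_miso) + f1_score(y_aggr, pred_aggr)) / 2

  def score_b(y_raw, scores_raw, per_term_aucs):
      # per_term_aucs: dict mapping each identity-term to its
      # (AUC_Subgroup, AUC_BPSN, AUC_BNSP) triple computed on the synthetic data.
      auc_raw = roc_auc_score(y_raw, scores_raw)
      n_terms = len(per_term_aucs)
      bias_sum = sum(sum(triple) for triple in per_term_aucs.values())
      return 0.5 * auc_raw + 0.5 * bias_sum / n_terms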

Unintended Bias Metrics

Unintended bias can be uncovered by looking at differences in the score distributions between data mentioning a specific identity-term $s$ (subgroup $s$ distribution) and the rest of the data (background distribution). Following [1, 2], the three per-term AUC-based bias scores are defined as follows (a sketch of their computation is given after this list):

  • $\text{AUC}_{\text{Subgroup}}(s)$: computes the AUC only on the data within the subgroup $s$. This measures the model's ability to separate misogynous and non-misogynous examples within the subgroup itself. A low value means the model does a poor job of distinguishing misogynous from non-misogynous comments that mention the identity-term.

  • $\text{AUC}_{\text{BPSN}}(s)$: Background Positive, Subgroup Negative (BPSN) computes the AUC on the misogynous examples from the background and the non-misogynous examples from the subgroup. A low value means the model confuses non-misogynous examples that mention the identity-term with misogynous examples that do not, i.e., it likely assigns higher misogyny scores than it should to non-misogynous examples mentioning the identity-term.

  • $\text{AUC}_{\text{BNSP}}(s)$: Background Negative, Subgroup Positive (BNSP) computes the AUC on the non-misogynous examples from the background and the misogynous examples from the subgroup. A low value means the model confuses misogynous examples that mention the identity-term with non-misogynous examples that do not, i.e., it likely assigns lower misogyny scores than it should to misogynous examples mentioning the identity-term.
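
As an illustration only, the three per-term bias AUCs could be computed as in the following sketch. It assumes NumPy arrays y (true labels, 1 = misogynous), scores (predicted misogyny scores), and a boolean mask in_subgroup marking the synthetic examples that mention the identity-term; all names are hypothetical.

  # Hypothetical sketch of the per-term bias AUCs, following the definitions in [1].
  from sklearn.metrics import roc_auc_score

  def subgroup_auc(y, scores, in_subgroup):
      # AUC restricted to examples that mention the identity-term.
      return roc_auc_score(y[in_subgroup], scores[in_subgroup])

  def bpsn_auc(y, scores, in_subgroup):
      # Background positives (misogynous, no mention) + subgroup negatives.
      mask = (~in_subgroup & (y == 1)) | (in_subgroup & (y == 0))
      return roc_auc_score(y[mask], scores[mask])

  def bnsp_auc(y, scores, in_subgroup):
      # Background negatives (non-misogynous, no mention) + subgroup positives.
      mask = (~in_subgroup & (y == 0)) | (in_subgroup & (y == 1))
      return roc_auc_score(y[mask], scores[mask])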

References

1. Borkan D., Dixon L., Sorensen J., Thain N., and Vasserman L. (2019). Nuanced Metrics for Measuring Unintended Bias with Real Data for Text Classification. In Companion of The 2019 World Wide Web Conference (WWW 2019). ACM.

2. Jigsaw Unintended Bias in Toxicity Classification, Kaggle Competition. https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification