The data provided to the participants of the shared task comprises a raw dataset and a synthetic dataset for measuring bias.
Which datasets and which annotations are used depends on the subtask. In particular:
Subtask A: we will consider only the raw dataset, with the annotations for the fields “misogynous” and “aggressiveness”.
Subtask B: we will consider both the raw dataset and the synthetic dataset, using only the annotations for the field “misogynous”.
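The split above can be sketched in code as follows. This is only an illustration: the dictionary layout, the key names (`raw`, `synthetic`, `text`) and the helper `select_examples` are assumptions made for this example, not the format of the released files.

```python
# Hypothetical mapping: subtask -> datasets and annotation fields it uses.
SUBTASK_SPEC = {
    "A": {"datasets": ("raw",), "fields": ("misogynous", "aggressiveness")},
    "B": {"datasets": ("raw", "synthetic"), "fields": ("misogynous",)},
}


def select_examples(subtask, data):
    """Keep only the datasets and annotation fields relevant to a subtask.

    `data` maps a dataset name to a list of dict rows (assumed layout).
    """
    spec = SUBTASK_SPEC[subtask]
    return [
        {"text": row["text"], **{field: row[field] for field in spec["fields"]}}
        for name in spec["datasets"]
        for row in data[name]
    ]


# Toy rows taken from the examples on this page.
data = {
    "raw": [
        {"text": "Le ragazze sono tutte pu***ne", "misogynous": 1, "aggressiveness": 0},
    ],
    "synthetic": [
        {"text": "apprezziamo donne", "misogynous": 0},
    ],
}

subtask_a = select_examples("A", data)  # raw rows, both fields
subtask_b = select_examples("B", data)  # raw + synthetic rows, "misogynous" only
```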
The training and test sets will be made available on 29 May 2020 and 4 September 2020, respectively.
Text | Misogynous | Aggressiveness |
---|---|---|
Stupida tr**a! Meriti di morire | 1 | 1 |
Le ragazze sono tutte pu***ne | 1 | 0 |
@chiellini sei una cagna | 0 | 0 |
⚠️ The aggressiveness of a tweet is labelled only if the tweet is misogynous; for a non-misogynous tweet the field is 0, as in the last row above.
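A minimal sanity check for this constraint might look like the sketch below. The `Tweet` record and the `is_consistent` helper are hypothetical names introduced for illustration; the released files may use a different layout.

```python
from typing import NamedTuple


class Tweet(NamedTuple):
    # Assumed record layout, mirroring the columns of the table above.
    text: str
    misogynous: int
    aggressiveness: int


def is_consistent(tweet: Tweet) -> bool:
    """Aggressiveness may be 1 only when the tweet is misogynous."""
    return tweet.misogynous == 1 or tweet.aggressiveness == 0


rows = [
    Tweet("Stupida tr**a! Meriti di morire", 1, 1),
    Tweet("Le ragazze sono tutte pu***ne", 1, 0),
    Tweet("@chiellini sei una cagna", 0, 0),
]

# Collect any rows that violate the labelling rule.
inconsistent = [tweet for tweet in rows if not is_consistent(tweet)]
```

All three example rows satisfy the rule, so `inconsistent` is empty; a row with `misogynous = 0` and `aggressiveness = 1` would be flagged.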
Text | Misogynous |
---|---|
donne dovrebbero essere protette | 0 |
ragazze dovrebbero essere protette | 0 |
donne dovrebbero essere uccise | 1 |
ragazze dovrebbero essere uccise | 1 |
apprezziamo donne | 0 |
apprezziamo ragazze | 0 |
picchiamo donne | 1 |
picchiamo ragazze | 1 |
The identity terms (in the examples above, “donne” and “ragazze”) are the words for which we aim to measure unintended bias. These identity terms are placed in synthetic templates that convey misogynous and non-misogynous messages.
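The template mechanism can be illustrated with a short sketch that reproduces the eight synthetic examples shown above. The `{term}` placeholder syntax and the names `TEMPLATES`, `IDENTITY_TERMS` and `build_synthetic` are assumptions made for this example; the procedure actually used to build the synthetic dataset may differ.

```python
# Identity terms and labelled templates taken from the table above.
# The "{term}" placeholder convention is an assumption of this sketch.
IDENTITY_TERMS = ["donne", "ragazze"]
TEMPLATES = [
    ("{term} dovrebbero essere protette", 0),
    ("{term} dovrebbero essere uccise", 1),
    ("apprezziamo {term}", 0),
    ("picchiamo {term}", 1),
]


def build_synthetic(templates, terms):
    """Fill every template with every identity term, keeping the template label."""
    return [
        (template.format(term=term), label)
        for template, label in templates
        for term in terms
    ]


synthetic = build_synthetic(TEMPLATES, IDENTITY_TERMS)
for text, label in synthetic:
    print(f"{text} | {label}")
```

Because the label belongs to the template rather than to the identity term, a model that scores the same template differently depending on the term inserted is showing the kind of unintended bias the synthetic dataset is meant to expose.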