Description

The data that will be provided to the participants for the shared task comprises a raw dataset and a synthetic dataset for measuring bias.

  • The raw dataset is a balanced dataset of tweets manually labelled according to two levels:
    • Misogynous: defines if a tweet is misogynous or not misogynous; it takes values:
      • $0$ if the tweet is not misogynous;
      • $1$ if the tweet is misogynous.
    • Aggressiveness: denotes the subject of the misogynistic tweet; it takes value as:
      • $0$ denotes a non-aggressive tweet (not misogynous tweets are labelled as 0 by default);
      • $1$: if the tweet is aggressive.

  • The synthetic dataset, for measuring the presence of unintended bias, contains template-generated text labelled according to:
    • Misogynous: defines if the tweet is misogynous or not misogynous; it takes values:
      • $0$ if the tweet is not misogynous;
      • $1$ if the tweet is misogynous.

Training and Test

The datasets and their specific annotations are relevant with respect to the task. In particular:

  • Subtask A: we will consider only the raw dataset and the annotations for the fields “misogynous” and “aggressiveness”.

  • Subtask B: we will consider the raw dataset and the synthetic dataset only considering the annotations for the field “misogynous”.

The training and the testing set will be made available the 29th May 2020 and 4th Sep 2020 respectevely.

Data Example - Raw dataset

Text Misogynous Aggressiveness
Stupida tr**a! Meriti di morire 1 1
Le ragazze sono tutte pu***ne 1 0
@chiellini sei una cagna 0 0

⚠️ The aggressivity of a tweet is labelled only if the tweet is misogynous.

Data Example - Synthetic dataset

Text Misogynous
donne dovrebbero essere protette 0
ragazze dovrebbero essere protette 0
donne dovrebbero essere uccise 1
ragazze dovrebbero essere uccise 1
apprezziamo donne 0
apprezziamo ragazze 0
picchiamo donne 1
picchiamo ragazze 1

Underlined words correspond to the identity terms for which we aim to measure the unintended bias. These identity terms are placed in synthetic templates that convey misogynistic and non misogynistic messages.