GerMS-AT Dataset
The GerMS-AT (German Misogyny/Sexism - Austria) dataset contains user comments from an Austrian online newspaper. Each comment has been annotated by at least 4 of 11 annotators for how strongly sexism/misogyny is present in the comment.
For each comment, the dataset gives the annotator code and the assigned label for every annotator who annotated that comment. Labels represent the severity of any sexism/misogyny present in the comment: 0 (none), 1 (mild), 2 (present), 3 (strong), or 4 (severe).
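Because each comment carries several per-annotator labels rather than a single gold label, it can help to summarize them per comment. The following is a minimal sketch only: the record layout and the field names `id`, `annotations`, `user`, and `label` are assumptions made for illustration and may not match the released files, and the majority-vote binarization is just one possible aggregation, not the one prescribed by the dataset authors.

```python
from collections import Counter
from statistics import mean

# Hypothetical record: field names and annotator codes are assumptions,
# not the documented schema of the dataset files.
comment = {
    "id": "example-1",
    "annotations": [
        {"user": "A01", "label": 0},
        {"user": "A02", "label": 2},
        {"user": "A03", "label": 1},
        {"user": "A04", "label": 2},
    ],
}


def summarize_labels(record):
    """Summarize the severity labels (0-4) that annotators gave one comment."""
    labels = [a["label"] for a in record["annotations"]]
    counts = Counter(labels)
    return {
        "n_annotators": len(labels),
        # How many annotators chose each severity level.
        "distribution": dict(sorted(counts.items())),
        "mean_severity": mean(labels),
        "max_severity": max(labels),
        # Binary view: True if more than half of the annotators
        # assigned any label above 0 (none).
        "majority_sexist": sum(label > 0 for label in labels) > len(labels) / 2,
    }


summary = summarize_labels(comment)
print(summary)
```

Keeping the full label distribution alongside any aggregate preserves annotator disagreement, which matters for a corpus where many cases are subtle or ambiguous.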
The dataset contains 7984 comments. We provide the data using the same split as was used for the GermEval2024 GerMS-Detect shared task, with a training set of 5998 comments and a test set of 1986 comments. No dev set is provided, since the choice of dev set is best left to the machine learning researcher or engineer.
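Since no dev set is provided, one simple option is to hold out part of the training set with a fixed random seed so the split is reproducible. This is a sketch under stated assumptions: the function name `split_train_dev`, the 10% dev fraction, and the seed are illustrative choices, and integer IDs stand in for the actual training records.

```python
import random


def split_train_dev(records, dev_fraction=0.1, seed=42):
    """Split records into (train, dev) using a seeded shuffle."""
    if not 0.0 < dev_fraction < 1.0:
        raise ValueError("dev_fraction must be between 0 and 1")
    shuffled = list(records)  # copy, so the caller's list is not modified
    random.Random(seed).shuffle(shuffled)
    n_dev = round(len(shuffled) * dev_fraction)
    return shuffled[n_dev:], shuffled[:n_dev]


# Stand-in for the 5998 training comments; real records would be used here.
training_records = list(range(5998))
train, dev = split_train_dev(training_records)
print(len(train), len(dev))
```

A stratified split, for example by the aggregated severity label, may be preferable if the label distribution is skewed; the random split above does not guarantee balanced classes.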
A unique property of this corpus is that only a small portion of the sexist/misogynist remarks use strong language, curse words, or otherwise blatantly offensive terms. A large number of comments contain more subtle, indirect, or at times ambiguous forms of sexism/misogyny.
Publications
- Brigitte Krenn, Johann Petrak, Marina Kubina, and Christian Burger. 2024. GerMS-AT: A sexism/misogyny dataset of forum comments from an Austrian online newspaper. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 7728–7739.
Sponsor
FemPower IKT 2018
- Version: 1.0.0
- Release date: 31 July 2024
- Language: German
- Modality: Text
- Licence: CC BY-NC-SA 4.0
- Associated project: FemDwell
- Contact: Johann Petrak