One Million Posts Corpus

The “One Million Posts” corpus is an annotated data set of German-language user comments posted to an Austrian newspaper website. It comprises approximately one million posts, of which approximately 11,000 are manually annotated with the following categories: sentiment (negative/neutral/positive), off-topic (yes/no), inappropriate (yes/no), discriminating (yes/no), feedback to the article author (yes/no), user personal stories (yes/no), and arguments used (yes/no).
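As a rough illustration of the annotation scheme described above, the categories could be modelled as a small record type. This is only a sketch: the class name `PostAnnotation`, its field names, and the `flagged` helper are hypothetical and do not reflect the corpus's actual file format or schema. It also assumes that every category is labelled for a given post, which the description does not guarantee.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Sentiment(Enum):
    """The three sentiment values named in the corpus description."""
    NEGATIVE = "negative"
    NEUTRAL = "neutral"
    POSITIVE = "positive"


@dataclass(frozen=True)
class PostAnnotation:
    """Hypothetical container for the labels of one annotated post."""
    sentiment: Sentiment
    off_topic: bool
    inappropriate: bool
    discriminating: bool
    feedback: bool          # feedback to the article author
    personal_stories: bool  # user personal stories
    arguments_used: bool

    def flagged(self) -> List[str]:
        """Return the names of all yes/no categories answered 'yes'."""
        return [
            name
            for name, value in vars(self).items()
            if isinstance(value, bool) and value
        ]


# Illustrative values only, not taken from the corpus.
example = PostAnnotation(
    sentiment=Sentiment.NEGATIVE,
    off_topic=False,
    inappropriate=True,
    discriminating=False,
    feedback=False,
    personal_stories=True,
    arguments_used=False,
)
print(example.flagged())
```

Keeping sentiment as a three-valued enum and the remaining categories as booleans mirrors the negative/neutral/positive and yes/no distinction in the description.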

Publications

Dietmar Schabus, Marcin Skowron, Martin Trapp. "One Million Posts: A Data Set of German Online Discussions." In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pp. 1241-1244. Tokyo, Japan, August 2017. DOI: 10.1145/3077136.3080711. [Preprint]
Dietmar Schabus and Marcin Skowron. "Academic-Industrial Perspective on the Development and Deployment of a Moderation System for a Newspaper Website." In Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018), pp. 1602-1605. Miyazaki, Japan, May 2018.

Authors

Dietmar Schabus
Marcin Skowron
Martin Trapp

Licence

Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International

Sponsor

Google Digital News Innovation Fund

Key facts

Version: 1.0.0
Release date: 01 August 2017
Language: German
Modality: Text
Licence: CC BY-NC-SA 4.0
Contact: Marcin Skowron