Introducing CAD: the Contextual Abuse Dataset

Vidgen, B., Nguyen, D., Margetts, H., Rossini, P. and Tromble, R. (2021) Introducing CAD: the Contextual Abuse Dataset. In: 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 06-11 Jun 2021, pp. 2289-2303. (doi: 10.18653/v1/2021.naacl-main.182)

Text
272734.pdf - Published Version
Available under License Creative Commons Attribution.

318kB

Publisher's URL: https://www.aclweb.org/anthology/2021.naacl-main.182

Abstract

Online abuse can inflict harm on users and communities, making online spaces unsafe and toxic. Progress in automatically detecting and classifying abusive content is often held back by the lack of high-quality and detailed datasets. We introduce a new dataset of primarily English Reddit entries which addresses several limitations of prior work. It (1) contains six conceptually distinct primary categories as well as secondary categories, (2) has labels annotated in the context of the conversation thread, (3) contains rationales and (4) uses an expert-driven group-adjudication process for high-quality annotations. We report several baseline models to benchmark the work of future researchers. The annotated dataset, annotation guidelines, models and code are freely available.

Item Type: Conference Proceedings
Additional Information: This work was supported by Wave 1 of The UKRI Strategic Priorities Fund under the EPSRC Grant EP/T001569/1, particularly the “Criminal Justice System” theme within that grant, and The Alan Turing Institute.
Status: Published
Refereed: Yes
Glasgow Author(s) Enlighten ID: Rossini, Dr Patricia
Authors: Vidgen, B., Nguyen, D., Margetts, H., Rossini, P., and Tromble, R.
College/School: College of Social Sciences > School of Social and Political Sciences > Politics
Copyright Holders: Copyright © 2021 Association for Computational Linguistics
First Published: First published in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: 2289-2303
Publisher Policy: Reproduced under a Creative Commons License
