Enlighten Publications

Building a Large-scale Corpus for Evaluating Event Detection on Twitter

McMinn, A. J., Moshfeghi, Y. and Jose, J. M. (2013) Building a Large-scale Corpus for Evaluating Event Detection on Twitter. In: 22nd ACM International Conference on Information and Knowledge Management, San Francisco, CA, USA, 27 Oct - 01 Nov 2013, pp. 409-418. ISBN 9781450322638 (doi: 10.1145/2505515.2505695)

Full text not currently available from Enlighten.

Abstract

Despite the popularity of Twitter for research, there are very few publicly available corpora, and those which are available are either too small or unsuitable for tasks such as event detection. This is partially due to a number of issues associated with the creation of Twitter corpora, including restrictions on the distribution of the tweets and the difficulty of creating relevance judgements at such a large scale. The difficulty of creating relevance judgements for the task of event detection is further hampered by ambiguity in the definition of event. In this paper, we propose a methodology for the creation of an event detection corpus. Specifically, we first create a new corpus that covers a period of 4 weeks and contains over 120 million tweets, which we make available for research. We then propose a definition of event which fits the characteristics of Twitter, and using this definition, we generate a set of relevance judgements aimed specifically at the task of event detection. To do so, we make use of existing state-of-the-art event detection approaches and Wikipedia to generate a set of candidate events with associated tweets. We then use crowdsourcing to gather relevance judgements, and discuss the quality of results, including how we ensured integrity and prevented spam. As a result of this process, along with our Twitter corpus, we release relevance judgements containing over 150,000 tweets, covering more than 500 events, which can be used for the evaluation of event detection approaches.

Item Type:	Conference Proceedings
Keywords:	Crowdsourcing, event detection, Mechanical Turk, reproducibility, social media, test collection, Twitter corpus
Status:	Published
Refereed:	Yes
Glasgow Author(s) Enlighten ID:	Jose, Professor Joemon and Moshfeghi, Dr Yashar and McMinn, Andrew
Authors:	McMinn, A. J., Moshfeghi, Y., and Jose, J. M.
College/School:	College of Science and Engineering > School of Computing Science
Publisher:	ACM
ISBN:	9781450322638

Funder and Project Information

Project Code	Award No	Project Name	Principal Investigator	Funder's Name	Funder Ref	Lead Dept
57279	1	LiMoSINe: Linguistically Motivated Semantic aggregatIon eNgines	Joemon Jose	European Commission (EC)	288024	COM - COMPUTING SCIENCE

Deposit and Record Details

ID Code:	128478
Depositing User:	Dr Yashar Moshfeghi
Datestamp:	03 Oct 2016 15:26
Last Modified:	22 Sep 2021 22:20
Date of first online publication:	2013
Data Availability Statement:	No