On building a reusable Twitter corpus

Mccreadie, R., Soboroff, I., Lin, J., Macdonald, C., Ounis, I. and McCullough, D. (2012) On building a reusable Twitter corpus. In: SIGIR 2012: 35th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Portland, OR, USA, 12-16 Aug 2012, pp. 1113-1114. (doi: 10.1145/2348283.2348495)

Full text not currently available from Enlighten.

Publisher's URL: http://dx.doi.org/10.1145/2348283.2348495


The Twitter real-time information network is the subject of research for information retrieval tasks such as real-time search. However, so far, reproducible experimentation on Twitter data has been impeded by restrictions imposed by the Twitter terms of service. In this paper, we detail a new methodology for legally building and distributing Twitter corpora, developed through collaboration between the Text REtrieval Conference (TREC) and Twitter. In particular, we detail how the first publicly available Twitter corpus - referred to as Tweets2011 - was distributed via lists of tweet identifiers and specialist tweet crawling software. Furthermore, we analyse whether this distribution approach remains robust over time, as tweets in the corpus are removed either by users or Twitter itself. Tweets2011 was successfully used by 58 participating groups for the TREC 2011 Microblog track, while our results attest to the robustness of the crawling methodology over time.
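As a rough illustration of the distribution approach the abstract describes, the Python sketch below rebuilds a local corpus from a list of tweet identifiers and then measures what fraction of the listed tweets is still retrievable. It is a minimal sketch under stated assumptions, not the crawling software distributed with Tweets2011: the names `rehydrate` and `retention_rate` and the `fetch` callback are hypothetical, and no real Twitter API is called.

```python
from typing import Callable, Dict, Iterable, Optional


def rehydrate(
    tweet_ids: Iterable[int],
    fetch: Callable[[int], Optional[str]],
) -> Dict[int, str]:
    """Rebuild a local corpus from a distributed list of tweet identifiers.

    `fetch` is a hypothetical callback that returns the tweet text, or
    None when the tweet has been removed or is otherwise unavailable.
    """
    corpus: Dict[int, str] = {}
    for tweet_id in tweet_ids:
        text = fetch(tweet_id)
        if text is not None:
            corpus[tweet_id] = text
    return corpus


def retention_rate(tweet_ids: Iterable[int], corpus: Dict[int, str]) -> float:
    """Fraction of the listed identifiers that are present in the corpus."""
    ids = list(tweet_ids)
    if not ids:
        return 0.0
    return sum(1 for tweet_id in ids if tweet_id in corpus) / len(ids)


# Stand-in for a live service: tweet 102 has since been deleted.
ids = [101, 102, 103, 104]
live_tweets = {101: "first", 103: "third", 104: "fourth"}

corpus = rehydrate(ids, live_tweets.get)
rate = retention_rate(ids, corpus)
```

Repeating the same crawl at later dates and comparing the retention rates is one simple way to track how much of an identifier-based corpus survives as tweets are removed.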

Item Type: Conference Proceedings
Additional Information: ISBN: 9781450314725
Glasgow Author(s) Enlighten ID: Mccreadie, Dr Richard and Macdonald, Professor Craig and Ounis, Professor Iadh
Authors: Mccreadie, R., Soboroff, I., Lin, J., Macdonald, C., Ounis, I., and McCullough, D.
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
College/School: College of Science and Engineering > School of Computing Science
