Looking for the smoking gun: principled sampling in creating the tobacco industry documents corpus

Kretzschmar, W.A., Darwin, C., Brown, C., Rubin, D.L. and Biber, D. (2004) Looking for the smoking gun: principled sampling in creating the tobacco industry documents corpus. Journal of English Linguistics, 32(1), pp. 31-47. (doi: 10.1177/0075424204263024)

Full text not currently available from Enlighten.

Abstract

As a result of litigation over the past decade, major tobacco companies were compelled to make public a broad range of previously confidential documents. We have created a series of corpora from the tobacco industry documents (TIDs) for three purposes: (1) to establish baseline descriptions of various linguistic features of this unique set of texts; (2) to identify TIDs in which rhetorical manipulation (“deception”) may have occurred and to estimate the extent and prevalence of manipulation; (3) to analyze manipulation in order to classify it and develop means to identify similar manipulation in other industry document sets. Our three-part corpus creation strategy employed rigorous sampling methods. First, we drew a limited sample from the largest collection of TIDs, to determine a representative classification of text types and to estimate their proportions within the overall body of texts. Then, we created a reference corpus (500,000+ words) constituting a stratified random sample of all TIDs, whether or not they exhibit manipulation. Finally, we compiled a corpus of texts presumed to exhibit rhetorical manipulation. We assumed that multiple drafts of a text or versions of a text prepared for different audiences constituted rhetorical manipulation. This article presents our experience with the sampling methods utilized in this corpus-building process and our findings regarding text types comprising the reference corpus.
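
The first two steps of the sampling strategy in the abstract (a pilot sample to estimate text-type proportions, then a stratified random sample) can be illustrated with a minimal sketch. It is not the authors' procedure: the function names, the text-type labels ("memo", "letter", "report"), and the proportional allocation by document count are all assumptions made for illustration, whereas the reference corpus in the article is measured in words.

```python
import random
from collections import Counter, defaultdict


def estimate_proportions(pilot_labels):
    """Estimate the share of each text type from a pilot sample of labels."""
    counts = Counter(pilot_labels)
    total = sum(counts.values())
    return {text_type: n / total for text_type, n in counts.items()}


def stratified_sample(documents, proportions, sample_size, seed=0):
    """Draw a proportionally allocated stratified random sample.

    documents:   list of (doc_id, text_type) pairs (hypothetical format).
    proportions: mapping of text_type to its estimated share.
    """
    rng = random.Random(seed)

    # Group document IDs into strata by text type.
    strata = defaultdict(list)
    for doc_id, text_type in documents:
        strata[text_type].append(doc_id)

    sample = []
    for text_type, share in proportions.items():
        pool = strata.get(text_type, [])
        # Allocate slots in proportion to the estimated share,
        # capped at the number of documents available in the stratum.
        k = min(len(pool), round(share * sample_size))
        sample.extend(rng.sample(pool, k))
    return sample


# Hypothetical pilot sample: 60% memos, 30% letters, 10% reports.
pilot = ["memo"] * 6 + ["letter"] * 3 + ["report"] * 1
proportions = estimate_proportions(pilot)

# Hypothetical full collection of labelled documents.
collection = (
    [(f"memo-{i}", "memo") for i in range(100)]
    + [(f"letter-{i}", "letter") for i in range(100)]
    + [(f"report-{i}", "report") for i in range(100)]
)
reference_sample = stratified_sample(collection, proportions, sample_size=20)
```

Sampling within each stratum separately keeps rare text types represented in proportion to their estimated share, which a simple random draw from the whole collection would not guarantee for a small sample.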

Item Type: Articles
Status: Published
Refereed: Yes
Glasgow Author(s) Enlighten ID: Kretzschmar, Professor William
Authors: Kretzschmar, W.A., Darwin, C., Brown, C., Rubin, D.L., and Biber, D.
College/School: College of Arts & Humanities > School of Critical Studies > English Language and Linguistics
Journal Name: Journal of English Linguistics
ISSN: 0075-4242
ISSN (Online): 1552-5457
