Internet delivery of time-synchronised multimedia: the SCOTS projects

Anderson, W. and Beavan, D. (2005) Internet delivery of time-synchronised multimedia: the SCOTS projects. In: Corpus Linguistics 2005, Birmingham, 14-17 July 2005.

Publisher's URL: http://www.corpus.bham.ac.uk/PCLC

Abstract

The Scottish Corpus of Texts and Speech (SCOTS) Project at Glasgow University aims to make available over the Internet a 4 million-word multimedia corpus of texts in the languages of Scotland. Twenty percent of this final total will comprise spoken language, in a combination of audio and video material. Versions of SCOTS have been accessible on the Internet since November 2004, and regular additions are made to the Corpus as texts are processed and functionality is improved. While the Corpus is a valuable resource for research, our target users also include the general public, and this has important implications for the nature of the Corpus and website. This paper will begin with a general introduction to the SCOTS Project, and in particular to the nature of our data. The main part of the paper will then present the approach taken to spoken texts. Transcriptions are made using Praat (Boersma and Weenink, University of Amsterdam), which produces a time-based transcription and allows for multiple speakers through independent tiers. This output is then processed to produce a turn-based transcription with overlap and non-linguistic noises indicated. As this transcription is synchronised with the source audio/video material, it allows users direct access to any particular passage of the recording, possibly based upon a word query. This process and the end result will be demonstrated and discussed. We shall end by considering the value which is added to an Internet-delivered Corpus by these means of treating spoken text. The advantages include the possibility of returning search results from both written texts and multimedia documents; the easy location of the relevant section of the audio file; and the production through Praat of a turn-based orthographic transcription, which is accessible to a general as well as an academic user. These techniques can also be extended to other research requirements, such as the mark-up of gesture in video texts.
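
The conversion from a time-based, tiered transcription to a turn-based one can be illustrated with a minimal sketch. This is not the SCOTS Project's own code: the data layout (one list of `(start, end, text)` intervals per speaker, already read out of the transcription tool), the function names `tiers_to_turns` and `find_word`, and the rule used to flag overlap are all assumptions made for illustration. Non-linguistic noises are not handled.

```python
def tiers_to_turns(tiers):
    """Merge per-speaker tiers into a single list of turns.

    `tiers` maps a speaker label to a list of (start, end, text)
    intervals in seconds. This layout is a hypothetical stand-in for
    the output of a tiered, time-based transcription.
    """
    # Flatten all tiers, dropping empty (silent) intervals.
    intervals = [
        (start, end, speaker, text.strip())
        for speaker, items in tiers.items()
        for start, end, text in items
        if text.strip()
    ]
    intervals.sort(key=lambda iv: (iv[0], iv[1]))

    turns = []
    latest_end = {}  # speaker -> latest end time seen so far
    for start, end, speaker, text in intervals:
        # Assumed rule: an interval is overlapped speech if it begins
        # before another speaker's most recent interval has ended.
        overlap = any(
            other != speaker and other_end > start
            for other, other_end in latest_end.items()
        )
        if turns and turns[-1]["speaker"] == speaker:
            # Same speaker continues: extend the current turn.
            turns[-1]["text"] += " " + text
            turns[-1]["end"] = end
            turns[-1]["overlap"] = turns[-1]["overlap"] or overlap
        else:
            turns.append({"speaker": speaker, "start": start,
                          "end": end, "text": text, "overlap": overlap})
        latest_end[speaker] = max(latest_end.get(speaker, 0.0), end)
    return turns


def find_word(turns, word):
    """Return the start times of turns containing `word` (case-insensitive)."""
    word = word.lower()
    return [t["start"] for t in turns if word in t["text"].lower().split()]
```

Because every turn keeps its start time, a word query can be answered with offsets into the recording, which a media player could then seek to. Only the incoming interval is flagged as overlapping here; a fuller treatment would also mark the turn that is being overlapped.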

Item Type: Conference Proceedings
Keywords: Scottish Corpus of Texts & Speech, audiovisual data, synchronisation
Status: Published
Refereed: Yes
Glasgow Author(s) Enlighten ID: Anderson, Professor Wendy and Beavan, Mr David
Authors: Anderson, W., and Beavan, D.
Subjects: P Language and Literature > PE English; P Language and Literature > P Philology. Linguistics
College/School: College of Arts & Humanities > School of Critical Studies > English Language and Linguistics
Publisher: UCREL
ISSN: 1747-9398
Copyright Holders: Copyright © 2005 UCREL
First Published: First published in Proceedings of the Corpus Linguistics 2005 conference
Publisher Policy: Reproduced in accordance with the copyright policy of the publisher
