Building a Document Genre Corpus: a Profile of the KRYS I Corpus

Berninger, V., Kim, Y. and Ross, S. (2008) Building a Document Genre Corpus: a Profile of the KRYS I Corpus. In: BCS-IRSG Workshop on Corpus Profiling, London, 18 October 2008.

This paper describes the KRYS I corpus, consisting of documents classified into 70 genre classes. It was constructed as part of an effort to automate document genre classification as distinct from topic detection. Previously, there has been very little work on building corpora of texts classified using a non-topical genre palette. This is partly because genre, as a concept rooted in philosophy, rhetoric and literature, is highly complex and domain dependent in its interpretation ([11]). The usefulness of genre in everyday information search is only now starting to be recognised, and no genre classification schema has yet been consolidated as having practical value in this direction. By presenting our experiences in constructing the KRYS I corpus, we hope to shed light on information gathering and seeking behaviour and the role of genre in these activities, and to suggest a way forward for creating a better corpus for testing automated genre classification tasks and for applying these tasks to other domains.

Item Type: Conference Proceedings
Keywords: genre classification, digital preservation, metadata extraction, corpus building, ground truth
Glasgow Author(s) Enlighten ID: Kim, Dr Yunhyong and Ross, Professor Seamus
Authors: Berninger, V., Kim, Y., and Ross, S.
Subjects: T Technology > T Technology (General)
Z Bibliography. Library Science. Information Resources > Z665 Library Science. Information Science
College/School: College of Arts > School of Humanities > Information Studies

Project Code / Award No | Project Name | Principal Investigator | Funder's Name | Funder Ref | Lead Dept
365851 | DELOS - Network of Excellence on Digital Libraries | Seamus Ross | European Commission (EC) | 507618 | Humanities Advanced Technology and Information Institute
374172 | National Digital Curation Centre (NDCC) | Seamus Ross | Engineering & Physical Sciences Research Council (EPSRC) | GR/T07374/01 H3 | Humanities Advanced Technology and Information Institute
374176 | National Digital Curation Centre (NDCC) | Seamus Ross | Joint Information Systems Committee (JISC) | NAK/JAB/NESC005 | Humanities Advanced Technology and Information Institute