Identifying chronological and coherent information threads using 5W1H questions and temporal relationships

Narvala, H., McDonald, G. and Ounis, I. (2023) Identifying chronological and coherent information threads using 5W1H questions and temporal relationships. Information Processing and Management, 60(3), 103274. (doi: 10.1016/j.ipm.2023.103274)

Text: 289345.pdf - Published Version (1MB)
Available under License Creative Commons Attribution.

Abstract

Due to the massive volume of articles produced online every day, it is challenging for online platforms (e.g., news agencies) to present the information about an event, activity or discussion to their users in an easily digestible format. Therefore, there is a need for automatic methods to extract related and time-ordered information about events (i.e., information threads) from large unstructured collections of documents. In this work, we propose a novel unsupervised hierarchical agglomerative clustering (HAC) based information threading approach to generate chronological and coherent threads of information in a collection. Unlike the well-known tasks of topic detection and tracking or event threading, which focus on grouping information by important keywords and/or entities, our proposed approach identifies threads based on temporal relations and diverse information about an event, i.e., who did what, why, where, when and how (aka the 5W1H questions). In particular, our proposed approach deploys a tailored similarity function for HAC by leveraging extracted answers to 5W1H questions along with time decay between documents. We evaluate our proposed HAC 5W1H information threading approach on two large expert-annotated collections of news articles, i.e., NewSHead and Multi-News (over 112k and 32k articles, respectively). Our experiments show that HAC 5W1H markedly improves the number and quality of the generated threads compared to existing state-of-the-art approaches from the literature, e.g., 100.98% more threads and a 213.39% improvement in Normalised Mutual Information compared to the best evaluated baseline on the larger NewSHead collection. We also conducted a user study, which shows that our proposed HAC 5W1H information threading approach is significantly (p < 0.05) preferred by users in terms of coherence, diversity and chronological correctness compared to the existing state-of-the-art approaches.
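
To illustrate the general idea described in the abstract, the following is a minimal sketch of agglomerative clustering driven by a similarity that combines overlap between 5W1H answers with a time decay. It is not the authors' implementation: the Jaccard overlap, the exponential decay, the average linkage, the `decay_rate` and `threshold` values, and the toy documents are all assumptions chosen for illustration, and the 5W1H answers are assumed to have been extracted already.

```python
import math
from datetime import date

QUESTIONS = ("who", "what", "when", "where", "why", "how")


def jaccard(a, b):
    """Token overlap between two answer lists; 0.0 when both are empty."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similarity(d1, d2, decay_rate=0.1):
    """Mean 5W1H answer overlap, damped by the gap in days (assumed form)."""
    overlap = sum(
        jaccard(d1["answers"].get(q, ()), d2["answers"].get(q, ()))
        for q in QUESTIONS
    ) / len(QUESTIONS)
    gap_days = abs((d1["date"] - d2["date"]).days)
    return overlap * math.exp(-decay_rate * gap_days)


def hac_threads(docs, threshold=0.2, decay_rate=0.1):
    """Average-linkage HAC; stops when no pair of clusters reaches threshold."""
    clusters = [[i] for i in range(len(docs))]

    def linkage(c1, c2):
        total = sum(similarity(docs[i], docs[j], decay_rate) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best, pair = -1.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                score = linkage(clusters[a], clusters[b])
                if score > best:
                    best, pair = score, (a, b)
        if best < threshold:
            break
        a, b = pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    # Order each thread chronologically by document date.
    return [sorted(c, key=lambda i: docs[i]["date"]) for c in clusters]


# Toy documents with hand-written 5W1H answers (hypothetical data).
docs = [
    {"date": date(2023, 1, 3),
     "answers": {"who": ["council"], "what": ["approves", "budget"], "where": ["glasgow"]}},
    {"date": date(2023, 1, 1),
     "answers": {"who": ["council"], "what": ["proposes", "budget"], "where": ["glasgow"]}},
    {"date": date(2023, 1, 2),
     "answers": {"who": ["team"], "what": ["wins", "final"], "where": ["london"]}},
    {"date": date(2023, 1, 4),
     "answers": {"who": ["team"], "what": ["celebrates", "final"], "where": ["london"]}},
]

threads = hac_threads(docs)
print(threads)  # [[1, 0], [2, 3]]
```

In this toy run the two budget articles and the two sports articles form separate threads, each listed in date order. A larger `decay_rate` would make documents published far apart less likely to join the same thread.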

Item Type: Articles
Status: Published
Refereed: Yes
Glasgow Author(s) Enlighten ID: McDonald, Dr Graham and Narvala, Hitarth and Ounis, Professor Iadh
Creator Roles:
Narvala, H.: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Project administration, Writing – original draft
McDonald, G.: Supervision, Conceptualization, Methodology, Investigation, Project administration, Writing – review and editing
Ounis, I.: Supervision, Conceptualization, Methodology, Investigation, Project administration, Writing – review and editing
Authors: Narvala, H., McDonald, G., and Ounis, I.
College/School: College of Science and Engineering > School of Computing Science
Journal Name: Information Processing and Management
Publisher: Elsevier
ISSN: 0306-4573
ISSN (Online): 1873-5371
Published Online: 18 January 2023
Copyright Holders: Copyright © 2023 The Authors
First Published: First published in Information Processing and Management 60(3): 103274
Publisher Policy: Reproduced under a Creative Commons License
