Narvala, H., McDonald, G. and Ounis, I. (2023) Identifying chronological and coherent information threads using 5W1H questions and temporal relationships. Information Processing and Management, 60(3), 103274. (doi: 10.1016/j.ipm.2023.103274)
Text: 289345.pdf - Published Version (1MB). Available under License Creative Commons Attribution.
Abstract
Due to the massive volume of articles produced online every day, it is challenging for online platforms (e.g., news agencies) to present the information about an event, activity or discussion to their users in an easily digestible format. Therefore, there is a need for automatic methods to extract related and time-ordered information about events (i.e., information threads) from large unstructured collections of documents. In this work, we propose a novel unsupervised hierarchical agglomerative clustering (HAC) based information threading approach to generate chronological and coherent threads of information in a collection. Unlike the well-known tasks of topic detection and tracking or event threading, which focus on grouping information by important keywords and/or entities, our proposed approach identifies threads based on temporal relations and diverse information about an event, i.e., who did what, why, where, when and how (aka the 5W1H questions). In particular, our proposed approach deploys a tailored similarity function for HAC by leveraging extracted answers to 5W1H questions along with time decay between documents. We evaluate our proposed HAC 5W1H information threading approach on two large expert-annotated collections of news articles, i.e., NewSHead and Multi-News (over 112k and 32k articles, respectively). Our experiments show that HAC 5W1H markedly improves the number and quality of the threads that are generated compared to existing state-of-the-art approaches from the literature, e.g., 100.98% more threads and a +213.39% improvement in Normalised Mutual Information compared to the best evaluated baseline on the larger NewSHead collection. We also conducted a user study, which shows that our proposed HAC 5W1H information threading approach is significantly (p < 0.05) preferred by users in terms of coherence, diversity and chronological correctness compared to the existing state-of-the-art approaches.
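To make the idea of a tailored HAC similarity function more concrete, the following is a minimal illustrative sketch, not the authors' implementation. The abstract does not give the actual formula, so everything here is an assumption: 5W1H answers are represented as a plain set of tokens, their overlap is scored with Jaccard similarity, time decay is modelled as an exponential in the gap in days, and clusters are merged with average linkage until the best merge falls below a hypothetical `threshold`. The names `Doc`, `similarity`, `cluster_threads`, `decay_rate` and `threshold` are invented for illustration.

```python
import math
from dataclasses import dataclass
from datetime import date


@dataclass
class Doc:
    doc_id: str
    published: date
    answers: frozenset  # tokens from extracted 5W1H answers (assumed representation)


def similarity(a, b, decay_rate=0.1):
    """Jaccard overlap of 5W1H answer tokens, damped by an exponential time decay.

    Both the Jaccard measure and the exponential decay are assumptions.
    """
    union = a.answers | b.answers
    if not union:
        return 0.0
    overlap = len(a.answers & b.answers) / len(union)
    gap_days = abs((a.published - b.published).days)
    return overlap * math.exp(-decay_rate * gap_days)


def cluster_threads(docs, threshold=0.2, decay_rate=0.1):
    """Average-linkage agglomerative clustering over the similarity above.

    Merging stops once the most similar pair of clusters scores below
    `threshold` (a hypothetical cut-off). Each resulting thread is returned
    in chronological order.
    """
    clusters = [[d] for d in docs]

    def linkage(c1, c2):
        total = sum(similarity(a, b, decay_rate) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = linkage(clusters[i], clusters[j])
                if score > best:
                    best, pair = score, (i, j)
        if best < threshold:
            break
        i, j = pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]

    return [sorted(c, key=lambda d: d.published) for c in clusters]


# Toy collection: d4 repeats d2's answers but is published seven weeks later.
docs = [
    Doc("d1", date(2023, 1, 1), frozenset({"mayor", "announces", "budget", "glasgow"})),
    Doc("d2", date(2023, 1, 2), frozenset({"mayor", "budget", "glasgow", "council"})),
    Doc("d3", date(2023, 1, 3), frozenset({"storm", "hits", "coast"})),
    Doc("d4", date(2023, 2, 20), frozenset({"mayor", "budget", "glasgow", "council"})),
]
threads = cluster_threads(docs)
thread_ids = [[d.doc_id for d in t] for t in threads]
```

In this toy collection, `d1` and `d2` share most of their answer tokens and are one day apart, so they merge into one thread. `d4` has the same tokens as `d2`, but the seven-week gap shrinks its similarity far below the cut-off, so it stays separate. This is the kind of effect a time-decay term is meant to produce, under the assumptions stated above.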
Item Type: | Articles |
---|---|
Status: | Published |
Refereed: | Yes |
Glasgow Author(s) Enlighten ID: | McDonald, Dr Graham and Narvala, Hitarth and Ounis, Professor Iadh |
Creator Roles: | Narvala, H.: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Project administration, Writing – original draft; McDonald, G.: Supervision, Conceptualization, Methodology, Investigation, Project administration, Writing – review and editing; Ounis, I.: Supervision, Conceptualization, Methodology, Investigation, Project administration, Writing – review and editing |
Authors: | Narvala, H., McDonald, G., and Ounis, I. |
College/School: | College of Science and Engineering > School of Computing Science |
Journal Name: | Information Processing and Management |
Publisher: | Elsevier |
ISSN: | 0306-4573 |
ISSN (Online): | 1873-5371 |
Published Online: | 18 January 2023 |
Copyright Holders: | Copyright © 2023 The Authors |
First Published: | First published in Information Processing and Management 60(3): 103274 |
Publisher Policy: | Reproduced under a Creative Commons License |