Multi-Modal Dialog State Tracking for Interactive Fashion Recommendation

Wu, Y., Macdonald, C. and Ounis, I. (2022) Multi-Modal Dialog State Tracking for Interactive Fashion Recommendation. In: ACM Conference on Recommender Systems (RecSys 2022), Seattle, USA, 18-23 Sep 2022, pp. 124-133. ISBN 9781450392785 (doi: 10.1145/3523227.3546774)




Multi-modal interactive recommendation is a task in which users receive visual recommendations and express natural-language feedback about the recommended items across multiple iterations of interaction. However, such multi-modal dialog sequences (i.e. turns consisting of the system's visual recommendations and the user's natural-language feedback) make it challenging to correctly incorporate the users' preferences across multiple turns. Indeed, existing formulations of interactive recommender systems fail to capture the multi-modal sequential dependencies between textual feedback and visual recommendations, because they rely on recurrent neural network-based (i.e., RNN-based) or transformer-based models alone. To alleviate this multi-modal sequential dependency issue, we propose a novel multi-modal recurrent attention network (MMRAN) model to effectively incorporate the users' preferences over the long visual dialog sequences of the users' natural-language feedback and the system's visual recommendations. Specifically, we leverage a gated recurrent network (GRN) with a feedback gate to separately process the textual and visual representations of the natural-language feedback and the visual recommendations into hidden states (i.e. representations of the past interactions) for multi-modal sequence combination. In addition, we apply a multi-head attention network (MAN) to refine the hidden states generated by the GRN and to further enhance the model's ability to dynamically track the dialog state. Following previous work, we conduct extensive experiments on the Fashion IQ Dresses, Shirts, and Tops & Tees datasets to assess the effectiveness of our proposed model, using a vision-language transformer-based user simulator as a surrogate for real human users. Our results show that our proposed MMRAN model significantly outperforms several existing state-of-the-art baseline models.
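The abstract's two components — a gated recurrent network with a feedback gate that fuses each turn's textual feedback and visual recommendation into a hidden state, followed by multi-head self-attention that refines the turn-level hidden states — can be illustrated with a minimal NumPy sketch. This is a hedged reconstruction of the general idea only: the dimensions, gate parameterisation, and fusion rule below are assumptions for illustration, not the paper's exact MMRAN equations.

```python
import numpy as np

rng = np.random.default_rng(0)
D, HEADS = 8, 2            # embedding size and attention heads (assumed values)
DK = D // HEADS            # per-head dimension

def init(*shape):
    return rng.standard_normal(shape) * 0.1

# --- Gated recurrent network (GRN) with a feedback gate --------------------
# The gate f_t weighs the user's textual feedback against the system's
# visual recommendation before a GRU-style recurrent update (illustrative
# parameterisation, not the authors' exact formulation).
Wf = init(D, 2 * D)                      # feedback gate weights
Wz, Uz = init(D, D), init(D, D)          # update gate weights
Wh, Uh = init(D, D), init(D, D)          # candidate state weights

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def grn_step(h_prev, text_t, img_t):
    """One dialog turn: gated multi-modal fusion, then a recurrent update."""
    f = sigmoid(Wf @ np.concatenate([text_t, img_t]))  # feedback gate
    x = f * text_t + (1.0 - f) * img_t                 # gated fusion of modalities
    z = sigmoid(Wz @ x + Uz @ h_prev)                  # update gate
    h_cand = np.tanh(Wh @ x + Uh @ h_prev)             # candidate hidden state
    return (1.0 - z) * h_prev + z * h_cand

# --- Multi-head attention (MAN) refining the GRN hidden states -------------
Wq, Wk, Wv, Wo = init(D, D), init(D, D), init(D, D), init(D, D)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def refine(H):
    """Scaled dot-product self-attention over turn-level states H (T, D)."""
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    heads = []
    for i in range(HEADS):
        s = slice(i * DK, (i + 1) * DK)
        A = softmax(Q[:, s] @ K[:, s].T / np.sqrt(DK))  # (T, T) attention
        heads.append(A @ V[:, s])
    return np.concatenate(heads, axis=1) @ Wo

# Run a 5-turn dialog with random stand-ins for text/image embeddings.
T = 5
h = np.zeros(D)
states = []
for _ in range(T):
    h = grn_step(h, init(D), init(D))
    states.append(h)
refined = refine(np.stack(states))
print(refined.shape)  # (5, 8)
```

The feedback gate here lets the model decide, per turn and per dimension, how much the new textual critique should override the visual evidence before the state update; the attention pass then lets every turn's state attend to all others, which is the "dynamic state tracking" refinement the abstract attributes to the MAN component.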

Item Type: Conference Proceedings
Glasgow Author(s) Enlighten ID: Macdonald, Professor Craig and Wu, Mr Yaxiong and Ounis, Professor Iadh
Authors: Wu, Y., Macdonald, C., and Ounis, I.
College/School: College of Science and Engineering > School of Computing Science
Copyright Holders: Copyright © 2022 Association for Computing Machinery
Publisher Policy: Reproduced in accordance with the copyright policy of the publisher
