Train Faster, Perform Better: Modular Adaptive Training in Over-Parameterized Models

Shi, Y. et al. (2023) Train Faster, Perform Better: Modular Adaptive Training in Over-Parameterized Models. In: 37th Annual Conference on Neural Information Processing Systems (NeurIPS 2023), New Orleans, Louisiana, USA, 10-16 Dec 2023.

316900.pdf - Accepted Version (2MB)

Publisher's URL: https://proceedings.neurips.cc/paper_files/paper/2023/hash/516fd05dc408fd6d6374940a83930193-Abstract-Conference.html

Abstract

Despite their prevalence in deep-learning communities, over-parameterized models impose high computational costs for proper training. This work studies the fine-grained, modular-level learning dynamics of over-parameterized models to attain a more efficient and fruitful training strategy. Empirical evidence reveals that, when scaling down to network modules such as heads in self-attention models, we can observe varying learning patterns implicitly associated with each module's trainability. To describe such modular-level learning capabilities, we introduce a novel concept dubbed the modular neural tangent kernel (mNTK), and we demonstrate that the quality of a module's learning is tightly associated with the principal eigenvalue λ_max of its mNTK. A large λ_max indicates that the module learns features with better convergence, while a small one may impact generalization negatively. Inspired by this discovery, we propose a novel training strategy termed Modular Adaptive Training (MAT), which selectively updates only those modules whose λ_max exceeds a dynamic threshold, concentrating the model on learning common features and ignoring inconsistent ones. Unlike most existing training schemes, which run a complete backpropagation cycle across all network modules, MAT saves computation significantly through its partial-updating strategy and can further improve performance. Experiments show that MAT nearly halves the computational cost of model training and outperforms the baselines in accuracy.
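The abstract outlines MAT's core mechanism: estimate each module's mNTK principal eigenvalue λ_max on a mini-batch and update only the modules whose λ_max exceeds a dynamic threshold. The sketch below illustrates that selection loop in PyTorch under stated assumptions; it is not the authors' implementation. The EMA threshold rule, the helper names, and the assumption that per-sample gradients are already available per module are all illustrative choices.

```python
# Hypothetical sketch of MAT-style selective module updates (not the paper's code).
# Assumes a PyTorch model whose top-level children are the "modules" of interest
# (e.g. attention heads), and that per-sample gradients have been computed.
import torch

def module_lambda_max(per_sample_grads: torch.Tensor) -> float:
    """Principal eigenvalue of a module's empirical mNTK on one mini-batch.

    per_sample_grads: (B, P) tensor, one flattened gradient row per sample.
    The mNTK restricted to this batch is the B x B Gram matrix J J^T, whose
    nonzero spectrum matches that of the kernel evaluated on these samples.
    """
    gram = per_sample_grads @ per_sample_grads.T   # (B, B) Gram matrix
    return torch.linalg.eigvalsh(gram)[-1].item()  # largest eigenvalue

def select_modules(lambda_max_by_module: dict, history: dict, beta: float = 0.9) -> set:
    """Dynamic threshold as an EMA of the mean λ_max across modules (an assumption;
    the paper only states that the threshold is dynamic). Modules above the
    threshold stay active; the rest are frozen for this step."""
    mean_now = sum(lambda_max_by_module.values()) / len(lambda_max_by_module)
    history["ema"] = beta * history.get("ema", mean_now) + (1 - beta) * mean_now
    return {m for m, lam in lambda_max_by_module.items() if lam >= history["ema"]}

def apply_selection(model: torch.nn.Module, active_modules: set) -> None:
    """Enable gradients only for selected modules, so backprop (and the optimizer)
    skips the frozen ones; this is where the partial-update savings come from."""
    for name, module in model.named_children():
        for p in module.parameters():
            p.requires_grad_(name in active_modules)
```

In a full training loop, one would periodically refresh the per-module λ_max estimates (per-sample gradients can be obtained with tools such as torch.func.grad and vmap), recompute the active set, and then run backpropagation only through the selected modules, which is how a partial-update scheme of this kind avoids a complete BP cycle.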

Item Type: Conference Proceedings
Additional Information: This work was supported by the National Natural Science Foundation of China under Grant No. 62090025, the National Key R&D Program of China under Grant No. 2022YFB4400400, and the China Postdoctoral Science Foundation under Grant No. 2022M720767.
Status: Published
Refereed: Yes
Glasgow Author(s) Enlighten ID: Yang, Dr Xiaochen
Authors: Shi, Y., Chen, Y., Dong, M., Yang, X., Li, D., Wang, Y., Dick, R. P., Lv, Q., Zhao, Y., Yang, F., Lu, T., Gu, N., and Shang, L.
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
College/School: College of Science and Engineering > School of Mathematics and Statistics > Statistics
Copyright Holders: Copyright © The Author(s) 2023
First Published: First published in Advances in Neural Information Processing Systems 36 (NeurIPS 2023)
Publisher Policy: Reproduced with the permission of the publisher
