Shi, Y. et al. (2023) Train Faster, Perform Better: Modular Adaptive Training in Over-Parameterized Models. In: 37th Annual Conference on Neural Information Processing Systems (NeurIPS 2023), New Orleans, Louisiana, USA, 10-16 Dec 2023.
Publisher's URL: https://proceedings.neurips.cc/paper_files/paper/2023/hash/516fd05dc408fd6d6374940a83930193-Abstract-Conference.html
Abstract
Despite their prevalence in deep-learning communities, over-parameterized models impose high computational costs for proper training. This work studies the fine-grained, modular-level learning dynamics of over-parameterized models to derive a more efficient and effective training strategy. Empirical evidence reveals that, when scaling down to network modules such as heads in self-attention models, we observe varying learning patterns implicitly associated with each module's trainability. To describe such modular-level learning capabilities, we introduce a novel concept dubbed modular neural tangent kernel (mNTK), and we demonstrate that the quality of a module's learning is tightly associated with its mNTK's principal eigenvalue λ_max. A large λ_max indicates that the module learns features with better convergence, while small ones may impact generalization negatively. Inspired by this discovery, we propose a novel training strategy termed Modular Adaptive Training (MAT), which selectively updates only those modules whose λ_max exceeds a dynamic threshold, concentrating the model on learning common features and ignoring inconsistent ones. Unlike most existing training schemes with a complete BP cycle across all network modules, MAT substantially reduces computation through its partial-updating strategy and can further improve performance. Experiments show that MAT nearly halves the computational cost of model training and achieves higher accuracy than the baselines.
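The core quantities in the abstract can be illustrated with a minimal numpy sketch. This is not the authors' implementation: the toy two-module network, the finite-difference Jacobian, and the mean-of-λ_max threshold are all illustrative assumptions. The sketch only shows the idea that each module's mNTK is K = J Jᵀ (J being the Jacobian of the batch outputs with respect to that module's parameters), that λ_max is its principal eigenvalue, and that a MAT-style step would update only modules whose λ_max clears a dynamic threshold.

```python
import numpy as np

def forward(params, x):
    """Toy two-module network: module 0 is W1, module 1 is W2."""
    W1, W2 = params
    return W2 @ np.maximum(W1 @ x, 0.0)

def module_jacobian(params, X, idx, eps=1e-5):
    """Finite-difference Jacobian of all batch outputs w.r.t. module idx's weights.

    Returns an array of shape (batch * out_dim, n_module_params).
    """
    base = np.concatenate([forward(params, x) for x in X])
    cols = []
    for j in range(params[idx].size):
        pert = [p.copy() for p in params]
        pert[idx].ravel()[j] += eps          # perturb one weight of this module
        out = np.concatenate([forward(pert, x) for x in X])
        cols.append((out - base) / eps)
    return np.stack(cols, axis=1)

def mntk_lambda_max(params, X, idx):
    """Principal eigenvalue of the modular NTK, K = J J^T."""
    J = module_jacobian(params, X, idx)
    K = J @ J.T                              # mNTK over the mini-batch
    return np.linalg.eigvalsh(K)[-1]         # eigvalsh sorts ascending

rng = np.random.default_rng(0)
params = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
X = [rng.standard_normal(3) for _ in range(5)]

# MAT-style selection: update only modules whose lambda_max exceeds a
# dynamic threshold (here simply the mean across modules, an assumption).
lams = [mntk_lambda_max(params, X, i) for i in range(len(params))]
thresh = float(np.mean(lams))
active = [i for i, lam in enumerate(lams) if lam >= thresh]
```

A backward pass would then be run only for the modules in `active`, which is where the partial-update savings described in the abstract come from; the paper's actual threshold schedule and mNTK estimator are more involved than this sketch.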
Item Type: Conference Proceedings
Additional Information: This work was supported by National Natural Science Foundation of China under Grant No. 62090025, National Key R&D Program of China under Grant No. 2022YFB4400400, and China Postdoctoral Science Foundation No. 2022M720767.
Status: Published
Refereed: Yes
Glasgow Author(s) Enlighten ID: Yang, Dr Xiaochen
Authors: Shi, Y., Chen, Y., Dong, M., Yang, X., Li, D., Wang, Y., Dick, R. P., Lv, Q., Zhao, Y., Yang, F., Lu, T., Gu, N., and Shang, L.
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
College/School: College of Science and Engineering > School of Mathematics and Statistics > Statistics
Copyright Holders: Copyright © The Author(s) 2023
First Published: First published in Advances in Neural Information Processing Systems 36 (NeurIPS 2023)
Publisher Policy: Reproduced with the permission of the publisher