BAM! Just Like That: Simple and Efficient Parameter Upcycling for Mixture of Experts
Qizhen Zhang, Nikolas Gritsch, Dwaraknath Gnaneshwar, Simon Guo, David Cairuz, Bharat Venkitesh, Jakob Foerster, Phil Blunsom, Sebastian Ruder, Ahmet Üstün, Acyr Locatelli
To appear at Conference on Neural Information Processing Systems (NeurIPS), 2024
NGSM (Spotlight) and ES-FoMo-II Workshops at International Conference on Machine Learning (ICML), 2024
Upcycling MoE with Mixture-of-Attention for more efficient MoE pre-training