Mixtures of Experts as an Auction: The BASE Transformer
How re-formulating MoE as an auction guarantees perfect load balance without auxiliary losses - and beats the Switch Transformer
Sparse Mixtures of Experts (MoE) have become a key technology in the latest generation of LLMs such as Google’s Switch Transformer, OpenAI’s GPT-4, Mistral AI’s Mixtral-8x7b, and more. In a nutshell, sparse MoE is an extremely powerful technology because - in theory - it allows us to scale up the capacity of any model with a…
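To make the auction framing from the title concrete: BASE-style routing treats token-to-expert assignment as a balanced assignment problem, so every expert receives the same number of tokens by construction rather than via an auxiliary balancing loss. Here is a minimal sketch of that idea; the sizes, the random affinity scores, and the use of SciPy's `linear_sum_assignment` (as a stand-in for the auction-style solver, not the paper's actual implementation) are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical sizes: 8 tokens routed across 4 experts, so each expert
# must receive exactly 8 / 4 = 2 tokens for perfect load balance.
num_tokens, num_experts = 8, 4
capacity = num_tokens // num_experts

rng = np.random.default_rng(0)
# Token-to-expert affinity scores. In the real model these come from a
# learned routing layer; random numbers are just a placeholder here.
scores = rng.standard_normal((num_tokens, num_experts))

# Duplicate each expert column `capacity` times: a one-to-one assignment
# over the expanded matrix gives every expert exactly `capacity` tokens.
cost = -np.repeat(scores, capacity, axis=1)  # negate to maximize affinity
token_idx, slot_idx = linear_sum_assignment(cost)
expert_of_token = slot_idx // capacity

print(expert_of_token)               # expert index chosen for each token
print(np.bincount(expert_of_token))  # [2 2 2 2] -> perfectly balanced
```

Because the assignment itself enforces the capacity constraint, the balance is exact by construction - there is no load-balancing loss term to tune.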