Mixtures of Experts as an Auction: The BASE Transformer
How re-formulating MoE as an auction guarantees perfect load balance without auxiliary losses - and beats the Switch Transformer
Sparse Mixtures of Experts (MoE) have become a key technology in the latest generation of LLMs such as Google’s Switch Transformer, OpenAI’s GPT-4, Mistral AI’s Mixtral-8x7b, and more. In a nutshell, sparse MoE is an extremely powerful technology because - in theory - it allows us to scale up the capacity of any model with a…
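To make the auction framing from the title concrete: BASE-style routing treats token-to-expert assignment as a balanced assignment problem, so every expert receives the same number of tokens by construction rather than via an auxiliary balancing loss. Here is a minimal sketch of that idea; the sizes, the random affinity scores, and the use of SciPy's `linear_sum_assignment` (as a stand-in for the auction-style solver, not the paper's actual implementation) are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical sizes: 8 tokens routed across 4 experts, so each expert
# must receive exactly 8 / 4 = 2 tokens for perfect load balance.
num_tokens, num_experts = 8, 4
capacity = num_tokens // num_experts

rng = np.random.default_rng(0)
# Token-to-expert affinity scores. In the real model these come from a
# learned routing layer; random numbers are just a placeholder here.
scores = rng.standard_normal((num_tokens, num_experts))

# Duplicate each expert column `capacity` times: a one-to-one assignment
# over the expanded matrix gives every expert exactly `capacity` tokens.
cost = -np.repeat(scores, capacity, axis=1)  # negate to maximize affinity
token_idx, slot_idx = linear_sum_assignment(cost)
expert_of_token = slot_idx // capacity

print(expert_of_token)               # expert index chosen for each token
print(np.bincount(expert_of_token))  # [2 2 2 2] -> perfectly balanced
```

Because the assignment itself enforces the capacity constraint, the balance is exact by construction - there is no load-balancing loss term to tune.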