A release via torrent
On 8 December 2023, Mistral AI posts a magnet link to a torrent on X, with no accompanying press release, blog post or paper; the official announcement follows on 11 December. The download contains Mixtral 8x7B, one of the first large Open Source models based on a Sparse Mixture-of-Experts (SMoE) architecture. The release style, which has become a signature of the company, favours direct distribution of model weights over institutional announcements.
The licence is Apache 2.0, with no commercial restrictions or acceptable-use clauses. The model becomes available on Hugging Face in the following days.
Sparse MoE architecture
Mixtral 8x7B is not a dense 56-billion-parameter model. In each transformer layer, the feed-forward block is replaced by 8 feed-forward experts (FFN) and a router that, for each incoming token, dynamically selects the 2 most relevant experts. Because only the feed-forward blocks are replicated, total parameters are 46.7 billion rather than 8 x 7 billion, and the number of parameters activated for each token is approximately 12.9 billion. The compute per token is therefore a fraction of what a dense model of the same total size would need.
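The two figures above can be reproduced with a short back-of-the-envelope count. The sketch below is illustrative only: the hyperparameters (hidden size 4096, FFN size 14336, 32 layers, vocabulary of 32,000, grouped-query attention with a 1024-wide key/value projection) are assumptions taken from the published model configuration, not from this article, and the function name is hypothetical.

```python
# Assumed hyperparameters (from the published Mixtral 8x7B configuration).
HIDDEN = 4096        # model (hidden) dimension
FFN_DIM = 14336      # inner dimension of each expert FFN
LAYERS = 32          # transformer layers
EXPERTS = 8          # experts per layer
ACTIVE_EXPERTS = 2   # experts selected per token
VOCAB = 32000        # vocabulary size
KV_DIM = 1024        # key/value projection width (8 KV heads x 128)


def mixtral_param_counts():
    """Return (total, active_per_token) parameter counts."""
    expert = 3 * HIDDEN * FFN_DIM                           # three weight matrices per expert FFN
    attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM   # Q and O, plus K and V
    router = HIDDEN * EXPERTS                               # one linear gate per layer
    norms = 2 * HIDDEN                                      # two norm weight vectors per layer
    # Everything except the experts is used by every token.
    shared = LAYERS * (attention + router + norms)
    shared += 2 * VOCAB * HIDDEN + HIDDEN                   # embeddings, output head, final norm
    total = shared + LAYERS * EXPERTS * expert
    active = shared + LAYERS * ACTIVE_EXPERTS * expert
    return total, active


total, active = mixtral_param_counts()
print(f"total {total / 1e9:.1f}B, active {active / 1e9:.1f}B")
# prints: total 46.7B, active 12.9B
```

Under these assumptions the experts account for roughly 45 of the 46.7 billion parameters, which is why selecting 2 of 8 cuts the active count so sharply.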
Routing is trained end-to-end with the rest of the network. In each layer, the gate computes one logit per expert, keeps the top-2 experts and normalises their two logits with a softmax; the expert outputs are then summed using these weights, which add up to 1. Attention layers are shared across all experts; only the feed-forward networks are specialised.
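The gating step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not Mistral's implementation: the function names, the toy experts (each one simply scales its input) and the hand-written router logits are all hypothetical. In the real model the logits come from a learned linear layer applied to the token's hidden state.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def top2_gate(router_logits: Sequence[float]) -> List[Tuple[int, float]]:
    """Return (expert_index, weight) pairs for the two highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:2]
    # Softmax restricted to the two selected logits, so the weights sum to 1.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    norm = sum(exps)
    return [(i, e / norm) for i, e in zip(top, exps)]


def moe_layer(x: Vector,
              router_logits: Sequence[float],
              experts: Sequence[Callable[[Vector], Vector]]) -> Vector:
    """Weighted sum of the outputs of the two selected experts only."""
    out = [0.0] * len(x)
    for idx, weight in top2_gate(router_logits):
        y = experts[idx](x)  # the other six experts are never evaluated
        out = [o + weight * v for o, v in zip(out, y)]
    return out


# Toy setup (hypothetical): expert k multiplies its input by k + 1.
experts = [lambda x, k=k: [(k + 1) * v for v in x] for k in range(8)]
logits = [0.1, 2.0, -1.0, 0.3, 1.0, 0.0, -0.5, 0.2]

selected = top2_gate(logits)                 # experts 1 and 4 are chosen
mixed = moe_layer([1.0, 2.0], logits, experts)
```

Only the two selected experts run for a given token, which is where the saving in compute comes from.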
The context window is 32,768 tokens, consistent with the choices made for Mistral 7B. Tokenisation uses the same byte-fallback BPE as the dense model.
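To make "byte fallback" concrete: when a piece of text is not covered by the vocabulary, the tokeniser emits its raw UTF-8 bytes as dedicated byte tokens instead of an unknown-token placeholder. The sketch below is a toy illustration under strong assumptions: it works on single characters rather than learned BPE merges, and the function names and the tiny vocabulary are hypothetical. Only the `<0xNN>` spelling of byte tokens follows the SentencePiece convention.

```python
def encode_with_byte_fallback(text, vocab):
    """Toy character-level encoder: unknown characters become UTF-8 byte tokens."""
    tokens = []
    for ch in text:
        if ch in vocab:
            tokens.append(ch)
        else:
            # Fall back to one token per UTF-8 byte, e.g. "<0xC3>".
            tokens.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return tokens


def decode(tokens):
    """Inverse of the toy encoder: byte tokens are reassembled into UTF-8."""
    buf = bytearray()
    for tok in tokens:
        if len(tok) == 6 and tok.startswith("<0x") and tok.endswith(">"):
            buf.append(int(tok[3:5], 16))
        else:
            buf.extend(tok.encode("utf-8"))
    return buf.decode("utf-8")


toy_vocab = set("helo")                              # hypothetical tiny vocabulary
tokens = encode_with_byte_fallback("h\u00e9llo", toy_vocab)
# tokens == ['h', '<0xC3>', '<0xA9>', 'l', 'l', 'o']
assert decode(tokens) == "h\u00e9llo"
```

The practical consequence is that any input string can be encoded and decoded losslessly, at the cost of longer token sequences for rare characters.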
Performance
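At release, according to Mistral AI's own evaluations, Mixtral 8x7B outperforms Llama 2 70B on most public benchmarks and matches or comes close to GPT-3.5 on several tasks. Inference is significantly faster than for a dense model of comparable quality, because only about 12.9 billion of the 46.7 billion parameters are used for each token.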
Impact on the ecosystem
Mixtral demonstrates that MoE architectures, until then associated mainly with research at large labs (Google's GShard and Switch Transformer), are practical in Open Source at competitive quality. It makes available model weights of a quality comparable to commercial solutions, usable on-premise or in private cloud environments without commercial licensing restrictions.
Link: mistral.ai