A release via torrent
On 11 December 2023, Mistral AI posts a magnet torrent link on X with no press release, blog post or paper. The download contains Mixtral 8x7B, the first large open-weight model built on a sparse Mixture-of-Experts (SMoE) architecture. This release style, which has become a signature of the company, favours direct distribution of model weights over institutional announcements.
The licence is Apache 2.0, with no commercial restrictions or acceptable-use clauses. The model becomes available on Hugging Face in the following days.
Sparse MoE architecture
Mixtral 8x7B is not a dense 56-billion-parameter model. Each transformer layer contains 8 feed-forward (FFN) experts and a router that, for each incoming token, dynamically selects the 2 most relevant experts. Total parameters come to 46.7 billion, but only approximately 12.9 billion are activated per token — a fraction of the computational cost of an equivalent dense model.
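The gap between total and active parameters follows directly from the architecture. A back-of-the-envelope count, using the hyperparameters from Mixtral's published configuration (dim 4096, 32 layers, FFN hidden size 14336, 32 query / 8 key-value heads, 32K vocabulary), recovers both headline figures:

```python
# Back-of-the-envelope parameter count for Mixtral 8x7B.
# Hyperparameters taken from the released model configuration.
d_model, n_layers, d_ffn = 4096, 32, 14336
n_experts, top_k = 8, 2
vocab, kv_dim = 32000, 8 * 128          # 8 KV heads of dim 128 (GQA)

attn = 2 * d_model * d_model + 2 * d_model * kv_dim  # Q, O + K, V projections
expert = 3 * d_model * d_ffn            # SwiGLU FFN: gate, up, down matrices
router = d_model * n_experts            # one linear gate per layer

per_layer_total = attn + n_experts * expert + router
per_layer_active = attn + top_k * expert + router
embeddings = 2 * vocab * d_model        # input embedding + output head

total = n_layers * per_layer_total + embeddings
active = n_layers * per_layer_active + embeddings
print(f"total:  {total / 1e9:.1f}B")    # ≈ 46.7B
print(f"active: {active / 1e9:.1f}B")   # ≈ 12.9B
```

Attention weights and embeddings are counted once because they are shared; only the FFN blocks are multiplied by 8, which is why total parameters grow much faster than the per-token cost.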
Routing is trained end-to-end with the rest of the network. For each token, the gate keeps the top-2 router logits and normalises them with a softmax; the two selected experts' outputs are combined with these proportional weights. Attention layers are shared across all experts; only the feed-forward networks are specialised.
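The mechanism can be sketched in a few lines. This is an illustrative NumPy version (the names `moe_layer` and `W_router` are ours, not Mixtral's), showing top-2 selection followed by a softmax over the selected logits:

```python
import numpy as np

def moe_layer(x, W_router, experts, top_k=2):
    """Sparse MoE feed-forward: route each token to its top-k experts.
    A sketch of the mechanism, not Mixtral's actual implementation."""
    logits = x @ W_router                          # (n_tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]  # indices of top-k experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = logits[t, top[t]]
        w = np.exp(sel - sel.max())
        w /= w.sum()                               # softmax over top-k logits only
        for weight, e in zip(w, top[t]):           # weighted sum of 2 expert outputs
            out[t] += weight * experts[e](x[t])
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))                       # 4 toy tokens, dim 16
W_router = rng.normal(size=(16, 8))                # gate over 8 experts
experts = [lambda v: v for _ in range(8)]          # toy identity experts
y = moe_layer(x, W_router, experts)
```

Because the gating weights sum to 1, identity experts return the input unchanged; in the real model each expert is a full SwiGLU FFN, and only 2 of the 8 run per token.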
The context window is 32,768 tokens. Tokenisation uses the same byte-fallback BPE tokeniser as the dense Mistral 7B.
Performance and successors
At release, Mixtral 8x7B outperforms Llama 2 70B on most public benchmarks and sits close to GPT-3.5 on several tasks, with significantly higher inference speed thanks to partial parameter activation.
In April 2024, Mistral releases Mixtral 8x22B — 141 billion total parameters, 39 billion active, 64K context — also under Apache 2.0. The MoE line runs alongside the dense Mistral models (7B, then Mistral Large), sharing the same philosophy of open weights and permissive licensing.
Impact on the ecosystem
Mixtral demonstrated that MoE architectures, until then explored mainly inside large labs (Google's Switch Transformer and GShard), were viable as open releases. It put weights of a model with quality comparable to commercial offerings into general circulation, usable in on-premise or private-cloud environments without licensing constraints.
Link: mistral.ai
