Mixtral: open source Sparse Mixture-of-Experts from Mistral AI

Mistral AI releases Mixtral 8x7B on 11 December 2023 via torrent: Sparse Mixture-of-Experts architecture, 8 experts of 7B, 2 active per token, 32K context, Apache 2.0 licence.

Open Source · AI · Mixtral · MoE · LLM · Mistral

A release via torrent

On 11 December 2023, Mistral AI posts a magnet torrent link on X with no press release, no blog post and no paper. The download contains Mixtral 8x7B, the first large open source model based on a Sparse Mixture-of-Experts (SMoE) architecture. The release style, which has become a signature of the company, favours direct distribution of model weights over institutional announcements.

The licence is Apache 2.0, with no commercial restrictions or acceptable-use clauses. The model becomes available on Hugging Face in the following days.

Sparse MoE architecture

Mixtral 8x7B is not a dense 56-billion-parameter model. It is a network with 8 feed-forward network (FFN) experts and a router that dynamically selects the 2 most relevant experts for each incoming token. Total parameters come to 46.7 billion, but only about 12.9 billion are activated per token — a fraction of the compute cost of an equivalent dense model.
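The 46.7B/12.9B split follows directly from the model's published hyperparameters. The sketch below is an approximate back-of-the-envelope count, assuming Mixtral 8x7B's known dimensions (hidden size 4096, FFN size 14336, 32 layers, 32k vocabulary, grouped-query attention with 8 KV heads); it ignores the small contribution of layer norms.

```python
# Rough parameter count for Mixtral 8x7B from its published dimensions.
# Assumptions: SwiGLU FFN (3 projection matrices per expert), untied
# input embeddings and LM head, grouped-query attention with 8 KV heads.
d_model, d_ffn, n_layers, vocab = 4096, 14336, 32, 32000
n_experts, top_k, d_kv = 8, 2, 1024  # d_kv = 8 KV heads x 128 dims

expert = 3 * d_model * d_ffn                            # gate, up, down projections
attention = 2 * d_model * d_model + 2 * d_model * d_kv  # Q and O full, K and V grouped
router = d_model * n_experts                            # one logit per expert
embeddings = 2 * vocab * d_model                        # token embeddings + LM head

total = n_layers * (n_experts * expert + attention + router) + embeddings
active = n_layers * (top_k * expert + attention + router) + embeddings

print(f"total:  {total / 1e9:.1f}B")   # ~46.7B
print(f"active: {active / 1e9:.1f}B")  # ~12.9B
```

Almost all of the gap between total and active parameters sits in the expert FFNs; attention, router, and embeddings are paid for every token regardless of routing.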

Routing is trained end-to-end with the rest of the network. For each token, the gate computes one logit per expert, keeps the top 2, and applies a softmax over those two logits to obtain the weights with which the selected experts' outputs are combined. Attention layers are shared across all experts; only the feed-forward networks are specialised.
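The gating step above can be sketched in a few lines. This is an illustrative toy, not the model's implementation: the experts are stubbed as plain linear maps (in the real network each is a full SwiGLU FFN), and only a single token vector is routed.

```python
import numpy as np

def top2_moe_layer(x, router_w, experts):
    """Sparse top-2 MoE feed-forward for a single token vector.

    x: (d_model,) token hidden state
    router_w: (d_model, n_experts) gating weights
    experts: list of callables, each mapping (d_model,) -> (d_model,)
    """
    logits = x @ router_w                  # one score per expert
    top2 = np.argsort(logits)[-2:]         # indices of the 2 best-scoring experts
    # Softmax over the two selected logits gives the combination weights.
    w = np.exp(logits[top2] - logits[top2].max())
    w /= w.sum()
    # Only the 2 selected experts run; the other 6 cost nothing this token.
    return sum(wi * experts[i](x) for wi, i in zip(w, top2))

# Toy setup with hypothetical sizes (d_model=16, 8 experts).
rng = np.random.default_rng(0)
d, n_experts = 16, 8
router_w = rng.normal(size=(d, n_experts))
mats = [rng.normal(size=(d, d)) for _ in range(n_experts)]
experts = [lambda v, m=m: m @ v for m in mats]

y = top2_moe_layer(rng.normal(size=d), router_w, experts)
print(y.shape)  # (16,)
```

Because the two gate weights are renormalised to sum to 1, the output stays on the same scale whichever pair of experts is chosen.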

The context window is 32,768 tokens, consistent with the choices made for Mistral 7B. Tokenisation uses the same byte-fallback BPE as the dense model.

Performance and successors

At release, Mixtral 8x7B outperforms Llama 2 70B on most public benchmarks and sits close to GPT-3.5 on several tasks, with significantly higher inference speed thanks to partial parameter activation.

In April 2024, Mistral releases Mixtral 8x22B — 141 billion total parameters, 39 billion active, 64K context — also under Apache 2.0. The MoE line runs alongside the dense Mistral models (7B, then Mistral Large), sharing the same philosophy of open weights and permissive licensing.

Impact on the ecosystem

Mixtral demonstrated that MoE architectures, until then the preserve of closed labs (Google's Switch Transformer, GShard), were viable in open source. It put the weights of a model with quality comparable to commercial offerings into the open, usable on-premise or in a private cloud with no licensing constraints.

Link: mistral.ai
