BLOOM: the multilingual model from BigScience

The BigScience workshop, coordinated by Hugging Face, released BLOOM on 12 July 2022: 176B parameters, 46 natural languages and 13 programming languages, under the Responsible AI Licence.


A collective scientific project

BLOOM — BigScience Large Open-science Open-access Multilingual Language Model — is the outcome of the BigScience workshop, a year-long collaborative scientific initiative (May 2021 – May 2022) coordinated by Hugging Face and involving over 1000 researchers from more than 70 countries. The model was released on 12 July 2022.

BigScience is structured like a traditional academic workshop, with thematic working groups: architecture, data, tokenisation, engineering, ethics, evaluation, project governance. The resulting model is conceived from the start as a public research artefact, not a commercial product.

Technical characteristics

BLOOM is a decoder-only transformer with 176 billion parameters and the following features:

  • 46 natural languages, with particular coverage of under-represented languages (Spanish, Arabic, African languages, Indian languages)
  • 13 programming languages
  • BPE tokeniser with a 250,680-token vocabulary designed to balance languages
  • ALiBi (Attention with Linear Biases) for positional encoding, in place of learned position embeddings
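To make the ALiBi bullet concrete, here is a minimal sketch of the idea, not BLOOM's actual implementation: each attention head gets a fixed slope, and attention logits are penalised linearly with query–key distance, so no positional embeddings are needed. The slope schedule shown is the one from the ALiBi paper for power-of-two head counts; BLOOM's released code generalises it to its 112 heads.

```python
def alibi_slopes(n_heads):
    # Per-head slopes m_h = 2^(-8h / n_heads) for h = 1..n_heads
    # (the ALiBi schedule for a power-of-two number of heads).
    return [2.0 ** (-8.0 * h / n_heads) for h in range(1, n_heads + 1)]

def alibi_bias(seq_len, slope):
    # Bias added to causal attention logits: for query position i and
    # key position j <= i, the penalty is slope * -(i - j).
    # The diagonal (j == i) gets zero bias.
    return [[slope * -(i - j) for j in range(i + 1)] for i in range(seq_len)]
```

Because the penalty is a simple linear function of distance rather than a learned embedding, models trained this way can be run on sequences longer than those seen during training.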

Training was performed on the Jean Zay supercomputer at IDRIS/CNRS in France, with 384 NVIDIA A100 80GB GPUs for 117 days, using a computational budget provided by GENCI (Grand Équipement National de Calcul Intensif).
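As a quick sanity check on the scale of that run, the figures above imply roughly a million GPU-hours:

```python
# Back-of-envelope compute estimate from the training figures above.
gpus = 384        # NVIDIA A100 80GB
days = 117
gpu_hours = gpus * days * 24
print(f"{gpu_hours:,} GPU-hours")  # → 1,078,272 GPU-hours
```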

The ROOTS training dataset covers 1.6 TB of text in 46 languages, with documented composition and provenance.

The Responsible AI Licence

BLOOM is released under the Responsible AI Licence (RAIL), a licence that includes usage restrictions for specific categories (for example mass surveillance, disinformation, generation of illegal content). It is therefore not an Open Source Initiative (OSI) approved licence in the strict sense — the “open access” concept adopted by BigScience is distinct from traditional “open source”.

Usage restrictions are listed in an annex to the licence and apply to model use, not to code redistribution.

Legacy

BLOOM set a precedent for multinational collaborative research on frontier LLMs. BLOOMZ variants (fine-tuned for multilingual instruction-following) were released later. The project directly influenced subsequent open-model initiatives such as OLMo and contributed to the growth of the Hugging Face ecosystem.

Link: huggingface.co/bigscience/bloom
