OLMo: the 'truly open' model from Allen Institute for AI

AI2 releases OLMo on 1 February 2024: weights, training code, Dolma dataset (3T tokens), intermediate checkpoints and training logs under Apache 2.0. Full reproducibility.

Tags: Open Source · AI · OLMo · AllenAI · LLM · Reproducibility

Reproducibility as a release goal

On 1 February 2024 the Allen Institute for AI (AI2) releases OLMo (Open Language Model) in 1B and 7B variants. Unlike most LLMs described as “open”, which ship only the final weights, OLMo is distributed with every artefact needed to fully reproduce training:

  • Final model weights
  • Complete training code (in-house framework, PyTorch-based)
  • Dolma dataset (3 trillion tokens)
  • Intermediate checkpoints at regular intervals during training
  • Training logs with metrics, loss curves and exact configurations
  • Documentation of the data curation process

The licence is Apache 2.0 for weights and code; ODC-BY for the dataset.

The Dolma dataset

Dolma is a 3-trillion-token dataset built by AI2 from public sources: Common Crawl, The Stack (code), Reddit, arXiv, Wikipedia, Project Gutenberg and Semantic Scholar. The filtering, deduplication and sensitive-content removal pipeline is fully documented, and the preprocessing code is available as a Python package (dolma).

Dolma solves a recurring problem of previous “open” models: the impossibility of verifying what is actually in the training data and reproducing results without access to proprietary datasets.
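
Dolma's actual pipeline (documented in the `dolma` package) uses more sophisticated machinery, but the core idea of document-level deduplication can be illustrated with a minimal, hypothetical sketch: hash a normalized form of each document and keep only the first occurrence of each hash.

```python
import hashlib


def normalize(text: str) -> str:
    # Collapse whitespace and lowercase so trivially different
    # copies of the same document hash identically.
    return " ".join(text.lower().split())


def dedup(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each distinct (normalized) document."""
    seen: set[str] = set()
    kept: list[str] = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


corpus = [
    "OLMo is an open language model.",
    "OLMo  is an open  language model.",  # near-duplicate: extra whitespace
    "Dolma has 3 trillion tokens.",
]
print(dedup(corpus))  # the whitespace variant is dropped
```

At trillion-token scale the `seen` set would be replaced by a probabilistic structure such as a Bloom filter, trading a small false-positive rate for bounded memory.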

Architecture and training

OLMo adopts a standard transformer architecture with RoPE and non-parametric layer norm. The two sizes:

  • OLMo 1B — trained on 3T tokens
  • OLMo 7B — trained on 2.5T tokens, 2048-token context window
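
As a rough illustration of the two architectural choices mentioned above — this is a dependency-free sketch, not OLMo's actual PyTorch implementation — non-parametric layer norm is plain normalization with no learnable scale or bias, and RoPE rotates consecutive pairs of query/key dimensions by position-dependent angles:

```python
import math


def layer_norm_nonparametric(x: list[float], eps: float = 1e-5) -> list[float]:
    """LayerNorm without learnable gamma/beta: just (x - mean) / std."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def rope(q: list[float], position: int, base: float = 10000.0) -> list[float]:
    """Rotate each (even, odd) pair of dimensions by a position-dependent angle."""
    out = list(q)
    d = len(q)
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = q[i] * c - q[i + 1] * s
        out[i + 1] = q[i] * s + q[i + 1] * c
    return out


x = [1.0, 2.0, 3.0, 4.0]
print(layer_norm_nonparametric(x))  # zero mean, ~unit variance
print(rope(x, position=3))          # same vector, rotated per-pair
```

Dropping the learnable affine parameters from layer norm is one of the stability-oriented choices AI2 documents in the training configurations shipped with the release.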

Published benchmarks show performance comparable to Llama 2 7B on main zero-shot metrics. The stated goal of the project is not to push the state of the art but to provide the scientific community with a complete base for studying LLM training mechanisms.

OLMo 2 and evolution

In November 2024 AI2 releases OLMo 2 in 7B and 13B variants. The new training mix improves quality and stability, and the model closes the gap on benchmarks against Llama 3.1 8B while remaining fully open in weights, code and data.

OLMo sets a transparency standard matched by few other model families (BLOOM, EleutherAI’s Pythia and LLM360’s Amber among them). It is a reference tool for academic LLM research.

Link: allenai.org/olmo
