Reproducibility as a release goal
On 1 February 2024 the Allen Institute for AI (AI2) releases OLMo (Open Language Model) in 1B and 7B variants. Unlike most LLMs described as “open”, which publish only the final weights, OLMo is distributed with every artefact needed to reproduce training in full:
- Final model weights
- Complete training code (in-house framework, PyTorch-based)
- Dolma dataset (3 trillion tokens)
- Intermediate checkpoints at regular intervals during training
- Training logs with metrics, loss curves and exact configurations
- Documentation of the data curation process
Weights and code are licensed under Apache 2.0; the dataset under ODC-BY.
The Dolma dataset
Dolma is a 3-trillion-token dataset built by AI2 by aggregating public sources: Common Crawl, The Stack (code), Reddit, arXiv, Wikipedia, Project Gutenberg, Semantic Scholar. The pipeline for filtering, deduplication and sensitive-content removal is fully documented, and the preprocessing code is available as a Python package (dolma).
Dolma addresses a recurring problem with earlier “open” models: without access to the (often proprietary) training data, it is impossible to verify what the model was actually trained on or to reproduce its results.
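As a toy illustration of the exact-deduplication step such a pipeline performs (a sketch only, not the actual dolma package API; the real pipeline also does paragraph-level and fuzzy deduplication at scale):

```python
import hashlib

def dedupe(docs):
    """Drop exact-duplicate documents by hashing whitespace-normalized,
    lowercased text. Keeps the first occurrence of each document."""
    seen, unique = set(), []
    for doc in docs:
        # Normalize whitespace and case so trivial variants collide
        key = hashlib.sha256(" ".join(doc.split()).lower().encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

# "Hello world" and "hello  world" normalize to the same key
print(dedupe(["Hello world", "hello  world", "bye"]))  # ['Hello world', 'bye']
```

At corpus scale the set of hashes is typically kept in a Bloom filter or sharded key-value store rather than an in-memory Python set.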
Architecture and training
OLMo adopts a standard decoder-only transformer architecture with rotary position embeddings (RoPE) and non-parametric layer norm (no learnable gain or bias). Two sizes are released:
- OLMo 1B — trained on 3T tokens
- OLMo 7B — trained on 2.5T tokens, with a 2048-token context window
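The two architectural choices named above can be sketched in a few lines of NumPy (an illustrative sketch, not OLMo's actual PyTorch code):

```python
import numpy as np

def rotary_embed(x, positions, base=10000.0):
    """Apply rotary position embeddings (RoPE) to a (seq, dim) array.

    Channel pairs (i, i + dim/2) are rotated by an angle that grows with
    position and shrinks with channel index, so query-key dot products
    depend only on relative position."""
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)      # per-pair rotation speed
    angles = positions[:, None] * freqs[None, :]   # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

def nonparametric_layernorm(x, eps=1e-5):
    """Layer norm without learnable gain or bias: just standardize
    each vector along the last axis."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)
```

Note that RoPE is a pure rotation: it preserves vector norms, and position 0 leaves the input unchanged.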
Published benchmarks show performance comparable to Llama 2 7B on the main zero-shot metrics. The project's stated goal is not to push the state of the art but to give the scientific community a complete base for studying the mechanics of LLM training.
OLMo 2 and evolution
In November 2024 AI2 releases OLMo 2 in 7B and 13B variants. A new training mix improves quality and stability, and the models close the benchmark gap with Llama 3.1 8B while remaining fully open in weights, code and data.
OLMo sets a transparency standard matched by only a few other model families (BLOOM, EleutherAI's Pythia, LLM360's Amber). It is a reference tool for academic LLM research.
Link: allenai.org/olmo
