Reproducibility as a release goal
On 1 February 2024 the Allen Institute for AI (AI2) releases OLMo (Open Language Model) in 1B and 7B variants. Unlike most LLMs described as “open”, which publish only the final weights, OLMo is distributed with every artefact needed to reproduce training in full:
- Final model weights
- Complete training code (in-house framework, PyTorch-based)
- Dolma dataset (3 trillion tokens)
- Intermediate checkpoints at regular intervals during training
- Training logs with metrics, loss curves and exact configurations
- Documentation of the data curation process
Weights and code are licensed under Apache 2.0; the dataset under ODC-BY.
The Dolma dataset
Dolma is a 3-trillion-token dataset built by AI2 by aggregating public sources: Common Crawl, The Stack (code), Reddit, arXiv, Wikipedia, Project Gutenberg, Semantic Scholar. The pipeline for filtering, deduplication and sensitive-content removal is fully documented, and the preprocessing code is available as a Python package (dolma).
Dolma addresses a recurring problem with earlier “open” models: without access to the (often proprietary) training data, it is impossible to verify what the model was actually trained on or to reproduce its results.
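As a toy illustration of the exact-deduplication step such a pipeline performs (a sketch only, not the actual dolma package API; the real pipeline also does paragraph-level and fuzzy deduplication at scale):

```python
import hashlib

def dedupe(docs):
    """Drop exact-duplicate documents by hashing whitespace-normalized,
    lowercased text. Keeps the first occurrence of each document."""
    seen, unique = set(), []
    for doc in docs:
        # Normalize whitespace and case so trivial variants collide
        key = hashlib.sha256(" ".join(doc.split()).lower().encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

# "Hello world" and "hello  world" normalize to the same key
print(dedupe(["Hello world", "hello  world", "bye"]))  # ['Hello world', 'bye']
```

At corpus scale the set of hashes is typically kept in a Bloom filter or sharded key-value store rather than an in-memory Python set.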
Architecture and training
OLMo adopts a standard decoder-only transformer architecture with rotary position embeddings (RoPE) and non-parametric layer norm (no learnable gain or bias). Two sizes are released:
- OLMo 1B — trained on 3T tokens
- OLMo 7B — trained on 2.5T tokens, with a 2048-token context window
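The two architectural choices named above can be sketched in a few lines of NumPy (an illustrative sketch, not OLMo's actual PyTorch code):

```python
import numpy as np

def rotary_embed(x, positions, base=10000.0):
    """Apply rotary position embeddings (RoPE) to a (seq, dim) array.

    Channel pairs (i, i + dim/2) are rotated by an angle that grows with
    position and shrinks with channel index, so query-key dot products
    depend only on relative position."""
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)      # per-pair rotation speed
    angles = positions[:, None] * freqs[None, :]   # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

def nonparametric_layernorm(x, eps=1e-5):
    """Layer norm without learnable gain or bias: just standardize
    each vector along the last axis."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)
```

Note that RoPE is a pure rotation: it preserves vector norms, and position 0 leaves the input unchanged.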
Published benchmarks show performance comparable to Llama 2 7B on the main zero-shot metrics. The project's stated goal is not to push the state of the art but to give the scientific community a complete base for studying the mechanics of LLM training.
OLMo 2 and evolution
In November 2024 AI2 releases OLMo 2 in 7B and 13B variants. A new training mix improves quality and stability, and the models close the benchmark gap with Llama 3.1 8B while remaining fully open in weights, code and data.
OLMo sets a transparency standard matched by only a few other model families (BLOOM, EleutherAI's Pythia, LLM360's Amber). It is a reference tool for academic LLM research.
Link: allenai.org/olmo
