Falcon: the open source model from the Technology Innovation Institute

TII (Abu Dhabi) releases Falcon 40B and 7B in May 2023, then Falcon 180B in September 2023. Apache 2.0 licence from June 2023, RefinedWeb dataset, multi-query attention.

Tags: Open Source, AI, Falcon, TII, LLM, UAE

A model from the Emirates

In May 2023 the Technology Innovation Institute (TII), a government research centre in Abu Dhabi, releases Falcon 40B and Falcon 7B, two LLMs that top Hugging Face public leaderboards at release. The initial licence required royalties on commercial use beyond a certain revenue threshold; in June 2023 TII updates the licence to pure Apache 2.0, removing every restriction.

Falcon 180B follows in September 2023, with 180 billion parameters trained on 3.5 trillion tokens: at release, it is the largest open source LLM available.

The RefinedWeb dataset

One of Falcon’s distinctive choices is the construction of the pre-training dataset. RefinedWeb is a dataset of roughly 5 trillion tokens derived from CommonCrawl through an aggressive filtering pipeline: fuzzy deduplication, removal of low-quality content, URL filters and text normalisation. TII has released a subset of 600 billion tokens of RefinedWeb as a contribution to the research community.
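Fuzzy deduplication of this kind is commonly built on MinHash over word shingles, which estimates the Jaccard similarity between documents without comparing them word by word. The sketch below is illustrative (the paper describes the pipeline at a much larger scale, with locality-sensitive hashing on top); the shingle size and hash count are arbitrary choices, not RefinedWeb's actual parameters.

```python
import hashlib

def shingles(text, n=5):
    """Split text into overlapping word n-grams (shingles)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash_signature(items, num_hashes=64):
    """Approximate a set by its minimum hash value under several seeded hashes."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(),
                "big",
            )
            for s in items
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching minhashes estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Two near-duplicate pages differing only in the last word.
doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the river bend"
sig_a = minhash_signature(shingles(doc_a))
sig_b = minhash_signature(shingles(doc_b))
# Near-duplicates score close to 1.0 and one copy would be dropped by the filter.
print(round(estimated_jaccard(sig_a, sig_b), 2))
```

A production pipeline would bucket signatures with locality-sensitive hashing so that only candidate pairs are compared, rather than scoring every pair directly.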

The accompanying paper argues that properly filtered web data can replace curated corpora (books, code, Wikipedia) as the primary pre-training source, simplifying the pipeline and increasing scalability.

Architecture

Falcon uses a decoder-only transformer architecture with a few specific choices:

  • Multi-query attention (MQA) — sharing of a single key-value pair across all query heads, reducing memory and bandwidth during inference
  • Rotary positional embeddings (RoPE) — relative positioning that allows extrapolation beyond training length
  • Parallel attention + MLP — simultaneous computation of attention and feed-forward blocks for GPU parallelisation
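The key idea of multi-query attention can be sketched in a few lines of NumPy: the query projection still produces one head per query, but the key and value projections produce a single head shared by all of them, so the inference-time KV cache shrinks by a factor of the head count. Dimensions and weights below are illustrative, not Falcon's actual configuration.

```python
import numpy as np

def multi_query_attention(x, wq, wk, wv, num_heads):
    """Attention where every query head shares a single key/value head.

    x: (seq, d_model); wq: (d_model, num_heads * d_head);
    wk, wv: (d_model, d_head) -- one shared KV head instead of num_heads.
    """
    seq, _ = x.shape
    d_head = wk.shape[1]
    q = (x @ wq).reshape(seq, num_heads, d_head)  # one query per head
    k = x @ wk                                    # shared across all heads
    v = x @ wv                                    # shared across all heads
    out = np.empty_like(q)
    for h in range(num_heads):
        scores = q[:, h, :] @ k.T / np.sqrt(d_head)           # (seq, seq)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)        # softmax
        out[:, h, :] = weights @ v
    return out.reshape(seq, num_heads * d_head)

rng = np.random.default_rng(0)
seq, d_model, num_heads, d_head = 8, 32, 4, 8
x = rng.standard_normal((seq, d_model))
y = multi_query_attention(
    x,
    rng.standard_normal((d_model, num_heads * d_head)),
    rng.standard_normal((d_model, d_head)),
    rng.standard_normal((d_model, d_head)),
    num_heads,
)
# The KV cache holds one (seq, d_head) pair instead of num_heads of them.
print(y.shape)  # (8, 32)
```

With standard multi-head attention the cache would hold `num_heads` key/value pairs per token; sharing a single pair is what cuts memory traffic during autoregressive decoding.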

Training was performed on AWS clusters with up to 4096 A100 GPUs.

Successors

In March 2024 TII releases Falcon 2 (11B), followed in 2024 by the Falcon 3 line with 1B, 3B, 7B and 10B variants in base, instruct and mamba formats. The most recent models adopt the TII Falcon License 2.0, compatible with standard commercial use but with responsible-use clauses.

Falcon represents one of the few examples of a frontier open source model developed outside the US-Europe-China axis, and it has helped diversify the geography of the open LLM ecosystem.

Link: falconllm.tii.ae
