A model from the Emirates
In May 2023 the Technology Innovation Institute (TII), a government research centre in Abu Dhabi, released Falcon 40B and Falcon 7B, two LLMs that topped Hugging Face's public leaderboards at release. The initial licence required royalties on commercial use beyond a certain revenue threshold; in June 2023 TII switched the licence to plain Apache 2.0, removing all commercial restrictions.
Falcon 180B followed in September 2023, with 180 billion parameters trained on 3.5 trillion tokens: at release, it was the largest openly available LLM.
The RefinedWeb dataset
One of Falcon's distinctive choices is how the pre-training dataset was built. RefinedWeb is a dataset of roughly 5 trillion tokens derived from CommonCrawl through an aggressive filtering pipeline: fuzzy deduplication, removal of low-quality content, URL filtering and text normalisation. TII released a 600-billion-token subset of RefinedWeb as a contribution to the research community.
The accompanying paper argues that properly filtered web data can replace curated corpora (books, code, Wikipedia) as the primary pre-training source, simplifying the pipeline and increasing scalability.
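The fuzzy-deduplication step is MinHash-based: each document is reduced to a small signature of hashed word n-grams, and documents whose signatures largely agree are treated as near-duplicates. A minimal stdlib sketch of the idea (illustrative only, not TII's actual pipeline; function names and parameters are assumptions):

```python
import hashlib

def shingles(text, n=5):
    """Split text into overlapping word n-grams (the unit compared for overlap)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash_signature(text, num_hashes=64):
    """Summarise a document as the minimum shingle hash under num_hashes seeds."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles(text)
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching slots estimates the shingle-set Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Two documents differing by one word score close to 1, while unrelated documents score near 0, so near-duplicates can be dropped without exact string matching.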
Architecture
Falcon uses a decoder-only transformer architecture with a few specific choices:
- Multi-query attention (MQA) — sharing of a single key-value pair across all query heads, reducing memory and bandwidth during inference
- Rotary positional embeddings (RoPE) — relative positioning that allows extrapolation beyond training length
- Parallel attention + MLP — simultaneous computation of attention and feed-forward blocks for GPU parallelisation
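Multi-query attention is the choice with the most direct inference impact: because all query heads read the same key-value pair, the KV cache shrinks by a factor equal to the number of heads. A minimal NumPy sketch of the mechanism (shapes and names are illustrative, not Falcon's implementation; RoPE and the parallel MLP are omitted for brevity):

```python
import numpy as np

def multi_query_attention(x, wq, wk, wv, n_heads):
    """Causal self-attention where n_heads query heads share one K/V head."""
    seq, d_model = x.shape
    d_head = d_model // n_heads
    q = (x @ wq).reshape(seq, n_heads, d_head)  # per-head queries
    k = x @ wk                                  # single shared K: (seq, d_head)
    v = x @ wv                                  # single shared V: (seq, d_head)
    future = np.triu(np.ones((seq, seq), dtype=bool), 1)  # positions to mask
    outs = []
    for h in range(n_heads):
        scores = q[:, h, :] @ k.T / np.sqrt(d_head)       # (seq, seq)
        scores = np.where(future, -1e9, scores)           # causal mask
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)    # softmax per row
        outs.append(weights @ v)
    return np.concatenate(outs, axis=-1)                  # (seq, d_model)
```

At inference only one `(seq, d_head)` K and V tensor must be cached per layer, instead of one per head as in standard multi-head attention.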
Training was performed on AWS clusters with up to 4096 A100 GPUs.
Successors
In May 2024 TII released Falcon 2 (11B), followed later that year by the Falcon 3 line with 1B, 3B, 7B and 10B variants in base, instruct and mamba formats. The most recent models adopt the TII Falcon License 2.0, compatible with standard commercial use but with responsible-use clauses.
Falcon is one of the few examples of a frontier open source model developed outside the US-Europe-China axis, and it has helped diversify the geography of the open LLM ecosystem.
Link: falconllm.tii.ae
