A model from the Emirates
In May 2023 the Technology Innovation Institute (TII), a government research centre in Abu Dhabi, released Falcon 40B and Falcon 7B, two LLMs that topped Hugging Face's public leaderboards at release. The initial licence required royalties on commercial use beyond a certain revenue threshold; in June 2023 TII switched the licence to Apache 2.0, removing the royalty requirement.
The RefinedWeb dataset
One of Falcon’s distinctive choices is the construction of the pre-training dataset. RefinedWeb is a dataset of roughly 5 trillion tokens derived from CommonCrawl through an aggressive filtering pipeline: fuzzy deduplication, removal of low-quality content, URL filters and text normalisation. TII has released a subset of 600 billion tokens of RefinedWeb as a contribution to the research community.
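The stages named above can be illustrated at toy scale. The following is a minimal sketch, not TII's pipeline: the blocklist, the word-count threshold and the similarity cutoff are hypothetical values, and exact Jaccard similarity over word shingles stands in for the approximate techniques that fuzzy deduplication needs at web scale.

```python
import re
import unicodedata
from urllib.parse import urlparse

BLOCKED_DOMAINS = {"spam.example"}  # hypothetical URL blocklist
MIN_WORDS = 5                       # hypothetical quality threshold
SIMILARITY_THRESHOLD = 0.8          # hypothetical near-duplicate cutoff


def normalise(text: str) -> str:
    """Apply Unicode NFC normalisation and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", unicodedata.normalize("NFC", text)).strip()


def shingles(text: str, n: int = 3) -> set:
    """Return the set of n-word shingles of a lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    return len(a & b) / len(a | b) if a | b else 1.0


def refine(docs):
    """Filter (url, text) pairs and drop near-duplicates, keeping the first seen."""
    kept, kept_shingles = [], []
    for url, text in docs:
        if urlparse(url).hostname in BLOCKED_DOMAINS:
            continue  # URL filter
        text = normalise(text)
        if len(text.split()) < MIN_WORDS:
            continue  # crude low-quality filter
        sh = shingles(text)
        if any(jaccard(sh, other) >= SIMILARITY_THRESHOLD for other in kept_shingles):
            continue  # fuzzy duplicate of a document already kept
        kept.append((url, text))
        kept_shingles.append(sh)
    return kept


docs = [
    ("https://a.example/1", "The quick brown fox jumps over the lazy dog near the river bank today"),
    ("https://b.example/2", "The quick  brown fox jumps over the lazy dog near the river bank today!"),
    ("https://spam.example/x", "Buy cheap pills now with free shipping to every country on earth"),
    ("https://c.example/3", "Too short"),
    ("https://d.example/4", "Rotary embeddings encode token positions as rotations of query and key vectors"),
]
print([url for url, _ in refine(docs)])
```

The pairwise comparison is quadratic in the number of documents kept, which is why a real pipeline would replace it with an approximate, hash-based scheme.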
The accompanying paper argues that properly filtered web data can replace curated corpora (books, code, Wikipedia) as the primary pre-training source, simplifying the pipeline and increasing scalability.
Architecture
Falcon uses a decoder-only transformer architecture with a few specific choices:
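- Multi-query attention (MQA): a single key head and a single value head are shared across all query heads, which shrinks the key-value cache and reduces memory and bandwidth during inference
- Rotary positional embeddings (RoPE): position is encoded by rotating query and key vectors, so attention scores depend on the relative offset between tokens; extrapolation beyond the training length remains limited in practice
- Parallel attention + MLP: the attention and feed-forward blocks of each layer read the same input and their outputs are summed, so the two can be computed concurrently on the GPU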
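The three choices above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not Falcon's implementation: the layer and head counts are hypothetical round numbers, the RoPE variant pairs adjacent vector elements with base 10000, and the parallel block assumes a single shared normalisation.

```python
import math


def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # The factor 2 covers the separate key and value tensors.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


def rope(vec, pos, base=10000.0):
    """Rotate adjacent pairs of vec by angles proportional to pos."""
    out, d = [], len(vec)
    for i in range(0, d, 2):
        angle = pos * base ** (-i / d)
        c, s = math.cos(angle), math.sin(angle)
        out += [vec[i] * c - vec[i + 1] * s, vec[i] * s + vec[i + 1] * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def sequential_block(x, norm1, attn, norm2, mlp):
    """Classic layout: the MLP has to wait for the attention output."""
    h = add(x, attn(norm1(x)))
    return add(h, mlp(norm2(h)))


def parallel_block(x, norm, attn, mlp):
    """Parallel layout: attention and MLP read the same normalised input."""
    n = norm(x)
    return add(add(x, attn(n)), mlp(n))


# Hypothetical configuration chosen for round numbers (not Falcon's), fp16 values.
LAYERS, HEADS, HEAD_DIM, SEQ_LEN = 32, 64, 64, 2048
mha = kv_cache_bytes(LAYERS, HEADS, HEAD_DIM, SEQ_LEN)  # one KV head per query head
mqa = kv_cache_bytes(LAYERS, 1, HEAD_DIM, SEQ_LEN)      # one shared KV head
print(f"MHA cache: {mha // 2**20} MiB, MQA cache: {mqa // 2**20} MiB")

# RoPE: the query-key score depends only on the offset between positions.
q, k = [1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0]
print(math.isclose(dot(rope(q, 5), rope(k, 3)), dot(rope(q, 12), rope(k, 10)), abs_tol=1e-9))
```

With these numbers the cache shrinks by a factor equal to the number of query heads, which is the saving MQA targets. In `parallel_block` neither sub-layer depends on the other's output, which is what allows them to be scheduled together.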
Training was performed on AWS; the Falcon 40B model card reports 384 A100 40GB GPUs.
Significance
Falcon is one of the first frontier open-source models developed outside the US-Europe-China axis, helping to diversify the geography of the open LLM ecosystem.
Link: falconllm.tii.ae