Falcon: the Open Source model from the Technology Innovation Institute

TII (Abu Dhabi) released Falcon 40B and 7B in May 2023, followed by Falcon 180B in September 2023. Highlights: an Apache 2.0 licence for the 40B and 7B models from June 2023, the RefinedWeb dataset and multi-query attention.


A model from the Emirates

In May 2023 the Technology Innovation Institute (TII), a government research centre in Abu Dhabi, released Falcon 40B and Falcon 7B, two LLMs that topped Hugging Face public leaderboards at release. The initial licence required royalties on commercial use beyond a certain revenue threshold; in June 2023 TII switched both models to a plain Apache 2.0 licence, removing the royalty requirement on commercial use.

The RefinedWeb dataset

One of Falcon’s distinctive choices is the construction of the pre-training dataset. RefinedWeb is a dataset of roughly 5 trillion tokens derived from CommonCrawl through an aggressive filtering pipeline: fuzzy deduplication, removal of low-quality content, URL filters and text normalisation. TII has released a subset of 600 billion tokens of RefinedWeb as a contribution to the research community.
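
The filtering stages listed above can be sketched in a few lines of Python. This is a toy illustration, not TII's pipeline: the blocked domain, the word-count and repetition thresholds, the similarity threshold and all function names are hypothetical, and exact Jaccard similarity over word shingles stands in for the approximate fuzzy-deduplication methods (such as MinHash) that a web-scale pipeline would need.

```python
import re
import unicodedata
from urllib.parse import urlparse

# Hypothetical values, chosen only for illustration.
BLOCKED_DOMAINS = {"spam.example"}
MIN_WORDS = 5
MIN_UNIQUE_RATIO = 0.3
NEAR_DUP_THRESHOLD = 0.8
SHINGLE_SIZE = 3


def normalise(text):
    """Apply Unicode NFC normalisation and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def url_allowed(url):
    """URL filter: reject pages hosted on a blocked domain."""
    host = urlparse(url).hostname or ""
    return host not in BLOCKED_DOMAINS


def looks_low_quality(text):
    """Crude quality heuristic: too short, or dominated by repeated words."""
    words = text.split()
    if len(words) < MIN_WORDS:
        return True
    return len(set(words)) / len(words) < MIN_UNIQUE_RATIO


def shingles(text):
    """Return the set of overlapping word n-grams of a document."""
    words = text.lower().split()
    if len(words) < SHINGLE_SIZE:
        return {tuple(words)}
    return {
        tuple(words[i:i + SHINGLE_SIZE])
        for i in range(len(words) - SHINGLE_SIZE + 1)
    }


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def refine(pages):
    """Filter (url, raw_text) pairs and drop near-duplicates of kept pages."""
    kept, kept_shingles = [], []
    for url, raw in pages:
        if not url_allowed(url):
            continue
        text = normalise(raw)
        if looks_low_quality(text):
            continue
        current = shingles(text)
        if any(jaccard(current, seen) >= NEAR_DUP_THRESHOLD
               for seen in kept_shingles):
            continue
        kept.append(text)
        kept_shingles.append(current)
    return kept


if __name__ == "__main__":
    sample = [
        ("https://a.example/1", "The quick  brown fox jumps over the lazy dog near the river bank today"),
        ("https://b.example/2", "The quick brown fox jumps over the lazy dog near the river bank today!"),
        ("https://c.example/3", "buy buy buy buy buy buy buy buy buy buy"),
    ]
    for text in refine(sample):
        print(text)
```

Comparing each new page against every page already kept is quadratic in the number of pages. That is acceptable for a demonstration but is exactly why large pipelines rely on approximate signatures instead.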

The accompanying paper argues that properly filtered web data can replace curated corpora (books, code, Wikipedia) as the primary pre-training source, simplifying the pipeline and increasing scalability.

Architecture

Falcon uses a decoder-only transformer architecture with a few specific choices:

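  • Multi-query attention (MQA): a single key head and a single value head are shared across all query heads, which shrinks the key-value cache and cuts memory bandwidth during inference
  • Rotary positional embeddings (RoPE): position is encoded by rotating query and key vectors, so attention scores depend on the relative offset between tokens, a property that can aid, but does not guarantee, extrapolation beyond the training length
  • Parallel attention + MLP: the attention and feed-forward blocks of each layer are computed side by side from the same input rather than one after the other, which improves GPU parallelism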
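
Two of these choices lend themselves to a quick numerical check. The sketch below first estimates key-value cache size with and without shared key-value heads, then verifies that a rotary embedding makes the query-key dot product depend only on the offset between positions. The layer count, head count, head dimension and sequence length are hypothetical round numbers, not Falcon's actual configuration, and the interleaved pairing used in `rope` is one common convention that real implementations do not all share.

```python
import math


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Size of the key-value cache: one key and one value tensor per layer."""
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of vec by position-dependent angles."""
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, dim, 2):
        # Pair k = i / 2 uses frequency base ** (-2k / dim).
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical model shape (assumption, not Falcon's real numbers).
N_LAYERS, N_QUERY_HEADS, HEAD_DIM, SEQ_LEN = 32, 64, 64, 2048

# Standard multi-head attention keeps one key-value head per query head.
mha_cache = kv_cache_bytes(N_LAYERS, N_QUERY_HEADS, HEAD_DIM, SEQ_LEN)
# Multi-query attention keeps a single key-value head for all query heads.
mqa_cache = kv_cache_bytes(N_LAYERS, 1, HEAD_DIM, SEQ_LEN)

# Same offset (3) at two different absolute positions.
q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.4]
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 105), rope(k, 102))

if __name__ == "__main__":
    print("MHA cache bytes:", mha_cache)
    print("MQA cache bytes:", mqa_cache)
    print("cache reduction factor:", mha_cache // mqa_cache)
    print("scores match:", math.isclose(score_near, score_far, abs_tol=1e-9))
```

With these toy numbers the cache shrinks by a factor equal to the number of query heads, since only the count of key-value heads changes. The two attention scores agree because rotating the query by position m and the key by position n leaves a net rotation of m - n in each pair.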

Training was performed on AWS clusters with up to 4096 A100 GPUs.

Significance

Falcon is one of the first frontier Open Source models developed outside the US-Europe-China axis, and it helps diversify the geography of the open LLM ecosystem.

Link: falconllm.tii.ae
