StarCoder: the open code LLM from the BigCode project

BigCode releases StarCoder 15.5B on 4 May 2023: 8K context, fill-in-the-middle, training on The Stack with opt-out requests honoured, BigCode OpenRAIL-M v1 licence.


An open and responsible code LLM

On 4 May 2023, the BigCode project, a joint initiative by Hugging Face and ServiceNow Research, released StarCoder, a code language model with 15.5 billion parameters. StarCoder is distributed under the BigCode OpenRAIL-M v1 licence, a Responsible AI licence that allows commercial use while retaining a set of ethics-based usage restrictions.

The BigCode project was created with the stated goal of producing code models that are open, traceable and built with respect for the rights of source code authors. Unlike most other code models, StarCoder comes with its weights, datasets, training process and evaluation tools fully published.

Architecture and capabilities

StarCoder has a context window of 8,192 tokens, significantly longer than that of most contemporary code LLMs, and supports Fill-in-the-Middle (FIM) mode, in which the model generates a missing section of code given both the text before it (the prefix) and the text after it (the suffix). This capability makes it particularly well suited to integration into code editors, where completion typically happens in the middle of a file rather than only at the end.
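As a rough illustration of how an editor integration might prepare a FIM request, the sketch below splits a buffer at the cursor and arranges the two halves around sentinel tokens. The `<fim_prefix>`, `<fim_suffix>` and `<fim_middle>` tokens follow the format described on the StarCoder model card; the helper names `split_at_cursor` and `build_fim_prompt` are hypothetical, and the generation call itself is left out.

```python
# Sentinel tokens as described on the StarCoder model card.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def split_at_cursor(source: str, cursor: int) -> tuple[str, str]:
    """Split an editor buffer into the text before and after the cursor."""
    if not 0 <= cursor <= len(source):
        raise ValueError("cursor is outside the buffer")
    return source[:cursor], source[cursor:]


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Arrange prefix and suffix so the model is asked for the missing middle."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


# Hypothetical editor buffer: the function body is still empty.
buffer = "def add(a, b):\n    \n\nprint(add(1, 2))\n"
cursor = buffer.index("    \n") + 4  # just after the body's indentation
prefix, suffix = split_at_cursor(buffer, cursor)
prompt = build_fim_prompt(prefix, suffix)
```

The resulting `prompt` would be passed to the model as ordinary input text; whatever the model generates after `<fim_middle>` is the candidate text to insert at the cursor.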

Training was performed on The Stack, a dataset built from public repositories with permissive licences, using source code in more than 80 programming languages. BigCode implemented a formal opt-out process: developers can request removal of their code from the dataset, and the request is honoured in subsequent versions of the corpus.
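Conceptually, honouring opt-outs amounts to filtering the corpus against a list of removal requests before each new release. The following is only a minimal sketch of that idea and not BigCode's actual pipeline: the `SourceFile` record, its fields and the assumption that requests are keyed by repository owner are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceFile:
    repo: str     # hypothetical "owner/name" identifier
    path: str
    content: str


def apply_opt_outs(files, opted_out_owners):
    """Keep only files whose repository owner has not requested removal."""
    blocked = {owner.lower() for owner in opted_out_owners}
    return [f for f in files if f.repo.split("/", 1)[0].lower() not in blocked]


# Hypothetical corpus and opt-out list.
corpus = [
    SourceFile("alice/tools", "main.py", "print('hi')"),
    SourceFile("bob/lib", "lib.rs", "fn main() {}"),
]
kept = apply_opt_outs(corpus, ["Bob"])
```

Here `kept` contains only the file from `alice/tools`, since the owner `bob` appears (case-insensitively) in the opt-out list.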

StarCoder2 and project evolution

On 29 February 2024, BigCode released StarCoder2 in three sizes (3B, 7B and 15B) trained on The Stack v2, an expanded dataset covering more than 600 languages. The StarCoder2 model weights are distributed under the BigCode OpenRAIL-M v1 licence, which retains the responsible-use clauses, while the accompanying source code is published under Apache 2.0.

Licence and implications

BigCode OpenRAIL-M v1 allows commercial use, redistribution and modification, while restricting specific categories of use (disinformation, unlawful surveillance, harm to people). For the software development ecosystem, StarCoder has served as a reference point: an open code model built on a verifiable data supply chain.

Link: huggingface.co/bigcode
