An open and responsible code LLM
On 4 May 2023, the BigCode project, a joint initiative by Hugging Face and ServiceNow Research, released StarCoder, a code language model with 15.5 billion parameters. StarCoder is distributed under the BigCode Open RAIL-M v1 licence, a Responsible AI licence that allows commercial use while retaining some use-based restrictions on ethical grounds.
The BigCode project was created with the stated goal of producing code models that are open, traceable and built with respect for the rights of source code authors. Unlike many other code models, StarCoder comes with its weights, datasets, training process and evaluation tools all published.
Architecture and capabilities
StarCoder has a context window of 8,192 tokens, significantly larger than that of most contemporary code LLMs, and supports Fill-in-the-Middle (FIM) mode, which lets the model complete a section of code given both the text before it (the prefix) and the text after it (the suffix). This capability makes it particularly suited to integration into code editors, where completion typically happens in the middle of a file rather than only at the end.
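To make the FIM idea concrete, here is a minimal sketch of how an editor plugin might assemble a FIM prompt from the text around the cursor. The sentinel strings are assumed to match StarCoder's FIM special tokens and should be checked against the tokenizer shipped with the model; the helper names `split_at_cursor` and `build_fim_prompt` are hypothetical and exist only for this illustration.

```python
from typing import Tuple

# Assumed sentinel strings for StarCoder's FIM mode; verify them against
# the special tokens of the tokenizer you actually load.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def split_at_cursor(source: str, cursor: int) -> Tuple[str, str]:
    """Split editor text into the parts before and after the cursor."""
    if not 0 <= cursor <= len(source):
        raise ValueError("cursor is outside the document")
    return source[:cursor], source[cursor:]


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Lay out prefix and suffix so the model is asked to generate the middle."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


# A file with an empty function body; the cursor sits after the indentation.
source = "def add(a, b):\n    \n\nprint(add(1, 2))\n"
cursor = source.index("    ") + 4
prefix, suffix = split_at_cursor(source, cursor)
prompt = build_fim_prompt(prefix, suffix)
```

The resulting `prompt` string would then be tokenized and passed to the model, and whatever it generates after the final sentinel is inserted at the cursor. Prefix and suffix together must still fit within the 8,192-token context window.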
StarCoder was trained on The Stack, a dataset covering more than 80 programming languages and built from public repositories with permissive licences. BigCode implemented a formal opt-out process: developers can request removal of their code from the dataset, and the request is applied in subsequent versions of the corpus.
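As a purely illustrative sketch of what applying opt-outs to a new corpus version could look like, the snippet below drops every file belonging to a repository on an opt-out list. This is not BigCode's actual pipeline: the record fields (`repo`, `path`), the sample entries and the function `apply_opt_outs` are all hypothetical.

```python
from typing import Dict, Iterable, List, Set


def apply_opt_outs(
    records: Iterable[Dict[str, str]], opted_out: Set[str]
) -> List[Dict[str, str]]:
    """Keep only files whose repository is not on the opt-out list."""
    # Compare case-insensitively so "Alice/Parser" matches "alice/parser".
    blocked = {name.lower() for name in opted_out}
    return [r for r in records if r["repo"].lower() not in blocked]


# Hypothetical records standing in for one version of the corpus.
corpus_v1 = [
    {"repo": "alice/parser", "path": "src/lex.py"},
    {"repo": "bob/tools", "path": "cli.go"},
    {"repo": "alice/parser", "path": "src/ast.py"},
]
opt_out_requests = {"Alice/Parser"}

# The next version of the corpus is rebuilt without the opted-out repository.
corpus_v2 = apply_opt_outs(corpus_v1, opt_out_requests)
```

The earlier version is left untouched and the filter produces a new list, which mirrors the idea that removals take effect in subsequent releases of the dataset rather than retroactively.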
Licence and implications
BigCode Open RAIL-M v1 allows commercial use, redistribution and modification, while restricting specific categories of use (disinformation, unlawful surveillance, harm to people). For the software development ecosystem, StarCoder serves as a reference point: an open code model built on a verifiable data supply chain.
Link: huggingface.co/bigcode