Hugging Face and ServiceNow lead BigCode, a collaborative project that aims to responsibly develop large language models for code. They unveiled StarCoder, a large language model for code trained on permissively licensed data from GitHub spanning more than 80 programming languages, Git commits, GitHub issues, and Jupyter notebooks. On well-known programming benchmarks, StarCoderBase, a 15B-parameter model trained on 1 trillion tokens, matches or outperforms closed models such as OpenAI's code-cushman-001 as well as existing open code LLMs. StarCoder itself is a version of StarCoderBase fine-tuned on a further 35 billion Python tokens; with a context length of over 8,000 tokens, it can process more input than any other open LLM, enabling a variety of applications: serving as a technical assistant, auto-completing code, editing code via instructions, and explaining code snippets in natural language. StarCoder is released under a version of the OpenRAIL license (BigCode OpenRAIL-M), making it easier for companies to integrate the model into their products. On numerous benchmarks, StarCoder and StarCoderBase outperform other large language models such as PaLM, LaMDA, and LLaMA.
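
As a minimal sketch of the auto-completion use case, the snippet below loads the model through the Hugging Face transformers library and generates a completion for a function stub. It assumes the bigcode/starcoder checkpoint on the Hugging Face Hub (access requires accepting the license on the model page) and enough memory for a 15B-parameter model.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Checkpoint name on the Hugging Face Hub; downloading it requires
# accepting the BigCode OpenRAIL-M license on the model page.
checkpoint = "bigcode/starcoder"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Auto-complete a Python function stub.
inputs = tokenizer("def print_hello_world():", return_tensors="pt")
outputs = model.generate(inputs.input_ids, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Beyond plain left-to-right completion, the model also supports fill-in-the-middle generation via its <fim_prefix>, <fim_suffix>, and <fim_middle> sentinel tokens, which underpins the instruction-style code-editing use case mentioned above.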

StarCoder at Hugging Face