We now support `qwen-3-32b`.

We now support `llama-4-scout-17b-16e-instruct`.

We now support structured outputs when `strict` is set to true. This feature allows you to enforce consistent JSON outputs for models, which is useful when building applications that need to process AI-generated data programmatically.
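A minimal sketch of what a strict structured-output request might look like over raw HTTP. The `https://api.cerebras.ai/v1` base URL, the OpenAI-style `response_format` payload shape, and the example schema are assumptions for illustration, not details from this entry:

```python
import os
import requests

# Assumed OpenAI-compatible endpoint; check the API reference for the exact URL.
URL = "https://api.cerebras.ai/v1/chat/completions"
HEADERS = {"Authorization": f"Bearer {os.environ['CEREBRAS_API_KEY']}"}

# The JSON schema the model's output must conform to when strict is true.
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "year": {"type": "integer"},
    },
    "required": ["title", "year"],
    "additionalProperties": False,
}

payload = {
    "model": "llama-4-scout-17b-16e-instruct",
    "messages": [{"role": "user", "content": "Name one classic sci-fi film."}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {"name": "film", "strict": True, "schema": schema},
    },
}

resp = requests.post(URL, headers=HEADERS, json=payload, timeout=30)
resp.raise_for_status()
# With strict enforcement, the content parses as schema-conforming JSON.
print(resp.json()["choices"][0]["message"]["content"])
```
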
We now support `log_probs` and `top_log_probs` in the `chat/completions` endpoint.
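For illustration, a hedged sketch of a request that asks for token log probabilities. The parameter spellings below follow this entry; the endpoint URL and the shape of the returned log-probability data are assumptions:

```python
import os
import requests

URL = "https://api.cerebras.ai/v1/chat/completions"  # assumed base URL
HEADERS = {"Authorization": f"Bearer {os.environ['CEREBRAS_API_KEY']}"}

payload = {
    "model": "qwen-3-32b",
    "messages": [{"role": "user", "content": "Is the sky blue? Answer yes or no."}],
    # Parameter names as spelled in this changelog entry:
    "log_probs": True,   # return the log probability of each generated token
    "top_log_probs": 3,  # also return the 3 most likely alternatives per position
}

resp = requests.post(URL, headers=HEADERS, json=payload, timeout=30)
resp.raise_for_status()
choice = resp.json()["choices"][0]
print(choice["message"]["content"])
print(choice.get("logprobs"))  # per-token log-probability data; exact shape may vary
```
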
The `llama3.1-70b` model will be automatically upgraded to `llama-3.3-70b`. Any existing references to `llama3.1-70b` in your code will continue to work during a short-term aliasing period. However, we strongly encourage you to update your references to the `llama-3.3-70b` model as soon as possible, since the aliasing will not be maintained indefinitely.
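One low-risk way to handle the rename is to resolve model names at a single call site. The mapping below comes from this entry; the helper itself is a hypothetical sketch:

```python
# Map retired model names to their upgrades so the rest of the
# codebase only needs to change in one place. (Hypothetical helper.)
MODEL_UPGRADES = {"llama3.1-70b": "llama-3.3-70b"}

def resolve_model(name: str) -> str:
    """Return the current name for a possibly-aliased model."""
    return MODEL_UPGRADES.get(name, name)

assert resolve_model("llama3.1-70b") == "llama-3.3-70b"
assert resolve_model("qwen-3-32b") == "qwen-3-32b"  # unaffected names pass through
```
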
We now support `llama-3.3-70b`, Meta’s newly released model that delivers enhanced performance across popular benchmarks for use cases including chat, coding, instruction following, mathematics, and reasoning. We serve this model at a speed of 2100+ tokens per second.

We now support the `completions` endpoint.
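A sketch of a call to the endpoint, assuming it follows the familiar prompt-in, text-out shape of OpenAI-style completions APIs; the URL and response layout are assumptions:

```python
import os
import requests

URL = "https://api.cerebras.ai/v1/completions"  # assumed base URL and path
HEADERS = {"Authorization": f"Bearer {os.environ['CEREBRAS_API_KEY']}"}

payload = {
    "model": "llama-3.3-70b",
    "prompt": "The three primary colors are",  # raw prompt, no chat roles
}

resp = requests.post(URL, headers=HEADERS, json=payload, timeout=30)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```
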
The `max_tokens` parameter has been renamed to `max_completion_tokens`, to maintain consistency with OpenAI’s syntax.
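The change is mechanical; assuming an OpenAI-compatible chat payload, it amounts to one key rename:

```python
payload = {
    "model": "llama-3.3-70b",
    "messages": [{"role": "user", "content": "Summarize RFC 2119 in one line."}],
    # Old spelling, now renamed:
    # "max_tokens": 64,
    # New spelling, matching OpenAI's parameter name:
    "max_completion_tokens": 64,
}
```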