> ## Documentation Index
> Fetch the complete documentation index at: https://inference-docs.cerebras.ai/llms.txt
> Use this file to discover all available pages before exploring further.

# Change Log

For model deprecations, see [Deprecations](/support/deprecation).

<Update label="2026-05-01">
  **Projects is now generally available**

  [Projects](/console/projects) is out of private preview and available to all organizations. Use Projects to group API keys, set per-project rate limits, segment usage analytics, and manage member access across isolated workspaces. Every organization starts with a Default Project, so existing setups are unchanged.
</Update>

<Update label="2026-04-27">
  **New dedicated models: GLM 5, GLM 5.1, and Kimi K2.6**

  Z.AI's GLM 5 and GLM 5.1 and Moonshot AI's Kimi K2.6 are now supported on [Dedicated Endpoints](/dedicated/overview#supported-models).
</Update>

<Update label="2026-04-24">
  **Validation Errors Now Return `400` Instead of `422`**

  API requests that fail validation now return HTTP **`400 Bad Request`** instead of **`422 Unprocessable Entity`**, aligning the HTTP status code with standard behavior across major inference providers. This applies to all validation errors, including missing required fields, malformed request bodies, and unsupported parameters. The error response body schema is unchanged.

  **SDK impact**

  If you use the Cerebras or OpenAI Python or Node.js SDK, validation errors now raise `BadRequestError` instead of `UnprocessableEntityError`. Both SDKs map HTTP status codes to distinct exception classes.

  |                    | Before                     | After             |
  | ------------------ | -------------------------- | ----------------- |
  | HTTP status code   | `422`                      | `400`             |
  | Python exception   | `UnprocessableEntityError` | `BadRequestError` |
  | Node.js error name | `UnprocessableEntityError` | `BadRequestError` |

  **Action required**

  * If your code handles `UnprocessableEntityError` by name or checks for status code `422`, update to `BadRequestError` or `400`.
  * If you use raw HTTP and branch on the status code, update `422` to `400`.
  * No changes are needed if your error handling catches the base class (`APIStatusError` or `APIError`).
</Update>

<Update label="2026-04-22">
  **New parameter: `prompt_cache_key`**

  [/v1/chat/completions](/api-reference/chat-completions#param-prompt-cache-key) and [/v1/completions](/api-reference/completions#param-prompt-cache-key) now accept an optional [`prompt_cache_key`](/capabilities/prompt-caching#maximize-cache-hits-with-prompt_cache_key) parameter. Requests that share the same key are routed together so they reuse the same [prompt cache](/capabilities/prompt-caching), increasing cache hits and reducing time to first token. Set it to your conversation ID for chat sessions, your user or session ID for per-user workloads, or a hash of your shared prefix for RAG and agentic workloads.
</Update>

<Update label="2026-04-10">
  **Payload Optimization: msgpack and gzip**

  [/v1/chat/completions](/api-reference/chat-completions) and [/v1/completions](/api-reference/completions) now accept request bodies encoded with [msgpack](https://msgpack.org/) (`Content-Type: application/vnd.msgpack`) and/or compressed with gzip (`Content-Encoding: gzip`). This can significantly reduce payload size and improve TTFT for requests with long prompts. See [Payload Optimization](/capabilities/payload-optimization) for details and examples.
</Update>

<Update label="2026-03-31">
  **New Sampling Parameters**

  The following parameters are now available on all models:

  * [`frequency_penalty`](/api-reference/chat-completions#param-frequency-penalty) – Reduces repetition by penalizing tokens based on how often they have appeared so far
  * [`presence_penalty`](/api-reference/chat-completions#param-presence-penalty) – Reduces repetition by penalizing tokens that have appeared at least once
  * [`logit_bias`](/api-reference/chat-completions#param-logit-bias) – Adjusts the likelihood of specific tokens appearing in the response
</Update>

<Update label="2026-03-30">
  **Streaming: Server-Sent Events (SSE)**

  [Streaming](/capabilities/streaming) responses for Chat Completions (`/v1/chat/completions`) and Completions (`/v1/completions`) now emit a standard `data: [DONE]` SSE event at the end of the stream, following the SSE standard (see the [living specification](https://html.spec.whatwg.org/multipage/server-sent-events.html#server-sent-events)).

  The `[DONE]` marker is guaranteed to appear exactly once and is always the last event in the stream.
</Update>

<Update label="2026-03-27">
  Rolling out selective 4-bit weight-only quantization for supported models, targeting non-sensitive layers while preserving full precision for sensitive layers, in line with industry standards. Learn more in [Supported Models](/models/overview#model-compression).
</Update>

<Update label="2026-03-26">
  **Temperature Parameter: Increased Maximum**

  The maximum allowed value for the `temperature` parameter for [Chat Completions](/api-reference/chat-completions) has been increased from **1.5** to **2.0**. This aligns with the range supported by OpenAI, Fireworks, and other major LLM providers. Higher temperature values produce more random and creative outputs.
</Update>

<Update label="2026-03-26">
  **`developer` Message Role (gpt-oss-120b)**

  The [Chat Completions](/api-reference/chat-completions) API now supports the `developer` message role for `gpt-oss-120b`, aligning with the OpenAI API message format. The `developer` role is functionally equivalent to `system` — the `system` role remains supported for backwards compatibility. Other models do not support the `developer` role.
</Update>

<Update label="2026-03-25">
  **Logprob Consistency**

  Logprobs are now computed from the model's raw output, before applying temperature scaling. Previously, logprob values would vary with temperature — they now remain consistent regardless of the temperature used. This aligns with standard behavior across major inference providers and vLLM. Sampled outputs are unaffected.
</Update>

<Update label="2026-03-24">
  **Deprecating `disable_reasoning` parameter for `zai-glm-4.7`**

  The `disable_reasoning` parameter is deprecated and will no longer be supported after **July 21, 2026**.

  Migrate to `reasoning_effort="none"` to disable reasoning on `zai-glm-4.7`. See the [Reasoning guide](/capabilities/reasoning) and [Chat Completions API reference](/api-reference/chat-completions) for details.
</Update>

<Update label="2026-02-27">
  **`reasoning_effort="none"` support for GLM 4.7**

  `zai-glm-4.7` now accepts the `reasoning_effort` parameter. Reasoning is enabled by default — set `reasoning_effort="none"` to disable it for simple tasks where verbose reasoning blocks are unnecessary. The `disable_reasoning` parameter continues to be supported but will be deprecated soon.

  See the [Chat Completions API reference](/api-reference/chat-completions) for full parameter details.
</Update>

<Update label="2026-01-22">
  Monitor your dedicated inference endpoints with the new **Metrics API**, providing Prometheus-compatible metrics for requests, tokens, latency percentiles, and endpoint health.

  Metrics are aggregated at the minute level and integrate directly with Prometheus. Learn more in our [Metrics guide](/capabilities/metrics).
</Update>

<Update label="2026-01-21">
  **API Version 2 Available for Testing**

  API version 2 is now available for testing via the `X-Cerebras-Version-Patch` header. This version introduces stricter validation and breaking changes.

  **Key changes in version 2:**

  * **Structured Outputs** - When using `strict: true`, stricter schema validation requiring `additionalProperties: false` at all object levels
  * **Tool Calling** - Stricter message validation for multi-turn conversations (tool response completeness, orphan tool messages, tool choice consistency, unique tool call IDs)
  * **Reasoning Models** - Adds a separate `reasoning_logprobs` field for reasoning token logprobs
  * **Unicode Fix** - Logprobs now reflect partial Unicode tokens as they appear in the model's vocabulary

  **Timeline:**

  * **January 21, 2026** - Version 2 available for testing via header
  * **July 21, 2026** - Version 2 becomes the default; older versions end-of-life

  Test now by adding `extra_headers={"X-Cerebras-Version-Patch": "2"}` to your requests. See [Versions](/api-reference/versions) for migration details.
</Update>

<Update label="2026-01-14">
  Prioritize requests to the API to balance latency sensitivity and resource allocation across your workloads with the new **Service Tiers** feature.

  Learn more in our [Service Tiers guide](/capabilities/service-tiers).
</Update>

<Update label="2026-01-09">
  **Constrained Decoding Generally Available**

  Constrained decoding has moved to General Availability (GA) with expanded capabilities.

  **Supported models**

  [Structured Outputs](/capabilities/structured-outputs#understanding-strict-mode) and [Tool Calling](/capabilities/tool-use#strict-mode-for-tool-calling) with `strict: true` are now available on the following models:

  * `gpt-oss-120b`
  * `llama-3.1-8b`
  * `qwen-3-235b-a22b-instruct-2507`
  * `qwen-3-32b`
  * `zai-glm-4.6`
  * `zai-glm-4.7`

  **New capabilities**

  * **Parallel Tool Calling + Constrained Decoding** - Models that support parallel tool calling can now use constrained decoding for all tool calls in a single response
  * **Expanded JSON Schema support** - Additional schema features are now supported
  * **Streaming + Reasoning + Constrained Decoding** - All three features can now be used together in a single request
</Update>

<Update label="2026-01-06">
  Added preview support for [Z.ai GLM 4.7](/models/zai-glm-47): `zai-glm-4.7`
</Update>

<Update label="2025-12-18">
  Process large-scale inference workloads asynchronously with the new **Batch API** and **Files API**. Submit up to 50,000 requests in a single batch job and get 50% cost savings.

  Learn more in our [Batch guide](/capabilities/batch).
</Update>

<Update label="2025-12-17">
  Added support for parallel tool calling, allowing models to request multiple tool invocations simultaneously in a single response. This reduces latency when handling queries that require multiple independent operations.

  Learn more in our [Tool Calling guide](/capabilities/tool-use#parallel-tool-calling).
</Update>

<Update label="2025-12-16">
  **Streaming Support**

  [Streaming](/capabilities/streaming) is now supported for all models, including with structured output requests (`response_format`, `tools`).

  **Reasoning Format Controls**

  Added `reasoning_format` parameter for reasoning models, giving you control over how reasoning text appears in responses. Options include:

  * `parsed` - Returns reasoning in a separate `reasoning` field with logprobs separated into `reasoning_logprobs`
  * `text_parsed` - Returns reasoning in a separate `reasoning` field (logprobs stay together)
  * `raw` - Includes reasoning inline with content using `<think>` tags
  * `hidden` - Drops reasoning text from the response (tokens still counted)

  See [Reasoning](/capabilities/reasoning) for details.

  **Logprobs with Structured Outputs**

  `logprobs` are now supported for structured output requests (`response_format`, `tools`), matching OpenAI's API behavior.

  When constrained decoding limits the valid token set below the requested `top_logprobs`, the response is padded with additional tokens from the model vocabulary, each assigned a logprob of `-100`. If only a single valid token remains after constraint masking, that token will have a logprob of `0` (probability 1). The output will not contain duplicate tokens.

  **Multi-turn Tool Calling for Llama 3.3 70B**

  Multi-turn tool calling is now supported for `llama-3.3-70b`. A bug preventing this functionality has been resolved.
</Update>

<Update label="2025-12-10">
  Added support for prompt caching, which automatically stores and reuses prompt prefixes to reduce latency and improve time to first token for similar requests. Learn more in our [Prompt Caching guide](/capabilities/prompt-caching).
</Update>

<Update label="2025-11-24">
  We now support Predicted Outputs for faster generation when large portions of the output are known in advance. Provide a draft of the expected response via the `prediction` field, and the model will efficiently reuse unchanged tokens while regenerating only those that differ.

  Ideal for code editing, document revisions, and template-based responses. Learn more in our [Predicted Outputs guide](/capabilities/predicted-outputs).
</Update>

<Update label="2025-11-14">
  **Deprecated `qwen-3-235b-a22b-thinking-2507`**

  We recommend migrating to [GPT OSS 120B](/models/openai-oss).
</Update>

<Update label="2025-11-07">
  Added preview support for Z.ai GLM 4.6: `zai-glm-4.6`
</Update>

<Update label="2025-11-05">
  **Deprecated `qwen-3-coder-480b`**. We recommend migrating to [Z.ai GLM 4.7](/models/zai-glm-47).
</Update>

<Update label="2025-11-03">
  **Deprecated `llama-4-scout-17b-16e-instruct`**

  See [Deprecations](/support/deprecation) for more details.
</Update>

<Update label="2025-10-22">
  Added support for retrieving log probabilities via the `logprobs` parameter for the [Completions endpoint](/api-reference/completions#param-logprobs). Set to an integer (0-20) to return the most likely alternative tokens at each position, or null to disable.
</Update>

<Update label="2025-10-15">
  **Deprecated `llama-4-maverick-17b-128e-instruct`**

  See [Deprecations](/support/deprecation) for more details.
</Update>

<Update label="2025-10-06">
  We're gradually introducing multi-token streaming across our models. Instead of sending tokens individually, we'll now deliver them in batches through 200 evenly-spaced events per second. This change eliminates the artificial delays that occurred with single-token streaming. Please verify that your application doesn't depend on specific token delivery patterns.
</Update>

<Update label="2025-10-02">
  **Updated support for OpenAI GPT-OSS (`gpt-oss-120b`)**

  * Tool calling with `strict: true` (constrained decoding)
  * Response format with `json_schema` when using `strict=true`
  * Response format with `json_object` when using `strict=true`
  * `tool_choice: none`
  * Updated JSON Schema limitations:
    * Nested levels increases from 5 to 10
    * Max properties increases from 100 to 500
    * Enabled `maximum`, `minimum`, `multipleOf` fields
    * `required` no longer needs to specify all properties
    * `anyOf` is no longer limited to 5 properties
</Update>

<Update label="2025-08-13">
  Added production support for OpenAI GPT OSS: `gpt-oss-120b`
</Update>

<Update label="2025-08-12">
  **Deprecated `deepseek-r1-distill-llama-70b`**

  We recommend migrating to [Qwen 3 32B ](/models/qwen-3-32b).
</Update>

<Update label="2025-08-05">
  Added preview support for OpenAI GPT OSS: `gpt-oss-120b`
</Update>

<Update label="2025-07-31">
  Added preview support for Qwen 3 Coder 480B: `qwen-3-coder-480b`.
</Update>

<Update label="2025-07-31">
  Added preview support for Qwen 3 235B Thinking: `qwen-3-235b-a22b-thinking-2507`.
</Update>

<Update label="2025-07-29">
  **`Deprecated qwen-3-235b-a22b`**

  We recommend migrating to either [Qwen 3 235B Instruct](/models/qwen-3-235b-2507) (available now) or Qwen 3 235B Thinking (coming soon).
</Update>

<Update label="2025-07-29">
  Added preview support for Qwen 3 235B Instruct: `qwen-3-235b-a22b-instruct-2507`.
</Update>

<Update label="2025-07-18">
  Added preview support for Llama 4 Maverick: `llama-4-maverick-17b-128e-instruct`.
</Update>

<Update label="2025-07-09">
  Added support for Qwen 3 235B: `qwen-3-235b-a22b`.
</Update>

<Update label="2025-05-14">
  * Added support for Qwen 3 32B: `qwen-3-32b`.
  * Improvements to tool calling, including support for multi-turn tool calls.
</Update>

<Update label="2025-05-14">
  * Added support for Qwen 3 32B: `qwen-3-32b`.
  * Improvements to tool calling, including support for multi-turn tool calls.
</Update>

<Update label="2025-04-09">
  * Added support for Meta's newly released Llama 4 Scout Model: `llama-4-scout-17b-16e-instruct`.
  * Improvements to tool calling.
</Update>

<Update label="2025-02-27">
  We have improved support for [structured outputs for chat completions](/capabilities/structured-outputs), with constrained decoding when `strict` is set to `true`. This feature allows you to enforce consistent JSON outputs for models, which is useful when building applications that need to process AI-generated data programmatically.
</Update>

<Update label="2025-02-19">
  Added support for `log_probs` and `top_log_probs` in the [`chat/completions`](/api-reference/chat-completions) endpoint.
</Update>

<Update label="2024-12-18">
  Deprecation notice: The `llama3.1-70b` model will be automatically upgraded to `llama-3.3-70b`. Any existing references to `llama3.1-70b` in your code will continue to work during a short-term aliasing period. However, we strongly encourage you to update your references to the `llama-3.3-70b` model as soon as possible, since the aliasing will not be maintained indefinitely.
</Update>

<Update label="2024-12-10">
  **Support for Llama 3.3 70B**

  * We now support [`llama-3.3-70b`](https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_3/), Meta's newly released model that delivers enhanced performance across popular benchmarks for use cases including chat, coding, instruction following, mathematics, and reasoning. We serve this model at a speed of [2100+ tokens per second](https://artificialanalysis.ai/models/llama-3-3-instruct-70b/providers#speed).
</Update>

<Update label="2024-11-20">
  Support for [`completions`](/api-reference/completions) endpoint.
</Update>

<Update label="2024-10-24">
  Performance Upgrade: This release introduces speculative decoding, a technique that uses both a small model and a large model together to generate responses more quickly. Llama 3.1 70B now achieves an average output speed of 2,100 tokens per second. Please note that with speculative decoding, output speeds may fluctuate by up to 20% compared to the average.
</Update>

<Update label="2024-10-03">
  **Continued Performance Improvements**

  We currently serve Llama-3.1-8B at \~2000 tokens/sec, and Llama-3.1-70B at \~560 tokens/second.

  **Integration with AutoGen**

  Developers can now use the Cerebras Inference API with [Microsoft AutoGen](https://www.microsoft.com/en-us/research/project/autogen/), an open-source framework for building AI agents. AutoGen streamlines the creation of advanced LLM applications by managing multi-agent conversations and optimizing workflows. With this integration, users can leverage features like tool use and parallel tool calling, while benefiting from Cerebras' fast inference with Llama 3.1 8B and 70B models. For an example, visit the documentation page [here](https://microsoft.github.io/autogen/docs/topics/non-openai-models/cloud-cerebras).

  **Other Updates**:

  * Users can now sign in to the [developer playground](https://cloud.cerebras.ai/?utm_source=3pi_change-log\&utm_campaign=support) using a magic link, without the need for setting up and remembering a password.
  * The `max_tokens` parameters has been renamed to `max_completion_tokens`, to maintain consistency with OpenAI's syntax.
  * We have updated our documentation to include a list of our available [integrations](/integrations) for the Cerebras Inference SDK.
</Update>
