prediction request parameter. The model reuses matching tokens and regenerates only those that differ, improving output generation speed.
When to Use Predicted Outputs
Use Predicted Outputs in scenarios where most of the model’s response is already known or can be pre-computed. Recommended use cases include:
- Code Refactoring: Modify known code without regenerating from scratch (e.g., tab/inline completion, full-file edits, structural transformations)
- Document Editing: Apply small edits to known documents (e.g., grammar fixes, tone adjustments)
- Template Filling: Update placeholders or small sections in predictable structured text
Usage
For example, imagine you want to modify a CSS file to change the color of all body text from green to blue. To use Predicted Outputs, include the code snippet below as part of both your prompt and the predicted output.
1. Initial Setup
Begin by importing the Cerebras SDK and setting up the client.
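A minimal setup sketch is shown below. It assumes the Cerebras Python SDK is installed (`pip install cerebras_cloud_sdk`) and that your API key is available in the `CEREBRAS_API_KEY` environment variable.

```python
# Minimal client setup (sketch). Assumes the Cerebras Python SDK is installed
# and CEREBRAS_API_KEY is set in the environment.
import os

from cerebras.cloud.sdk import Cerebras

client = Cerebras(
    api_key=os.environ.get("CEREBRAS_API_KEY"),
)
```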
2. Include expected content
Include the prediction parameter in your chat completions request in addition to your normal messages. Set the prediction field to include the content you expect to be reused.
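The sketch below illustrates the CSS example from above. It assumes the prediction parameter follows the OpenAI-compatible shape `{"type": "content", "content": ...}`; the model name and file contents are illustrative.

```python
# Sketch of a Predicted Outputs request for the CSS example above.
# Assumes an OpenAI-compatible `prediction` parameter; model name is illustrative.
css_code = """
body {
  color: green;
  font-family: sans-serif;
}
"""

response = client.chat.completions.create(
    model="qwen-3-32b",
    messages=[
        {
            "role": "user",
            "content": "Change the body text color from green to blue. "
                       "Return only the full updated CSS file.\n" + css_code,
        }
    ],
    # Most of the file will be unchanged, so pass it as the prediction.
    prediction={"type": "content", "content": css_code},
)

print(response.choices[0].message.content)
```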
Token-Reuse Metrics
For the best performance, your prediction should have a high token-reuse rate. The response includes usage metrics showing how many prediction tokens were accepted or rejected:
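A quick way to inspect these metrics is sketched below. The exact field nesting is an assumption here; this sketch assumes the accepted/rejected counts are exposed under `usage.completion_tokens_details`, as in OpenAI-compatible responses.

```python
# Inspect token-reuse metrics from the response (sketch).
# Field nesting is assumed to follow the OpenAI-compatible usage object.
details = response.usage.completion_tokens_details
print("accepted prediction tokens:", details.accepted_prediction_tokens)
print("rejected prediction tokens:", details.rejected_prediction_tokens)
```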
Best Practices
- Use this when most of the output is known: The larger the known section, the greater the efficiency gain. Predicted Outputs work best when you can anticipate significant portions of the response.
- Set temperature=0: Reduces randomness and increases the likelihood of token acceptance from your prediction.
- Keep predictions accurate: Misaligned predictions increase rejected tokens and can slow generation speed. Ensure your prediction closely matches the expected output.
- Monitor prediction metrics: Track accepted vs rejected tokens in the usage metadata to evaluate effectiveness.
- Fallback gracefully: If the rejection rate is high for a class of prompts or files, fall back to a standard completion request without the prediction field (see the sketch after this list).
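A hedged sketch of that fallback pattern follows: track the running rejection rate and stop sending predictions once it crosses a threshold. The threshold, helper names, and usage-field nesting are illustrative assumptions, not part of the API.

```python
# Fallback sketch: stop sending predictions for a prompt class whose
# rejection rate is high. Threshold and helper names are illustrative.
class PredictionGate:
    def __init__(self, max_rejection_rate=0.5):
        self.max_rejection_rate = max_rejection_rate
        self.accepted = 0
        self.rejected = 0

    def use_prediction(self):
        total = self.accepted + self.rejected
        if total == 0:
            return True
        return (self.rejected / total) <= self.max_rejection_rate

    def record(self, usage):
        details = usage.completion_tokens_details
        self.accepted += details.accepted_prediction_tokens
        self.rejected += details.rejected_prediction_tokens


gate = PredictionGate()

def refactor(client, messages, predicted_text, model="qwen-3-32b"):
    kwargs = {"model": model, "messages": messages, "temperature": 0}
    if gate.use_prediction():
        kwargs["prediction"] = {"type": "content", "content": predicted_text}
    response = client.chat.completions.create(**kwargs)
    if "prediction" in kwargs:
        gate.record(response.usage)
    return response
```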
Limitations
Please consider the following limitations when using Predicted Outputs.
- The following models are currently supported: gpt-oss-120b, qwen-3-32b
- When you provide a prediction, any tokens that do not appear in the final completion are still billed at completion-token rates. To determine how many predicted tokens were not used, review the rejected_prediction_tokens property in the usage object.
- The following API parameters are not supported when using this feature:
  - logprobs: not supported
  - n: values greater than 1 are not supported
  - tools: tool calling is not currently supported with Predicted Outputs
FAQ
Does this increase API costs?
Only when predicted tokens are not accepted. Input and output tokens are billed at standard rates, while rejected prediction tokens are billed at the output token rate. Customers with dedicated endpoints are not affected by this pricing.
How do I know if my prediction was accepted?
Check accepted_prediction_tokens and rejected_prediction_tokens in the response’s usage object.
What happens if my predicted text is wrong?
The model rejects mismatched tokens and regenerates them, which may reduce output generation speed and increase costs since rejected tokens are billed as additional output tokens.
Does Cerebras store prediction data?
No, we do not store any prediction data.

