Chat Completions
POST
Generate conversational responses using a structured message format with roles (system, user, assistant, developer, tool). Best for chatbots, assistants, and multi-turn conversations. Parameter support can differ depending on the model used to generate the response, particularly for newer reasoning models. For details about parameters in reasoning models, refer to the Reasoning Guide.Documentation Index
Fetch the complete documentation index at: https://inference-docs.cerebras.ai/llms.txt
Use this file to discover all available pages before exploring further.
Request
Headers
The media type of the request body.Supported values:
application/json, application/vnd.msgpackDefault: application/jsonSee Payload Optimization for details.The compression encoding applied to the request body.Supported values:
gzipWhen set, the request body must be gzip-compressed. Can be combined with any supported Content-Type.See Payload Optimization for details.Controls the queue time threshold for requests using the Valid range:
flex or auto service tiers. Requests are preemptively rejected if the rolling average queue time exceeds this threshold.This feature is in Private Preview. For access or more information, contact us or reach out to your account representative.
50 - 20000 (milliseconds)Default: System default if not specifiedSee Service Tiers for more information.Body
A list of messages comprising the conversation so far.
Available options:
gpt-oss-120bzai-glm-4.7(preview)
Controls whether thinking content from previous conversation turns is included in the prompt context.Note: Thinking content from the current (latest unfinished) turn is always included regardless of this setting.
false- Thinking from all previous turns is preserved in the conversation history. Recommended for agentic workflows where reasoning from past tool-calling turns may be relevant for future tool calls.true(default) - Thinking from earlier turns is excluded. Recommended for general chat conversations where reasoning from past turns is less relevant for performance.
null, the API defaults to clear_thinking: true.This parameter is supported only on the zai-glm-4.7 model. For additional information, see Preserved thinking in the Z.ai documentation.
A number between -2.0 and 2.0. Positive values reduce the likelihood of the model repeating tokens by applying a penalty proportional to how frequently each token has already appeared in the generated output.Minimum:
-2, Maximum: 2Default: 0Modify the likelihood of specified tokens appearing in the completion. Accepts a JSON object that maps tokens (specified by their token ID in the tokenizer) to an associated bias value from -100 to 100. Mathematically, the bias is added to the logits generated by the model prior to sampling. The exact effect will vary per model, but values between -1 and 1 should decrease or increase likelihood of selection; values like -100 or 100 should result in a ban or exclusive selection of the relevant token.Default:
nullWhether to return log probabilities of the output tokens or not.Default:
falseThe maximum number of tokens that can be generated in the completion, including reasoning tokens. The total length of input tokens and generated tokens is limited by the model’s context length.
Whether to enable parallel function calling during tool use. When enabled (default), the model can request multiple tool calls simultaneously in a single response. When disabled, the model will only request one tool call at a time.Default:
trueConfiguration for a Predicted Output, which can greatly speed up response times when large parts of the model response are known in advance. This is most common when you are regenerating a file with mostly minor changes to the content.Visit our page on Predicted Outputs for more information and examples.
A number between -2.0 and 2.0. Positive values reduce the likelihood of the model repeating tokens that have already appeared in the output, encouraging the model to introduce new topics.Minimum:
-2, Maximum: 2Default: 0An opaque identifier that groups related requests so they reuse the same prompt cache. Requests sharing the same
prompt_cache_key are routed together, which increases cache hits and reduces time to first token.Set it to a stable identifier like a conversation ID, user ID, or session ID.Maximum length: 1024 charactersDefault: nullprompt_cache_key must be enabled on your account before you can use it. Contact us or reach out to your account representative to request access.Controls the amount of reasoning the model performs. Supported values vary by model:gpt-oss-120b
"low"– Minimal reasoning, faster responses"medium"– Moderate reasoning (default)"high"– Extensive reasoning, more thorough analysis
"none"– Disables reasoning entirely
This parameter is only available for gpt-oss-120b and zai-glm-4.7 models.
An object that controls the format of the model response.Setting to
{ "type": "json_schema", "json_schema": { "name": "schema_name", "strict": true, "schema": {...} } } enforces schema compliance by ensuring that the model output conforms to your specified JSON schema. See Structured Outputs for more information.Setting { "type": "json_object" } enables the legacy JSON mode, ensuring that the model output is valid JSON. However, using json_schema is recommended for models that support it.If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same
seed and parameters should return the same result. Determinism is not guaranteed.Controls request prioritization.Available options:
This feature is in Private Preview. For access or more information, contact us or reach out to your account representative.
priority- Highest priority processing (Only available for dedicated endpoints, not shared endpoints.)default- Standard priority processingauto- Automatically uses the highest available service tierflex- Lowest priority processing
defaultSee Service Tiers for more information.Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
If set, partial message deltas will be sent.
What sampling temperature to use, between 0 and 2.0. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.Minimum:
0, Maximum: 2Controls which (if any) tool is called by the model.
none means the model will not call any tool and instead generates a message. auto means the model can pick between generating a message or calling one or more tools. required means the model must call one or more tools. Specifying a particular tool via {"type": "function", "function": {"name": "my_function"}} forces the model to call that tool.none is the default when no tools are present. auto is the default if tools are present.A list of tools the model may call. Currently, only functions are supported as a tool. Use this to provide a list of functions the model may generate JSON inputs for.Specifying tools consumes prompt tokens in the context. If too many are given, the model may perform poorly or you may hit context length limitations
An integer between 0 and 20 specifying the number of most likely tokens to return at each token position, each with an associated log probability.
logprobs must be set to true if this parameter is used.Minimum: 0, Maximum: 20An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So, 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.Minimum:
0, Maximum: 1A unique identifier representing your end-user, which can help to monitor and detect abuse.
Response
A unique identifier for the chat completion.
A list of chat completion choices. Can be more than one if
n is greater than 1.The Unix timestamp (in seconds) of when the chat completion was created.
The model used for the chat completion.
The object type, which is always
chat.completion.A fingerprint for the model or backend used to generate the response.
The service tier used for the request, or
null if not specified.The service tier used for processing the request. Only present when
service_tier is set to auto in the request.Possible values: priority, default, flexUsage statistics for the completion request.
Performance timing information for the request.

