POST https://api.cerebras.ai/v1/chat/completions
import os

from cerebras.cloud.sdk import Cerebras

client = Cerebras(api_key=os.environ.get("CEREBRAS_API_KEY"))

chat_completion = client.chat.completions.create(
    model="gpt-oss-120b",
    messages=[
        {"role": "user", "content": "Hello!"}
    ],
)
print(chat_completion)
{
  "id": "chatcmpl-292e278f-514e-4186-9010-91ce6a14168b",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "Hello! How can I assist you today?",
        "reasoning": "The user is asking for a simple greeting to the world. This is a straightforward request that doesn't require complex analysis. I should provide a friendly, direct response.",
        "role": "assistant"
      }
    }
  ],
  "created": 1723733419,
  "model": "gpt-oss-120b",
  "system_fingerprint": "fp_70185065a4",
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 10,
    "total_tokens": 22,
    "prompt_tokens_details": {
      "cached_tokens": 0
    },
    "completion_tokens_details": {
      "accepted_prediction_tokens": 0,
      "rejected_prediction_tokens": 0
    }
  },
  "time_info": {
    "queue_time": 0.000073161,
    "prompt_time": 0.0010744798888888889,
    "completion_time": 0.005658071111111111,
    "total_time": 0.022224903106689453,
    "created": 1723733419
  }
}
Generate conversational responses using a structured message format with roles (system, user, assistant). Best for chatbots, assistants, and multi-turn conversations.
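For example, a multi-turn request reuses the client from the snippet above, passes the system prompt as a string, and alternates user and assistant messages (a minimal sketch; prompt and model choice are illustrative):

chat_completion = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "Paris."},
        {"role": "user", "content": "And its population?"},
    ],
)
print(chat_completion.choices[0].message.content)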

Request

Headers

queue_threshold
string
Controls the queue time threshold for requests using the flex or auto service tiers. Requests are preemptively rejected if the rolling average queue time exceeds this threshold.
This feature is in Private Preview. For access or more information, contact us or reach out to your account representative.
Valid range: 50 - 20000 (milliseconds). Default: System default if not specified. See Service Tiers for more information.
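A hypothetical sketch of setting this header from the Python SDK, assuming the SDK's per-request extra_headers option; the header name is taken from this reference and the feature requires Private Preview access:

chat_completion = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Hello!"}],
    service_tier="flex",
    # Assumed pass-through header; reject the request if the rolling average
    # queue time exceeds 5000 ms.
    extra_headers={"queue_threshold": "5000"},
)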

Body

clear_thinking
boolean | null
Controls whether thinking content from previous conversation turns is included in the prompt context. Note: Thinking content from the current (latest unfinished) turn is always included regardless of this setting.
  • false - Thinking from all previous turns is preserved in the conversation history. Recommended for agentic workflows where reasoning from past tool-calling turns may be relevant for future tool calls.
  • true (default) - Thinking from earlier turns is excluded. Recommended for general chat conversations where reasoning from past turns is less relevant for performance.
When this parameter is not specified or set to null, the API defaults to clear_thinking: true.
This parameter is supported only on the zai-glm-4.7 model. For additional information, see Preserved thinking in the Z.ai documentation.
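A minimal sketch of preserving prior-turn thinking for an agentic workflow, assuming the installed SDK version accepts clear_thinking as a keyword argument:

chat_completion = client.chat.completions.create(
    model="zai-glm-4.7",
    messages=[{"role": "user", "content": "Summarize our plan so far."}],
    clear_thinking=False,  # keep reasoning from earlier tool-calling turns in context
)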
logprobs
boolean
Whether to return log probabilities of the output tokens or not. Default: false
max_completion_tokens
integer | null
The maximum number of tokens that can be generated in the completion, including reasoning tokens. The total length of input tokens and generated tokens is limited by the model’s context length.
messages
object[]
required
A list of messages comprising the conversation so far. Note: System prompts must be passed to the messages parameter as a string. Support for other object types will be added in future releases.
model
string
required
Available options:
  • llama3.1-8b
  • llama-3.3-70b
  • qwen-3-32b
  • qwen-3-235b-a22b-instruct-2507 (preview)
  • gpt-oss-120b
  • zai-glm-4.7 (preview)
parallel_tool_calls
boolean | null
Whether to enable parallel function calling during tool use. When enabled (default), the model can request multiple tool calls simultaneously in a single response. When disabled, the model will only request one tool call at a time. Default: true
prediction
object | null
Configuration for a Predicted Output, which can greatly speed up response times when large parts of the model response are known in advance. This is most common when you are regenerating a file with mostly minor changes to the content. Visit our page on Predicted Outputs for more information and examples.
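A hypothetical sketch of supplying a prediction when regenerating a lightly edited file; the inner {"type": "content", ...} shape is assumed here, so consult the Predicted Outputs page for the exact schema:

original_code = open("app.py").read()  # file being lightly edited

chat_completion = client.chat.completions.create(
    model="gpt-oss-120b",
    messages=[
        {"role": "user", "content": "Rename the function `load` to `load_config` in this file:\n" + original_code}
    ],
    # Assumed shape for the prediction object; see the Predicted Outputs docs.
    prediction={"type": "content", "content": original_code},
)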
reasoning_effort
string | null
Controls the amount of reasoning the model performs. Available values:
  • "low" - Minimal reasoning, faster responses
  • "medium" - Moderate reasoning (default)
  • "high" - Extensive reasoning, more thorough analysis
This parameter is only available for the gpt-oss-120b model.
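For example, requesting minimal reasoning on gpt-oss-120b (reusing the client defined above):

chat_completion = client.chat.completions.create(
    model="gpt-oss-120b",
    messages=[{"role": "user", "content": "Prove that the sum of two even numbers is even."}],
    reasoning_effort="low",  # trade thoroughness for lower latency
)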
response_format
object | null
An object that controls the format of the model response. Setting it to { "type": "json_schema", "json_schema": { "name": "schema_name", "strict": true, "schema": {...} } } enforces schema compliance by ensuring that the model output conforms to your specified JSON schema. See Structured Outputs for more information. Setting { "type": "json_object" } enables the legacy JSON mode, ensuring that the model output is valid JSON. However, using json_schema is recommended for models that support it.
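For instance, enforcing a small JSON schema (a sketch; the schema, name, and model choice are illustrative):

schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "country": {"type": "string"},
    },
    "required": ["city", "country"],
    "additionalProperties": False,
}

chat_completion = client.chat.completions.create(
    model="gpt-oss-120b",
    messages=[{"role": "user", "content": "Where is the Eiffel Tower?"}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "location", "strict": True, "schema": schema},
    },
)
print(chat_completion.choices[0].message.content)  # JSON string conforming to the schema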
seed
integer | null
If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed.
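For example (a sketch; determinism remains best effort):

# Repeated calls with the same seed and parameters aim to return the same result.
chat_completion = client.chat.completions.create(
    model="llama3.1-8b",
    messages=[{"role": "user", "content": "Pick a random animal."}],
    seed=42,
)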
service_tier
string | null
Controls request prioritization.
This feature is in Private Preview. For access or more information, contact us or reach out to your account representative.
Available options:
  • priority - Highest priority processing (Only available for dedicated endpoints, not shared endpoints.)
  • default - Standard priority processing
  • auto - Automatically uses the highest available service tier
  • flex - Lowest priority processing
Default: default. See Service Tiers for more information.
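A sketch of requesting the auto tier (Private Preview; assumes the SDK forwards service_tier as shown):

chat_completion = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Hello!"}],
    service_tier="auto",  # use the highest tier currently available
)
print(chat_completion.service_tier_used)  # only present when service_tier is "auto"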
stop
string | null
Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
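For example (prompt and stop sequence are illustrative):

chat_completion = client.chat.completions.create(
    model="llama3.1-8b",
    messages=[{"role": "user", "content": "Count upward from 1."}],
    stop="5",  # generation halts when "5" would be produced; the stop sequence is not returned
)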
stream
boolean | null
If set, partial message deltas will be sent.
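For example, streaming deltas and printing them as they arrive (reusing the client defined above):

stream = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Write a haiku about fast inference."}],
    stream=True,
)
for chunk in stream:
    # Each chunk carries a partial message delta; content may be None on some chunks.
    print(chunk.choices[0].delta.content or "", end="")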
temperature
number | null
What sampling temperature to use, between 0 and 1.5. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
top_logprobs
integer | null
An integer between 0 and 20 specifying the number of most likely tokens to return at each token position, each with an associated log probability. logprobs must be set to true if this parameter is used.
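For example, returning the top 3 candidates per position (a sketch; the exact shape of the returned logprobs object is described in the Response section):

chat_completion = client.chat.completions.create(
    model="llama3.1-8b",
    messages=[{"role": "user", "content": "Say hi."}],
    logprobs=True,
    top_logprobs=3,  # requires logprobs=True
)
print(chat_completion.choices[0].logprobs)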
top_p
number | null
An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So, 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
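For example, adjusting one of the two sampling knobs (not both):

chat_completion = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Suggest a name for a coffee shop."}],
    temperature=0.8,   # more varied output
    # top_p=0.1,       # alternative: nucleus sampling over the top 10% probability mass
)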
tool_choice
string | object
Controls which (if any) tool is called by the model. none means the model will not call any tool and instead generates a message. auto means the model can pick between generating a message or calling one or more tools. required means the model must call one or more tools. Specifying a particular tool via {"type": "function", "function": {"name": "my_function"}} forces the model to call that tool. none is the default when no tools are present. auto is the default if tools are present.
tools
object | null
A list of tools the model may call. Currently, only functions are supported as a tool. Use this to provide a list of functions the model may generate JSON inputs for. Specifying tools consumes prompt tokens in the context. If too many are given, the model may perform poorly or you may hit context length limitations.
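For example, defining a single function tool and letting the model decide whether to call it (a sketch; get_weather is a hypothetical function and the model choice is illustrative):

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

chat_completion = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
    tool_choice="auto",         # let the model choose between replying and calling the tool
    parallel_tool_calls=False,  # request at most one tool call per response
)
print(chat_completion.choices[0].message.tool_calls)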
user
string | null
A unique identifier representing your end-user, which can help to monitor and detect abuse.

Response

id
string
A unique identifier for the chat completion.
choices
object[]
A list of chat completion choices. Can be more than one if n is greater than 1.
created
integer
The Unix timestamp (in seconds) of when the chat completion was created.
model
string
The model used for the chat completion.
object
string
The object type, which is always chat.completion.
service_tier_used
string
The service tier used for processing the request. Only present when service_tier is set to auto in the request. Possible values: priority, default, flex
usage
object
Usage statistics for the completion request.
time_info
object
Performance timing information for the request.