Reasoning capabilities are currently available for the OpenAI GPT OSS (gpt-oss-120b), Qwen3 (qwen3-32b), and Z.ai GLM (zai-glm-4.6, zai-glm-4.7) models. Each model family has slight variations in the parameters used to control reasoning.

Reasoning Format

Control how reasoning text appears in responses using the reasoning_format parameter.

Available Formats

  • parsed - Reasoning is returned in a separate reasoning field; logprobs are separated into reasoning_logprobs
  • raw - Reasoning is prepended to content; GLM and Qwen wrap it in <think>...</think> tokens, while GPT-OSS concatenates without separators
  • hidden - Reasoning text and logprobs are dropped completely (tokens are still counted)
  • none - Uses the model’s default behavior

Default Behavior by Model

When reasoning_format is set to none or omitted, each model uses its default format:
  • Qwen3 - raw (hidden for JSON object/schema response formats)
  • GLM - parsed
  • GPT-OSS - parsed

parsed Format

Reasoning text is returned in a separate reasoning field without start/end tokens. When logprobs are enabled, reasoning logprobs are returned in a separate reasoning_logprobs field.
from cerebras.cloud.sdk import Cerebras

client = Cerebras()

response = client.chat.completions.create(
    model="zai-glm-4.7",
    messages=[
        {
            "role": "user",
            "content": "Can you help me with this?"
        }
    ],
    logprobs=True,
    reasoning_format="parsed"
)

print(response)
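The separated fields can then be read directly from the response. A minimal sketch, assuming the SDK exposes them on the message object under the field names described above:
# Reasoning and answer arrive in separate fields (field names as
# described above; verify against your SDK version)
message = response.choices[0].message
print(message.reasoning)  # reasoning text, without start/end tokens
print(message.content)    # final answer only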
When streaming, reasoning tokens are delivered in the reasoning field of the delta.
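A streaming loop might separate the two token streams like this (a sketch; the reasoning attribute on the delta follows the description above):
stream = client.chat.completions.create(
    model="zai-glm-4.7",
    messages=[{"role": "user", "content": "Can you help me with this?"}],
    reasoning_format="parsed",
    stream=True
)

for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    # Reasoning tokens arrive in delta.reasoning; answer tokens in delta.content
    if getattr(delta, "reasoning", None):
        print(delta.reasoning, end="")
    if getattr(delta, "content", None):
        print(delta.content, end="")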

raw Format

Reasoning text is included in the content field, prepended to the response. For GLM and Qwen models, reasoning is wrapped in <think>...</think> tokens. All logprobs are returned together in the standard logprobs field.
Since GPT-OSS does not use thinking tokens, reasoning and content are concatenated without separators when using the raw format.
The raw format is not compatible with json_object or json_schema response formats. Models that default to raw will automatically use hidden instead when structured output is requested.
from cerebras.cloud.sdk import Cerebras

client = Cerebras()

response = client.chat.completions.create(
    model="zai-glm-4.7",
    messages=[
        {
            "role": "user",
            "content": "Can you help me with this?"
        }
    ],
    logprobs=True,
    reasoning_format="raw"
)

print(response)
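If you need to split the reasoning from the answer yourself, you can parse the think tags out of the content. A minimal sketch for GLM and Qwen output (this assumes a single <think>...</think> block at the start of the content):
import re

content = response.choices[0].message.content
match = re.match(r"<think>(.*?)</think>(.*)", content, re.DOTALL)
if match:
    reasoning, answer = match.group(1).strip(), match.group(2).strip()
    print("Reasoning:", reasoning)
    print("Answer:", answer)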

hidden Format

Reasoning text and reasoning logprobs are dropped completely from the response. The reasoning tokens are still generated and counted toward total completion tokens.
from cerebras.cloud.sdk import Cerebras

client = Cerebras()

response = client.chat.completions.create(
    model="zai-glm-4.7",
    messages=[
        {
            "role": "user",
            "content": "Can you help me with this?"
        }
    ],
    logprobs=True,
    reasoning_format="hidden"
)

print(response)
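Because hidden reasoning tokens are still generated and billed, you can confirm their cost from the usage block (a sketch assuming the standard OpenAI-style usage fields):
# completion_tokens includes the hidden reasoning tokens
print(response.usage.completion_tokens)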

Model-Specific Parameters

Each model family has its own parameter for controlling reasoning behavior.
There are key differences between the OpenAI client and the Cerebras SDK when using non-standard OpenAI parameters. These examples use the Cerebras SDK. For more info, see Passing Non-Standard Parameters.

GPT-OSS: reasoning_effort

Use reasoning_effort to control how much reasoning the model performs:
  • "low" - Minimal reasoning, faster responses
  • "medium" - Moderate reasoning (default)
  • "high" - Extensive reasoning, more thorough analysis
response = client.chat.completions.create(
    model="gpt-oss-120b",
    messages=[{"role": "user", "content": "Explain quantum entanglement."}],
    reasoning_effort="medium"
)

GLM: disable_reasoning

Use disable_reasoning to toggle reasoning on or off. Set to true to disable reasoning, or false (default) to enable it.
response = client.chat.completions.create(
    model="zai-glm-4.7",
    messages=[{"role": "user", "content": "Explain how photosynthesis works."}],
    disable_reasoning=False  # Set to True to disable reasoning
)

Reasoning Context Retention

Reasoning tokens are not automatically retained across requests. To maintain awareness of prior reasoning in multi-turn conversations, include the reasoning text in the content field of the assistant message. Use the same format the model outputs: for GLM and Qwen, include reasoning in <think>...</think> tags; for GPT-OSS, prepend reasoning text directly before the answer.
# GPT-OSS: reasoning prepended directly before the answer
response = client.chat.completions.create(
    model="gpt-oss-120b",
    messages=[
        {"role": "user", "content": "What is 25 * 4?"},
        {"role": "assistant", "content": "I need to multiply 25 by 4. 25 * 4 = 100. The answer is 100."},
        {"role": "user", "content": "Now divide that by 2."}
    ]
)
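The equivalent for GLM or Qwen wraps the prior reasoning in think tags inside the assistant message, following the format described above:
# GLM/Qwen: prior reasoning wrapped in <think>...</think> tags
response = client.chat.completions.create(
    model="zai-glm-4.7",
    messages=[
        {"role": "user", "content": "What is 25 * 4?"},
        {"role": "assistant", "content": "<think>I need to multiply 25 by 4. 25 * 4 = 100.</think>The answer is 100."},
        {"role": "user", "content": "Now divide that by 2."}
    ]
)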