Learn how to migrate to Z.ai GLM 4.7 on the Cerebras API, including reasoning controls, streaming, and updated limits.
What’s new in GLM 4.7
GLM 4.7 introduces key improvements over 4.6:
Enhanced coding performance and agentic tool usage
Stronger reasoning capabilities
Improved role play and general chat quality
GLM 4.7 is now the top open-source model on the Artificial Analysis Intelligence Index, surpassing Kimi K2 Thinking and DeepSeek 3.2. It leads on benchmarks like tau-bench and SWE-bench. The architecture is unchanged, with just updated weights and new API features, making migration straightforward.
This guide covers how to update your API calls, parameters, and prompts for GLM 4.7.
Architecture: Built on the GLM-4.x foundation using a Mixture-of-Experts (MoE) Transformer architecture.
Efficiency: 358B total parameters, with ~32B active per forward pass via MoE routing.
Open source: Released under an MIT-style permissive license, enabling fine-tuning, self-hosting, and flexible deployment, subject to the terms in the official repository.
Data privacy: When you run GLM 4.7 on Cerebras Inference, your inputs and outputs are processed in memory and never persisted.
GLM 4.7 is a foundation model from Zhipu AI (Z.ai) built for coding and agentic workflows. It offers strong code generation, reasoning, and tool-use capabilities, along with new thinking controls (interleaved, preserved, and turn-level) that improve stability in multi-turn tasks.
GLM 4.6 was already a top-performing open model for code generation. GLM 4.7 extends that lead with substantial gains on GPQA and AIME, outperforming Claude Sonnet 4.5 on both.
Source: Artificial Analysis Intelligence Index (as of 12/30/25)
On LiveCodeBench, GLM 4.7 outperforms Anthropic and OpenAI models, trailing only Gemini 3.
Source: Artificial Analysis Intelligence Index (as of 12/30/25)
The model also improves significantly in chat, creative writing, and role-play.
Source: Z.ai — GLM 4.7
Use stream=true for incremental output. If reasoning traces are enabled and preserved, they may appear in the streaming delta.reasoning field (not delta.reasoning_content). If you use tool calling with streaming, be prepared to concatenate partial delta.tool_calls[*].function.arguments chunks.
```python
import os
from cerebras.cloud.sdk import Cerebras

client = Cerebras(api_key=os.environ.get("CEREBRAS_API_KEY"))

stream = client.chat.completions.create(
    model="zai-glm-4.7",
    messages=[{"role": "user", "content": "Write a concise migration plan."}],
    stream=True,
    max_completion_tokens=4_000,
    clear_thinking=False,
)

for chunk in stream:
    delta = chunk.choices[0].delta
    # Reasoning tokens arrive in delta.reasoning when traces are preserved
    if getattr(delta, "reasoning", None):
        print(delta.reasoning, end="")
    if getattr(delta, "content", None):
        print(delta.content, end="")
```
When migrating to GLM 4.7, a common mistake is reusing old prompts without adjusting them for the model’s preferred prompting style and reasoning/streaming behavior. To fully leverage this model’s strengths, refine prompts, tool-calling flows, and sampling parameters accordingly.
1. Front-load instructions
GLM 4.7 places heightened attention on the beginning of the prompt. To ensure consistent instruction following, place all required rules, constraints, and behavioral instructions at the beginning of the system prompt. GLM 4.7 supports long context (up to ~131k tokens on Cerebras), but instruction-following quality typically peaks at much shorter lengths and can degrade near the maximum. This is especially important when using prompting patterns that rely on “think” tags.
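A minimal sketch of this pattern, reusing the client from the streaming example above; the rule text and user task are illustrative:

```python
# Illustrative system prompt: all hard rules come first, supporting context after.
SYSTEM_PROMPT = """You MUST follow these rules:
1. Always respond in English.
2. All generated code MUST conform to the conventions in architecture.md.
3. Output MUST be valid JSON when a schema is provided.

Background context, style notes, and reference material follow below.
"""

resp = client.chat.completions.create(
    model="zai-glm-4.7",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Refactor the payment module."},
    ],
)
```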
2. Use clear and direct instructions
GLM 4.7 responds more reliably to explicit rules than to suggestive or optional language.
Use unambiguous terms such as MUST, REQUIRED, or STRICTLY.
Avoid soft phrasing such as “Please try to…” or indirect suggestions.
For example:
Do: “Before writing any code, you MUST first read and fully comprehend the architecture.md file. All code you generate must strictly conform…”
Don’t: “Please read and follow my architecture.md…”
3. Specify a default language
Because GLM 4.7 is multilingual, it may occasionally switch languages if not instructed otherwise. Explicit language control prevents this behavior. Add a directive like “Always respond in English” (or your preferred language) in your system prompt to prevent unexpected responses or reasoning traces in other languages.
4. Use role prompts intentionally
GLM 4.7 follows roles and personas closely. Assigning clear roles improves consistency and accuracy. Example: "You are a senior software architect. Review the following specifications and produce a structured design proposal." Role-based prompting also works well in multi-agent systems, with each agent having its own persona.
5. Use critic agents for validation
When building agentic systems, rather than relying on a single agent to both generate and validate code, create dedicated critics to review and validate outputs before allowing the main agentic flow to advance in its plan. These could include:
Code reviewer: A sub-agent configured to rigorously check for code quality, adherence to SOLID/DRY/YAGNI principles, and maintainability issues.
QA tester: Potentially bound with agentic browser capabilities to test user flows, edge cases, and integration points.
Security reviewer: Specialized in identifying vulnerabilities, unsafe patterns, and compliance issues.
Performance analyst: Focused on detecting performance bottlenecks, inefficient algorithms, or resource leaks.
This pattern improves reliability and aligns well with GLM 4.7’s behavior. Multi-agent frameworks like Code Puppy, Kilo/Roo Code, and others support this approach.
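A minimal sketch of a generator-plus-critic loop built on the Chat Completions API; the personas, the APPROVED convention, and the round limit are illustrative choices, not API features:

```python
# Hypothetical two-agent loop: a generator drafts code, a critic reviews it.
CRITIC_PROMPT = (
    "You are a strict code reviewer. Check the code for correctness, "
    "SOLID/DRY violations, and maintainability issues. "
    "Reply APPROVED if it passes, otherwise list the required fixes."
)

def generate_and_review(task: str, max_rounds: int = 3) -> str:
    feedback = ""
    for _ in range(max_rounds):
        draft = client.chat.completions.create(
            model="zai-glm-4.7",
            messages=[
                {"role": "system", "content": "You are a senior Python engineer."},
                {"role": "user", "content": task + feedback},
            ],
        ).choices[0].message.content

        review = client.chat.completions.create(
            model="zai-glm-4.7",
            messages=[
                {"role": "system", "content": CRITIC_PROMPT},
                {"role": "user", "content": draft},
            ],
        ).choices[0].message.content

        if review.strip().startswith("APPROVED"):
            return draft
        # Feed the critic's objections back into the next generation round
        feedback = "\n\nReviewer feedback to address:\n" + review
    return draft
```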
6. Break down tasks
Even with improved stability and thinking controls, you will generally get better reliability by breaking complex work into small, well-defined substeps (see the sketch after this list). For example:
List dependencies
Propose new structure
Generate code
Verify output
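A minimal sketch of this decomposition, where each substep is its own API call and each result feeds the next prompt; the step wording and input are illustrative:

```python
# Each substep is a separate call, so the model handles one small task at a time.
steps = [
    "List the external dependencies of the module described below.",
    "Propose a new file/module structure based on the dependency list.",
    "Generate the refactored code for the proposed structure.",
    "Verify the generated code against the original requirements and list any gaps.",
]

context = "Module description: ..."  # your actual input here
for step in steps:
    context = client.chat.completions.create(
        model="zai-glm-4.7",
        messages=[{"role": "user", "content": f"{step}\n\n{context}"}],
    ).choices[0].message.content

print(context)
```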
7. Minimize reasoning when not needed
GLM 4.7 may generate verbose reasoning blocks that are unnecessary and slow down responses. Treat reasoning as a resource: disable it for simple tasks to reduce latency, and preserve it only when it improves quality or your workflow depends on it. We recommend the following (a short example follows the list):
Disable reasoning with the nonstandard disable_reasoning: true parameter. See our Reasoning guide for more information.
This is different from the thinking parameter that Z.ai uses in their API.
Preserve reasoning traces with clear_thinking: false for agentic/coding workflows and prompt caching use cases.
Set appropriate max_completion_tokens limits. For focused responses, consider using lower values.
Use prompt-based control by adding instructions to minimize reasoning in your system prompt. For example: “Reason only when necessary” or “Skip reasoning for straightforward tasks.”
Use structured output formats (JSON, lists, bullets) that naturally discourage verbose reasoning blocks.
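For example, a simple conversion task needs no reasoning trace. A minimal sketch, assuming the Cerebras-specific disable_reasoning parameter described above:

```python
# Reasoning adds latency here without improving quality, so disable it.
resp = client.chat.completions.create(
    model="zai-glm-4.7",
    messages=[{"role": "user", "content": "Convert 72°F to Celsius. Answer only."}],
    disable_reasoning=True,      # Cerebras-specific; see the Reasoning guide
    max_completion_tokens=50,    # keep simple answers short
)
print(resp.choices[0].message.content)
```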
8. Enable enhanced reasoning for complex tasks
For tasks requiring deeper analysis (a sketch follows this list):
Ensure disable_reasoning is false or omitted.
Add reasoning directives such as:
“Think step by step.”
“Break the problem down logically.”
Include examples that demonstrate the reasoning process you want, showing the model how to work through problems methodically.
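A minimal sketch combining these directives; the prompts are illustrative:

```python
# Reasoning left enabled (disable_reasoning omitted) plus an explicit directive.
resp = client.chat.completions.create(
    model="zai-glm-4.7",
    messages=[
        {"role": "system", "content": "Think step by step and break the problem down logically."},
        {"role": "user", "content": "Design a rate limiter for a multi-tenant API."},
    ],
    max_completion_tokens=4_000,
)
```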
9. Combine GLM 4.7 with frontier models when needed
If your workload includes tasks requiring frontier-level reasoning accuracy, consider hybrid architectures:
Route simpler tasks to GLM 4.7 and use a frontier model for more complex queries.
Use GLM 4.7 as a fast agent that loops in frontier models only when needed.
Use a frontier model to create a plan, then execute it rapidly with GLM 4.7.
This approach reduces cost and latency while maintaining high accuracy where required.
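A hypothetical routing sketch: a cheap, reasoning-free classification call decides which model handles the task. call_frontier_model is a placeholder for your frontier-provider integration, not a Cerebras API:

```python
def needs_frontier(task: str) -> bool:
    # Cheap, reasoning-free classification call on GLM 4.7.
    verdict = client.chat.completions.create(
        model="zai-glm-4.7",
        messages=[{
            "role": "user",
            "content": "Answer SIMPLE or COMPLEX only. Does this task need "
                       f"deep multi-step reasoning?\n\n{task}",
        }],
        disable_reasoning=True,
        max_completion_tokens=5,
    ).choices[0].message.content
    return verdict.strip().upper().startswith("COMPLEX")

def answer(task: str) -> str:
    if needs_frontier(task):
        return call_frontier_model(task)  # placeholder: your frontier provider
    return client.chat.completions.create(
        model="zai-glm-4.7",
        messages=[{"role": "user", "content": task}],
    ).choices[0].message.content
```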
10. Tune sampling parameters
Parameter tuning has a significant impact on output quality. The recommended defaults from Z.ai and Cerebras are temperature: 1.0 and top_p: 0.95.
Like GLM 4.6, you can disable reasoning by setting disable_reasoning: true. We also support Z.ai’s “preserved thinking” behavior via clear_thinking, which controls whether reasoning content is cleared or retained across turns in multi-turn workflows (including tool-calling loops).
[Default] Exclude thinking from earlier turns: clear_thinking: true
[Recommended for coding/agentic + better cache hit rates] Preserve thinking from previous turns: clear_thinking: false
```python
resp = client.chat.completions.create(
    model="zai-glm-4.7",
    messages=[{"role": "user", "content": "Help me refactor this function."}],
    temperature=1,
    top_p=0.95,
    disable_reasoning=False,
    clear_thinking=False,
)
```
What is clear_thinking?
Starting with GLM 4.5, Z.ai introduced support for Interleaved Thinking, allowing the model to think between tool calls and after receiving tool results. GLM 4.7 further enhances Interleaved Thinking and introduces Preserved Thinking and Turn-level Thinking.
| Feature | GLM-4.5 | GLM-4.6 | GLM-4.7 |
| --- | --- | --- | --- |
| Interleaved Thinking | ✅ Introduced | ✅ Supported | ✅ Enhanced |
| Preserved Thinking | ❌ | ❌ | ✅ New |
| Turn-level Thinking | ❌ | ❌ | ✅ New |
Preserved Thinking (clear_thinking: false): retain reasoning across turns for multi-step coding/agentic workflows
Note: Setting clear_thinking: false can improve cache hit rate in agent loops
What is Preserved Thinking?
Preserved Thinking is the ability to maintain a model’s reasoning context across multiple API calls, particularly during multi-step tool-calling workflows. Without it, when you send tool results back to the model, it may need to re-derive its approach from scratch, which can introduce inconsistencies. Enable preserved thinking with zai-glm-4.7 by setting clear_thinking: false (it’s true by default). This is becoming a common pattern for production agents across providers, though each implements it differently (for example: encrypted “thought tokens”, server-side state, or stateless encrypted blobs).
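A minimal multi-turn sketch, assuming the assistant message returned by the SDK (including any reasoning it carries) can be appended back to the conversation verbatim:

```python
messages = [{"role": "system", "content": "You are a database migration planner."}]

for followup in ["Draft the migration plan.", "Now add a rollback strategy."]:
    messages.append({"role": "user", "content": followup})
    resp = client.chat.completions.create(
        model="zai-glm-4.7",
        messages=messages,
        clear_thinking=False,  # retain reasoning across turns
    )
    # Append the assistant turn back so earlier reasoning is preserved
    # (assumes the SDK message object supports model_dump(), as in Stainless-style SDKs).
    messages.append(resp.choices[0].message.model_dump())
```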
Why does GLM-4.7 matter?
GLM-4.7 is a top-tier open model that targets state-of-the-art performance on agentic and coding applications in real workloads. It offers high coding precision, strong tool use, and very high generation speed—while keeping open weights.
Why is GLM-4.7 a strong coding model?
GLM-4.7 performs well across benchmark tasks and real-world coding flows (code generation, editing, and tool-based agent loops), while producing readable, human-like output.
What are its best use cases?
Live coding assistants
Debugging and refactoring agents
Chat + RAG workflows
Tool-using agents (when you provide tool schemas)
What’s the API model ID?
Use zai-glm-4.7 with the Cerebras Chat Completions API.
What parameters should I use?
Recommended defaults:
temperature: 1.0
top_p: 0.95
clear_thinking: false for coding/agentic workflows (and improved cache hit rates)
If verbosity is an issue, set disable_reasoning: true and/or reduce max_completion_tokens.
What’s the context window size?
Cerebras supports up to 131k-token context (131,072 tokens) per request.
How does our tool streaming work?
We don’t support tool_stream=true. We do support stream=true. For tool calls, our streaming behavior is:
Stream reasoning and/or text token-by-token (as available)
Stream tool call payloads as a single chunk (same limitation as other models)
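A defensive accumulation sketch that works whether function arguments arrive as one chunk (current behavior) or as partial fragments; it assumes stream was created with stream=True and a tools list:

```python
# Accumulate streamed tool calls keyed by index.
tool_calls: dict[int, dict] = {}

for chunk in stream:
    delta = chunk.choices[0].delta
    for tc in getattr(delta, "tool_calls", None) or []:
        entry = tool_calls.setdefault(tc.index, {"name": "", "arguments": ""})
        if tc.function and tc.function.name:
            entry["name"] = tc.function.name
        if tc.function and tc.function.arguments:
            entry["arguments"] += tc.function.arguments

# After the stream ends, each entry["arguments"] is a complete JSON string.
```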
Can I cache prompts?
Yes. Prompt caching is supported for enterprise users. Contact your Cerebras Solutions Architect to enable it on your workspace. Learn more: Prompt Caching
Does it use tools?
Yes. GLM-4.7 supports tool calling via the standard tools=[...] schema. You define the tools and arguments schema; the model decides when to call them. Learn more: Tool Calling
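A minimal sketch using the standard schema; the get_weather tool is illustrative:

```python
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # illustrative tool
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="zai-glm-4.7",
    messages=[{"role": "user", "content": "What's the weather in Toronto?"}],
    tools=tools,
)

# If the model chose to call the tool, inspect its arguments.
if resp.choices[0].message.tool_calls:
    print(resp.choices[0].message.tool_calls[0].function.arguments)
```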
How does GLM 4.7 perform on 3rd party evaluations?
On many development tasks, GLM-4.7 can be comparable to frontier models, while often being significantly faster. On the most complex reasoning-heavy code tasks, developers may still prefer the strongest frontier models.