Qwen 3 235B Thinking is scheduled for deprecation on November 14, 2025

Production Models

Production models are fully supported offerings intended for use in production environments.
| Model Name | Model ID | Parameters | Speed (tokens/s) |
|---|---|---|---|
| Llama 3.1 8B | llama3.1-8b | 8 billion | ~2200 |
| Llama 3.3 70B | llama-3.3-70b | 70 billion | ~2100 |
| OpenAI GPT OSS | gpt-oss-120b | 120 billion | ~3000 |
| Qwen 3 32B | qwen-3-32b | 32 billion | ~2600 |
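A model ID from the table above is what you pass in the `model` field of a chat completion request. The sketch below builds such a request; the endpoint URL and bearer-token header are assumptions based on the OpenAI-compatible convention, so check the API reference for the exact details.

```python
# Minimal sketch of calling a production model. The endpoint URL and
# auth header are ASSUMPTIONS based on the OpenAI-compatible convention.
import json
import os
import urllib.request

API_URL = "https://api.cerebras.ai/v1/chat/completions"  # assumed endpoint

def build_chat_request(model: str, prompt: str) -> dict:
    """Assemble an OpenAI-style chat completion payload for a model ID
    taken from the production table above."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

payload = build_chat_request("llama3.1-8b", "Say hello in one word.")

# Sending the request needs an API key, so it is guarded; the sketch
# itself runs without one.
if os.environ.get("CEREBRAS_API_KEY"):
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['CEREBRAS_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
```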

Preview Models

Preview models are hosted on Cerebras with full accuracy and performance. Please note that preview models are intended for evaluation purposes only and should not be used in production, as they may be discontinued on short notice.
| Model Name | Model ID | Parameters | Speed (tokens/s) |
|---|---|---|---|
| Qwen 3 235B Instruct | qwen-3-235b-a22b-instruct-2507 | 235 billion | ~1400 |
| Qwen 3 235B Thinking | qwen-3-235b-a22b-thinking-2507 | 235 billion | ~1700 |
| Z.ai GLM 4.6 | zai-glm-4.6 | 357 billion | ~1000 |
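Because preview models can be discontinued on short notice, callers may want a client-side fallback to a production model. The sketch below is illustrative, not an official mechanism; the one deprecation date comes from the notice at the top of this page, and the helper name is hypothetical.

```python
# Hedged sketch of falling back from a preview model to a production one.
# The deprecation date is from the notice at the top of this page;
# choose_model is a hypothetical helper, not part of any SDK.
from datetime import date

# Preview model IDs with announced deprecation dates.
DEPRECATIONS = {
    "qwen-3-235b-a22b-thinking-2507": date(2025, 11, 14),
}

def choose_model(preferred: str, fallback: str, today: date) -> str:
    """Return `preferred` unless it is deprecated on or before `today`."""
    cutoff = DEPRECATIONS.get(preferred)
    if cutoff is not None and today >= cutoff:
        return fallback
    return preferred

# Before the deprecation date the preview model is still used...
print(choose_model("qwen-3-235b-a22b-thinking-2507", "qwen-3-32b", date(2025, 11, 1)))
# ...and afterwards traffic shifts to the production fallback.
print(choose_model("qwen-3-235b-a22b-thinking-2507", "qwen-3-32b", date(2025, 11, 15)))
```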

Model Compression

We host a variety of open-source models from the community; the links below point to the exact architectures and weights that we serve, and this section provides transparency about the compression state of each model on our platform.

All models served through our public endpoints are the original, unpruned versions. While we conduct research on pruning techniques such as REAP (Router-weighted Expert Activation Pruning), those pruned models are shared with the research community on Hugging Face and are not available through our shared API. You can read more about REAP in our research blog.

The table below shows the precision of each model available on our platform. All models listed are unpruned.
| Model Name | Precision | Hugging Face Link |
|---|---|---|
| llama3.1-8b | FP16 | View → |
| llama-3.3-70b | FP16 | View → |
| gpt-oss-120b | FP8 (weight only) | View → |
| qwen-3-32b | FP16 | View → |
| qwen-3-235b-a22b-instruct-2507 | FP8 (weight only) | View → |
| qwen-3-235b-a22b-thinking-2507 | FP8 (weight only) | View → |
| zai-glm-4.6 | FP8 (weight only) | View → |
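The precision column allows a back-of-envelope estimate of weight storage: roughly 2 bytes per parameter at FP16 and 1 byte at FP8. The sketch below applies that rule of thumb; the bytes-per-parameter figures are the standard sizes for those formats, and since the FP8 models are weight-only quantized, the estimate covers weights only, not activations or KV cache.

```python
# Back-of-envelope weight storage from the precision table above.
# ASSUMPTION: 2 bytes/parameter for FP16, 1 byte/parameter for FP8
# (standard format sizes); weight-only, so activations are excluded.
BYTES_PER_PARAM = {"FP16": 2, "FP8": 1}

def weight_gb(params_billions: float, precision: str) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[precision]

print(weight_gb(8, "FP16"))   # llama3.1-8b: 8B params at 2 bytes each
print(weight_gb(120, "FP8"))  # gpt-oss-120b: 120B params at 1 byte each
```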

Frequently Asked Questions

Will the models on existing endpoints ever be pruned or otherwise modified?

No. We are committed to serving the original models on all existing endpoints without modification. We do not alter model architectures or compression settings. If we explore additional compression techniques such as pruning in the future, they would be offered as separate endpoints with pruning-specific names, ensuring complete transparency and allowing you to choose the version that best fits your needs.
Where can I find the REAP pruned models?

Our REAP pruned models are available on Hugging Face for research and experimentation: Cerebras REAP Collection. These models demonstrate our pruning research but are not served through our production API.