Qwen 3 235B Thinking is scheduled for deprecation on November 14, 2025

Production Models

Production models are fully supported offerings intended for use in production environments.
| Model Name | Model ID | Parameters | Speed (tokens/s) |
|---|---|---|---|
| Llama 3.1 8B | llama3.1-8b | 8 billion | ~2200 |
| Llama 3.3 70B | llama-3.3-70b | 70 billion | ~2100 |
| OpenAI GPT OSS | gpt-oss-120b | 120 billion | ~3000 |
| Qwen 3 32B | qwen-3-32b | 32 billion | ~2600 |
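A model ID from the table above is what you pass in the `model` field of a chat completion request. The sketch below builds such a request; the endpoint URL and bearer-token header are assumptions based on the OpenAI-compatible convention, so check the API reference for the exact details.

```python
# Minimal sketch of calling a production model. The endpoint URL and
# auth header are ASSUMPTIONS based on the OpenAI-compatible convention.
import json
import os
import urllib.request

API_URL = "https://api.cerebras.ai/v1/chat/completions"  # assumed endpoint

def build_chat_request(model: str, prompt: str) -> dict:
    """Assemble an OpenAI-style chat completion payload for a model ID
    taken from the production table above."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

payload = build_chat_request("llama3.1-8b", "Say hello in one word.")

# Sending the request needs an API key, so it is guarded; the sketch
# itself runs without one.
if os.environ.get("CEREBRAS_API_KEY"):
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['CEREBRAS_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
```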

Preview Models

Preview models are hosted on Cerebras with full accuracy and performance. Please note that preview models are intended for evaluation purposes only and should not be used in production, as they may be discontinued on short notice.
| Model Name | Model ID | Parameters | Speed (tokens/s) |
|---|---|---|---|
| Qwen 3 235B Instruct | qwen-3-235b-a22b-instruct-2507 | 235 billion | ~1400 |
| Qwen 3 235B Thinking | qwen-3-235b-a22b-thinking-2507 | 235 billion | ~1700 |
| Z.ai GLM 4.6 | zai-glm-4.6 | 357 billion | ~1000 |
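Because preview models can be discontinued on short notice, callers may want a client-side fallback to a production model. The sketch below is illustrative, not an official mechanism; the one deprecation date comes from the notice at the top of this page, and the helper name is hypothetical.

```python
# Hedged sketch of falling back from a preview model to a production one.
# The deprecation date is from the notice at the top of this page;
# choose_model is a hypothetical helper, not part of any SDK.
from datetime import date

# Preview model IDs with announced deprecation dates.
DEPRECATIONS = {
    "qwen-3-235b-a22b-thinking-2507": date(2025, 11, 14),
}

def choose_model(preferred: str, fallback: str, today: date) -> str:
    """Return `preferred` unless it is deprecated on or before `today`."""
    cutoff = DEPRECATIONS.get(preferred)
    if cutoff is not None and today >= cutoff:
        return fallback
    return preferred

# Before the deprecation date the preview model is still used...
print(choose_model("qwen-3-235b-a22b-thinking-2507", "qwen-3-32b", date(2025, 11, 1)))
# ...and afterwards traffic shifts to the production fallback.
print(choose_model("qwen-3-235b-a22b-thinking-2507", "qwen-3-32b", date(2025, 11, 15)))
```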

Model Compression

We host a variety of open-source models from the community; the links below point to the exact architectures and weights that we serve, and this section provides transparency about the compression state of each model on our platform.

All models served through our public endpoints are the original, unpruned versions. While we conduct research on pruning techniques such as REAP (Router-weighted Expert Activation Pruning), those pruned models are shared with the research community on Hugging Face and are not available through our shared API. You can read more about REAP in our research blog.

The table below shows the precision of each model available on our platform. All models listed are unpruned.
| Model Name | Precision | Hugging Face Link |
|---|---|---|
| llama3.1-8b | FP16 | View → |
| llama-3.3-70b | FP16 | View → |
| gpt-oss-120b | FP8 (weight only) | View → |
| qwen-3-32b | FP16 | View → |
| qwen-3-235b-a22b-instruct-2507 | FP8 (weight only) | View → |
| qwen-3-235b-a22b-thinking-2507 | FP8 (weight only) | View → |
| zai-glm-4.6 | FP8 (weight only) | View → |
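The precision column allows a back-of-envelope estimate of weight storage: roughly 2 bytes per parameter at FP16 and 1 byte at FP8. The sketch below applies that rule of thumb; the bytes-per-parameter figures are the standard sizes for those formats, and since the FP8 models are weight-only quantized, the estimate covers weights only, not activations or KV cache.

```python
# Back-of-envelope weight storage from the precision table above.
# ASSUMPTION: 2 bytes/parameter for FP16, 1 byte/parameter for FP8
# (standard format sizes); weight-only, so activations are excluded.
BYTES_PER_PARAM = {"FP16": 2, "FP8": 1}

def weight_gb(params_billions: float, precision: str) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[precision]

print(weight_gb(8, "FP16"))   # llama3.1-8b: 8B params at 2 bytes each
print(weight_gb(120, "FP8"))  # gpt-oss-120b: 120B params at 1 byte each
```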

Frequently Asked Questions

Will the models on existing endpoints ever be pruned or otherwise modified?

No. We are committed to serving the original models on all existing endpoints without modification. We do not alter model architectures or compression settings. If we explore additional compression techniques such as pruning in the future, they would be offered as separate endpoints with pruning-specific names, ensuring complete transparency and allowing you to choose the version that best fits your needs.
Where can I find the REAP pruned models?

Our REAP pruned models are available on Hugging Face for research and experimentation: Cerebras REAP Collection. These models demonstrate our pruning research but are not served through our production API.