Dedicated Endpoints

A dedicated endpoint is a private, provisioned instance of the Cerebras Inference service reserved exclusively for your organization. Your traffic runs on reserved capacity, so latency and throughput are not affected by other users. Dedicated endpoints are intended for production workloads that require predictable performance, such as real-time applications, customer-facing products, and high-volume pipelines that need guaranteed capacity. See the Supported Models section for the available models.

Key Benefits

Dedicated capacity

Your endpoint runs on reserved capacity that is not shared with other customers, so your performance is never impacted by other workloads.

Consistent latency and throughput

Performance is reserved and predictable, even under load.

Bring your own weights

Deploy your custom fine-tuned models alongside standard model variants.

Performance customization

Tailor your endpoint to match the performance and scale requirements of your workload through bespoke draft models, model configurations, and quantization strategies.

Exclusive access to advanced features

All capabilities available on shared endpoints are available on dedicated endpoints. In addition, dedicated customers get access to advanced features including fine-tuning, weight management, and enhanced service tier controls.

To get started with a dedicated endpoint, contact us.

Supported Models

Dedicated endpoints support a broad range of model families, including multiple versions, parameter sizes, and weight variations (e.g., -instruct and -thinking) as well as your own custom weights. We can also work with you to tune your endpoint configuration to meet your specific performance goals.

Alibaba Qwen — Qwen3, Qwen3-Coder

Qwen3-235B-A22B

Qwen3-32B

Qwen/Qwen3-32B

Qwen3-30B-A3B

Small & Tiny Variants

Qwen3-Coder

OpenAI (OSS) — GPT-OSS

MiniMax — MiniMax M2.X

Google — Gemma 4

google/gemma-4-31b-it

Meta — Llama 3, Llama 4

meta-llama/Llama-4-Maverick-17B-128E-Instruct (402B total)
meta-llama/Llama-4-Scout-17B-16E-Instruct (109B total)
meta-llama/Llama-3.3-70B-Instruct

Mistral — Mistral Small, Mistral Large 3, Devstral 2, Mixtral

Z.AI — GLM 4.X, GLM 5.X

Moonshot AI — Kimi K2.X

DeepSeek — DeepSeek V3.X

StepFun — Step 3.X Flash

ByteDance — OSS Seed

ByteDance-Seed/Seed-OSS-36B-Instruct

ServiceNow — Apriel

ServiceNow-AI/Apriel-1.6-15b-Thinker

Coming soon: multimodal

Qwen3-VL
Kimi K2.6
Pixtral Large

Features

Dedicated endpoints include all shared endpoint capabilities, plus:

Fine-tuning — Deploy custom model weights on your dedicated endpoint.
Management API — Programmatically manage models, capacity, and endpoints.
Batch API — Run large-scale asynchronous workloads against your reserved capacity.
Service tiers — Configure request prioritization to match your SLA requirements.
Metrics — Monitor your endpoint with Prometheus-compatible metrics for requests, tokens, latency, and health.
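
To make the Metrics item concrete, here is a minimal sketch of reading Prometheus-style text output and deriving a simple rate from it. It assumes the endpoint exposes the standard Prometheus text exposition format. The metric names, label names, and sample values below are hypothetical placeholders for illustration, not the names the service actually emits, and the parser is a simplification that ignores histogram semantics.

```python
import re

# Hypothetical scrape output. Metric and label names are illustrative only;
# check your endpoint's actual metrics for the real names.
SAMPLE = """\
# HELP inference_requests_total Total requests handled.
# TYPE inference_requests_total counter
inference_requests_total{model="example-model",status="200"} 1520
inference_requests_total{model="example-model",status="429"} 12
# HELP inference_tokens_total Total tokens generated.
# TYPE inference_tokens_total counter
inference_tokens_total{model="example-model"} 984211
# HELP endpoint_healthy Whether the endpoint is healthy (1) or not (0).
# TYPE endpoint_healthy gauge
endpoint_healthy 1
"""

# name, optional {labels}, value, optional integer timestamp
LINE_RE = re.compile(
    r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$'
)
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip lines this simplified parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def total(samples, name, **label_filter):
    """Sum all samples of a metric whose labels match the given filter."""
    return sum(
        value
        for sample_name, labels, value in samples
        if sample_name == name
        and all(labels.get(k) == v for k, v in label_filter.items())
    )


samples = parse_metrics(SAMPLE)
all_requests = total(samples, "inference_requests_total")
throttled = total(samples, "inference_requests_total", status="429")
print(f"throttled share: {throttled / all_requests:.4f}")
```

In practice a Prometheus server or compatible agent would scrape and store these values for you; a hand-rolled parser like this is mainly useful for quick checks or tests.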

Get Started

Dedicated endpoints are available to enterprise customers. Contact us to discuss your requirements.
