This feature is in Private Preview. For access or more information, contact us or reach out to your account representative.
Prioritize requests on the Cerebras Inference API to balance latency sensitivity and resource allocation across your workloads.

Service Tiers

Service tiers determine the processing priority of your requests. You can specify a tier using the service_tier parameter in your API requests.
Tier | Description
---- | -----------
priority¹ | Highest priority; requests are processed first. Use for time-critical, user-facing requests that require immediate processing. Only available for dedicated endpoints, not shared endpoints.
default | Standard priority processing. Use for standard production workloads with normal latency requirements.
auto | Automatically uses the highest available service tier. Use when you want to maximize requests served while allowing flexibility in processing priority.
flex | Lowest priority; requests are processed last. Use for overflow requests that exceed higher-tier rate limits, or for experiments.
If no service_tier is specified, requests are processed on the default tier.
¹ The priority tier requires a dedicated endpoint. If interested, contact your account representative for more information.

Usage

Add the service_tier parameter to your chat completions request to specify the priority level.
from cerebras.cloud.sdk import Cerebras
import os

client = Cerebras(api_key=os.environ.get("CEREBRAS_API_KEY"))

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "system", "content": "Best pastries in San Francisco?"}],
    stream=True,
    max_tokens=20000,
    temperature=0.7,
    top_p=0.8,
    service_tier="auto"
)
When using auto, the response includes a service_tier_used field indicating which tier actually processed the request.
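For a non-streaming request, the effective tier can be read off the completed response. A minimal sketch, reusing the client from above and assuming the field is exposed as a top-level attribute (where it appears on streamed chunks may differ):

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Best pastries in San Francisco?"}],
    service_tier="auto",
)

# service_tier_used reports the tier that actually processed the request,
# e.g. "default" or "flex".
print(response.service_tier_used)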

Queue Threshold Control

Only applies to requests using the flex or auto service tiers.
The queue_threshold header sets the maximum acceptable queue time, in milliseconds, for requests processed on the flex tier. If the expected queue time exceeds your threshold, the request is rejected immediately rather than waiting in the queue. Valid values range from 50 to 20000 milliseconds.
from cerebras.cloud.sdk import Cerebras
import os

client = Cerebras(api_key=os.environ.get("CEREBRAS_API_KEY"))

response = client.chat.completions.create(
    model="llama-3.3-70b",
    service_tier="flex",
    messages=[{"role": "user", "content": "What are the latest AI trends?"}],
    extra_headers={"queue_threshold": "100"}
)
If no threshold is specified, a system default is used.
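A common pattern is to try flex with a tight threshold and fall back to the default tier if the request is rejected. The sketch below reuses the client from above and assumes the SDK surfaces the rejection as an APIStatusError; the exact exception type is an assumption, so confirm it against the SDK's error reference:

from cerebras.cloud.sdk import APIStatusError

def ask(prompt: str):
    try:
        # Cheap path first: flex, but only if processing starts within ~100 ms.
        return client.chat.completions.create(
            model="llama-3.3-70b",
            messages=[{"role": "user", "content": prompt}],
            service_tier="flex",
            extra_headers={"queue_threshold": "100"},
        )
    except APIStatusError:
        # The expected queue time exceeded the threshold, so the request was
        # rejected up front; retry immediately on the default tier instead.
        return client.chat.completions.create(
            model="llama-3.3-70b",
            messages=[{"role": "user", "content": prompt}],
            service_tier="default",
        )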

FAQ

How do rate limits differ across service tiers?
priority and default rate limits are the same, while flex rate limits are tracked independently and are several multiples of the default rate limits.

Can I monitor my usage by service tier?
Yes. Log in to cloud.cerebras.ai and click Analytics. Graphs in the Analytics tab display usage across the different service tiers, allowing you to monitor consumption by priority level.

Are service tiers billed differently?
No. During the preview launch, all service tiers are billed equally.

Can a request be processed on a lower tier than the one I specified?
No. Only requests set to auto can be processed on a lower service tier.

Does queue_threshold apply to auto requests?
The queue time threshold only applies once a request is being processed on the flex service tier. You can set it on requests using auto or flex, but it will only be evaluated if the request is processed on the flex tier.
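For instance, you can attach a threshold to an auto request up front; it is simply ignored unless the request is routed to flex. A minimal sketch, reusing the client from the examples above:

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "What are the latest AI trends?"}],
    service_tier="auto",
    # Only evaluated if this request ends up on the flex tier.
    extra_headers={"queue_threshold": "500"},
)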