This feature is in Private Preview. For access or more information, contact us or reach out to your account representative.
Prioritize requests on the Cerebras Inference API to balance latency sensitivity and resource allocation across your workloads.

Service Tiers

Service tiers determine the processing priority of your requests. You can specify a tier using the service_tier parameter in your API requests.
Tier | Description
---- | -----------
priority¹ | Highest priority; requests are processed first. Use for time-critical, user-facing requests that require immediate processing. Only available for dedicated endpoints, not shared endpoints.
default | Standard priority processing. Use for standard production workloads with normal latency requirements.
auto | Automatically uses the highest available service tier. Use when you want to maximize requests served while allowing flexibility in processing priority.
flex | Lowest priority; requests are processed last. Use for overflow requests that exceed higher-tier rate limits, or for experiments.
If no service_tier is specified, requests are processed on the default tier.
¹ The priority tier requires a dedicated endpoint. If interested, contact your account representative for more information.

Usage

Add the service_tier parameter to your chat completions request to specify the priority level.
from cerebras.cloud.sdk import Cerebras
import os

client = Cerebras(api_key=os.environ.get("CEREBRAS_API_KEY"))

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "system", "content": "Best pastries in San Francisco?"}],
    stream=True,
    max_tokens=20000,
    temperature=0.7,
    top_p=0.8,
    service_tier="auto"
)
When using auto, the response includes a service_tier_used field indicating which tier actually processed the request.
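For a non-streaming request, the effective tier can be read off the completed response. A minimal sketch, reusing the client from above and assuming the field is exposed as a top-level attribute (where it appears on streamed chunks may differ):

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Best pastries in San Francisco?"}],
    service_tier="auto",
)

# service_tier_used reports the tier that actually processed the request,
# e.g. "default" or "flex".
print(response.service_tier_used)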

Queue Threshold Control

Only applies to requests using the flex or auto service tiers.
The queue_threshold header sets the maximum acceptable queue time, in milliseconds, for requests processed on the flex tier. If the expected queue time exceeds your threshold, the request is rejected immediately rather than waiting in the queue. Valid values range from 50 to 20000 milliseconds.
from cerebras.cloud.sdk import Cerebras
import os

client = Cerebras(api_key=os.environ.get("CEREBRAS_API_KEY"))

response = client.chat.completions.create(
    model="llama-3.3-70b",
    service_tier="flex",
    messages=[{"role": "user", "content": "What are the latest AI trends?"}],
    extra_headers={"queue_threshold": "100"}
)
If no threshold is specified, a system default is used.
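A common pattern is to try flex with a tight threshold and fall back to the default tier if the request is rejected. The sketch below reuses the client from above and assumes the SDK surfaces the rejection as an APIStatusError; the exact exception type is an assumption, so confirm it against the SDK's error reference:

from cerebras.cloud.sdk import APIStatusError

def ask(prompt: str):
    try:
        # Cheap path first: flex, but only if processing starts within ~100 ms.
        return client.chat.completions.create(
            model="llama-3.3-70b",
            messages=[{"role": "user", "content": prompt}],
            service_tier="flex",
            extra_headers={"queue_threshold": "100"},
        )
    except APIStatusError:
        # The expected queue time exceeded the threshold, so the request was
        # rejected up front; retry immediately on the default tier instead.
        return client.chat.completions.create(
            model="llama-3.3-70b",
            messages=[{"role": "user", "content": prompt}],
            service_tier="default",
        )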

FAQ

How do rate limits differ across service tiers?
priority and default rate limits are the same, while flex rate limits are tracked independently and are several multiples of the default rate limits.

Can I monitor my usage by service tier?
Yes. Log in to cloud.cerebras.ai and click Analytics. Graphs in the Analytics tab display usage across the different service tiers, allowing you to monitor consumption by priority level.

Are service tiers billed differently?
No. During the preview launch, all service tiers are billed equally.

Can a request be processed on a lower tier than the one I specified?
No. Only requests set to auto can be processed on a lower service tier.

Does queue_threshold apply to auto requests?
The queue time threshold only applies once a request is being processed on the flex service tier. You can set it on requests using auto or flex, but it will only be evaluated if the request is processed on the flex tier.
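For instance, you can attach a threshold to an auto request up front; it is simply ignored unless the request is routed to flex. A minimal sketch, reusing the client from the examples above:

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "What are the latest AI trends?"}],
    service_tier="auto",
    # Only evaluated if this request ends up on the flex tier.
    extra_headers={"queue_threshold": "500"},
)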