# Service Tiers

> Control request prioritization with service tiers.

<Callout icon="lock" color="#b2b1b1ff" iconType="regular">
  This feature is in [Private Preview](/support/preview-releases). For access or more information, [contact us](https://www.cerebras.ai/contact) or reach out to your account representative.
</Callout>

Prioritize requests on the Cerebras Inference API to balance latency sensitivity and resource allocation across your workloads.

## Service Tiers

Service tiers determine the processing priority of your requests. You can specify a tier using the [`service_tier`](/api-reference/chat-completions#param-service-tier) parameter in your API requests.

| Tier       | Description                                                                                                                                                                                   |
| ---------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `priority` | Highest priority - requests are processed first. Use for time-critical, user-facing requests that require immediate processing. Only available for dedicated endpoints, not shared endpoints. |
| `default`  | Standard priority processing. Use for standard production workloads with normal latency requirements.                                                                                         |
| `auto`     | Automatically uses the highest available service tier. Use when you want to maximize requests served while allowing flexibility in processing priority.                                       |
| `flex`     | Lowest priority - requests are processed after those in higher tiers. Use for overflow requests that exceed the rate limits of higher tiers, or for experiments.                              |

When no `service_tier` is specified, requests default to the `default` tier.
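As a rough illustration of the table above, the sketch below maps a workload label to a tier and adds `service_tier` to a request only when a tier is chosen. The names `choose_tier` and `with_tier`, the workload labels, and the `dedicated_endpoint` flag are hypothetical and not part of the Cerebras API; only the four tier names come from this page. Falling back from `priority` to `default` on a shared endpoint is a design choice made for this example, not documented behavior.

```python
from typing import Optional

# Tier names documented on this page.
VALID_TIERS = {"priority", "default", "auto", "flex"}

# Hypothetical workload labels (an assumption for illustration, not part of the API).
_WORKLOAD_TO_TIER = {
    "interactive": "priority",  # time-critical, user-facing
    "production": "default",    # normal latency requirements
    "best_effort": "auto",      # maximize requests served
    "batch": "flex",            # overflow or experiments
}


def choose_tier(workload: str, dedicated_endpoint: bool = False) -> Optional[str]:
    """Return a service_tier value, or None to omit the parameter.

    Omitting service_tier means the request uses the `default` tier.
    """
    tier = _WORKLOAD_TO_TIER.get(workload)
    if tier is None:
        return None  # unknown label: leave the parameter out
    if tier == "priority" and not dedicated_endpoint:
        # `priority` is only available on dedicated endpoints.
        return "default"
    return tier


def with_tier(params: dict, tier: Optional[str]) -> dict:
    """Copy request params, adding service_tier only when a tier is given."""
    if tier is not None and tier not in VALID_TIERS:
        raise ValueError(f"unknown service tier: {tier!r}")
    return dict(params) if tier is None else {**params, "service_tier": tier}
```

The resulting dictionary could then be unpacked into a chat completions call, for example `client.chat.completions.create(**with_tier(params, choose_tier("batch")))`.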

## Usage

Add the `service_tier` parameter to your chat completions request to specify the priority level.

<CodeGroup>
  ```python Python highlight={13} theme={null}
  from cerebras.cloud.sdk import Cerebras
  import os

  client = Cerebras(api_key=os.environ.get("CEREBRAS_API_KEY"))

  response = client.chat.completions.create(
      model="gpt-oss-120b",
      messages=[{"role": "user", "content": "Best pastries in San Francisco?"}],
      stream=True,
      max_tokens=20000,
      temperature=0.7,
      top_p=0.8,
      service_tier="auto"
  )
  ```

  ```javascript Node.js highlight={14} theme={null}
  import Cerebras from '@cerebras/cerebras_cloud_sdk';

  const client = new Cerebras({
    apiKey: process.env['CEREBRAS_API_KEY'],
  });

  const response = await client.chat.completions.create({
    model: "gpt-oss-120b",
    messages: [{ role: "user", content: "Best pastries in San Francisco?" }],
    stream: true,
    max_tokens: 20000,
    temperature: 0.7,
    top_p: 0.8,
    service_tier: "auto"
  });
  ```

  ```bash cURL highlight={10} theme={null}
  curl --location 'https://api.cerebras.ai/v1/chat/completions' \
  --header 'Content-Type: application/json' \
  --header "Authorization: Bearer ${CEREBRAS_API_KEY}" \
  --data '{
    "model": "gpt-oss-120b",
    "stream": true,
    "max_tokens": 20000,
    "temperature": 0.7,
    "top_p": 0.8,
    "service_tier": "auto",
    "messages": [
      {
        "role": "user",
        "content": "Best pastries in San Francisco?"
      }
    ]
  }'
  ```
</CodeGroup>

When you set `service_tier` to `auto`, the response includes a `service_tier_used` field that reports the tier that actually processed the request.
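A minimal sketch of reading `service_tier_used` from a response follows. The helper name `effective_tier` is hypothetical. It assumes a non-streaming response that is either a parsed JSON dictionary with `service_tier_used` at the top level, or an SDK object exposing the field as an attribute of the same name; neither detail is confirmed on this page.

```python
from typing import Any, Optional


def effective_tier(response: Any) -> Optional[str]:
    """Return the tier that processed the request, or None if not reported.

    Assumption: the field appears at the top level of the response, either
    as a dict key (raw JSON) or as an attribute (SDK response object).
    """
    if isinstance(response, dict):
        return response.get("service_tier_used")
    return getattr(response, "service_tier_used", None)
```

For example, logging `effective_tier(response)` alongside each `auto` request gives a record of which requests were processed on a lower tier.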

## Queue Threshold Control

<Note>Only applies to requests using the `flex` or `auto` service tiers.</Note>

The [`queue_threshold`](/api-reference/chat-completions#param-queue-threshold) header sets the maximum acceptable queue time, in milliseconds, for requests processed on the `flex` tier. If the expected queue time exceeds your threshold, the request is rejected immediately instead of waiting in the queue.

**Valid range:** 50-20000 milliseconds

<CodeGroup>
  ```python Python highlight={10} theme={null}
  from cerebras.cloud.sdk import Cerebras
  import os

  client = Cerebras(api_key=os.environ.get("CEREBRAS_API_KEY"))

  response = client.chat.completions.create(
      model="gpt-oss-120b",
      service_tier="flex",
      messages=[{"role": "user", "content": "What are the latest AI trends?"}],
      extra_headers={"queue_threshold": "100"}
  )
  ```

  ```javascript Node.js highlight={12} theme={null}
  import Cerebras from '@cerebras/cerebras_cloud_sdk';

  const client = new Cerebras({
    apiKey: process.env['CEREBRAS_API_KEY'],
  });

  const response = await client.chat.completions.create({
    model: "gpt-oss-120b",
    service_tier: "flex",
    messages: [{ role: "user", content: "What are the latest AI trends?" }]
  }, {
    headers: { "queue_threshold": "100" }
  });
  ```

  ```bash cURL highlight={4} theme={null}
  curl https://api.cerebras.ai/v1/chat/completions \
    -H "Authorization: Bearer $CEREBRAS_API_KEY" \
    -H "Content-Type: application/json" \
    -H "queue_threshold: 100" \
    -d '{
      "model": "gpt-oss-120b",
      "service_tier": "flex",
      "messages": [
        {
          "role": "user",
          "content": "What are the latest AI trends?"
        }
      ]
    }'
  ```
</CodeGroup>

If no threshold is specified, a system default is used.
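To catch out-of-range values before a request is sent, a small client-side check can build the header. This is a sketch only: the helper name `queue_threshold_headers` is hypothetical, and the check simply mirrors the valid range stated above rather than any validation the API is known to perform.

```python
from typing import Dict, Optional

# Valid range stated on this page, in milliseconds.
MIN_QUEUE_THRESHOLD_MS = 50
MAX_QUEUE_THRESHOLD_MS = 20000


def queue_threshold_headers(threshold_ms: Optional[int] = None) -> Dict[str, str]:
    """Build the queue_threshold header, or no header to use the system default."""
    if threshold_ms is None:
        return {}  # no header: the system default threshold applies
    if not MIN_QUEUE_THRESHOLD_MS <= threshold_ms <= MAX_QUEUE_THRESHOLD_MS:
        raise ValueError(
            f"queue_threshold must be between {MIN_QUEUE_THRESHOLD_MS} and "
            f"{MAX_QUEUE_THRESHOLD_MS} ms, got {threshold_ms}"
        )
    # Header values are sent as strings, as in the examples above.
    return {"queue_threshold": str(threshold_ms)}
```

The returned dictionary can be passed as `extra_headers=queue_threshold_headers(100)` in the Python call shown earlier in this section.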

## FAQ

<AccordionGroup>
  <Accordion title="How do rate limits apply across service tiers?">
    `priority` and `default` share the same rate limits. `flex` rate limits are tracked independently and are several times higher than the `default` limits.
  </Accordion>

  <Accordion title="Are priority, flex, or auto logged differently in usage tracking?">
    Yes. Log in to [cloud.cerebras.ai](https://cloud.cerebras.ai) and click **Analytics**. The graphs on the Analytics tab break down usage by service tier, so you can monitor consumption at each priority level.
  </Accordion>

  <Accordion title="Are priority, flex, or auto billed differently than default?">
    No. During the preview, all service tiers are billed at the same rate.
  </Accordion>

  <Accordion title="Will my request ever be processed on a lower service tier if I do not set service_tier to auto?">
    No, only requests set to `auto` can be processed on a lower service tier.
  </Accordion>

  <Accordion title="Can I set queue time threshold on other service tiers?">
    The queue time threshold applies only when a request is processed on the `flex` tier. You can set it on requests that use `auto` or `flex`, but it is evaluated only if the request is actually processed on the `flex` tier.
  </Accordion>
</AccordionGroup>
