Skip to main content
Rate limits ensure fair usage and system stability by regulating how often users and applications can access our API within a specified timeframe. They help protect our service from abuse or misuse and keep your access fair and without slowdowns.

How are rate limits measured?

We measure rate limits in requests sent and tokens used within a specified timeframe:
  • Requests per minute/hour/day (RPM, RPH, RPD)
  • Tokens per minute/hour/day (TPM, TPH, TPD)
Rate limiting can be triggered by any metric, whichever comes first. For example, you have a rate limit of 50 RPM and 200K TPM. If you submit 50 requests in one minute with just 100 tokens each, you’ll hit your limit even though your total token usage (5,000) is far below the 200K token threshold. Rate limits apply at the organization level, not the user level, and vary based on the model.

Token Rate Limiting

When you send a request, we estimate the total tokens that will be consumed by:
  1. Estimating the input tokens in your prompt
  2. Adding either the max_completion_tokens parameter or the maximum sequence length (MSL), minus input tokens
If this estimated token consumption would exceed your available token quota, the request is rate limited before processing begins. This ensure fair usage and system stability. Best practice: Set max_completion_tokens appropriately for your use case to avoid overestimating token usage and triggering unnecessary rate limits.

Quota Replenishment

Your quota is calculated as:
Available quota = Rate limit - Usage in current time window
We use the token bucketing algorithm for rate limiting, which means your capacity replenishes continuously rather than resetting at fixed intervals. As you consume tokens or requests, your available capacity automatically refills up to your maximum limit. This token bucketing approach ensures smoother API access and prevents the “burst at interval start, then idle” pattern.

Limits by Tier

This provides an overview of general limits, though specific cases may vary. For precise, up-to-date rate limit information applicable to your organization, check the Limits section within your account.
  • Free
  • Developer
ModelTPMTPHTPDRPMRPHRPD
gpt-oss-120b60K1M1M3090014.4K
llama3.1-8b60K1M1M3090014.4K
llama-3.3-70b60K1M1M3090014.4K
qwen-3-32b60K1M1M3009014.4K
qwen-3-235b-a22b-instruct-250760K1M1M3090014.4K
zai-glm-4.6150K1M1M10100100

Rate Limit Headers

To help you monitor your usage in real time, we inject several custom headers into every API response. These headers provide insight into your current usage and when your limits will reset. You’ll find the following headers in the response:
HeaderDescription
x-ratelimit-limit-requests-dayMaximum number of requests allowed per day.
x-ratelimit-limit-tokens-minuteMaximum number of tokens allowed per minute.
x-ratelimit-remaining-requests-dayNumber of requests remaining for the current day.
x-ratelimit-remaining-tokens-minuteNumber of tokens remaining for the current minute.
x-ratelimit-reset-requests-dayTime (in seconds) until your daily request limit resets.
x-ratelimit-reset-tokens-minuteTime (in seconds) until your per-minute token limit resets.
These values update with each API call, giving you immediate visibility into your current usage.

Example

You can view these headers by adding the --verbose flag to a cURL request:
curl --location 'https://api.cerebras.ai/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer ${CEREBRAS_API_KEY}" \
--data '{
  "model": "llama3.1-8b",
  "stream": false,
  "messages": [{"content": "Hello!", "role": "user"}],
  "temperature": 0,
  "max_completion_tokens": -1,
  "seed": 0,
  "top_p": 1
}' \
--verbose
In the response, look for headers like these:
x-ratelimit-limit-requests-day: 1000000000
x-ratelimit-limit-tokens-minute: 1000000000
x-ratelimit-remaining-requests-day: 999997455
x-ratelimit-remaining-tokens-minute: 999998298
x-ratelimit-reset-requests-day: 33011.382867097855
x-ratelimit-reset-tokens-minute: 11.382867097854614

Notes

If you exceed your rate limits, you will receive a 429 Too Many Requests error.
If you have questions about your usage or need higher rate limits, contact us via our website, or reach out to your account representative.