Prompt caching is enabled by default for the following models:
How It Works
Unlike other providers that require manual cache breakpoints or header modifications, Cerebras Prompt Caching works automatically on all supported API requests. No code changes are required.
- Prefix Matching: When you send a request, the system analyzes the beginning of your prompt (the prefix). This includes system prompts, tool definitions, and few-shot examples.
- Block-Based Caching: The system processes prompts in blocks (typically 100–600 tokens). If a block matches a segment stored in our ephemeral memory from a recent request within your organization, the computation is reused.
- Cache Hit: Reusing cached blocks skips the processing phase for those tokens, resulting in lower latency.
- Cache Miss: If no match is found, the prompt is processed as normal, and the prefix is stored in the cache for potential future matches.
- Automatic Expiration: Cached data is ephemeral. We guarantee a Time-To-Live (TTL) of 5 minutes, though caches may persist up to 1 hour depending on system load.
To get a cache hit, the entire beginning of your prompt must match exactly with a previously cached prefix. Even a single character difference in the first token will result in a cache miss for that block and all subsequent blocks.
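As a mental model, the sketch below is a simplified, hypothetical illustration of block-based prefix matching (not the server-side implementation; the block size and matching logic here are assumptions): only full leading blocks that exactly match a previously seen prompt can be reused, and the first mismatch ends reuse for everything after it.

```python
def matching_prefix_blocks(cached_tokens: list[int], new_tokens: list[int], block_size: int = 256) -> int:
    """Count how many leading blocks of a new prompt match a previously seen prompt.

    Conceptual only: the real block size (roughly 100-600 tokens) and matching
    logic live server-side and are not configurable.
    """
    matched = 0
    limit = min(len(cached_tokens), len(new_tokens))
    for start in range(0, limit, block_size):
        end = start + block_size
        # Only full, identical leading blocks can be reused; the first
        # mismatch ends reuse for every block after it.
        if end <= limit and cached_tokens[start:end] == new_tokens[start:end]:
            matched += 1
        else:
            break
    return matched
```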
Structuring Prompts for Caching
To maximize cache hits and minimize latency, organize your prompts with static content first and dynamic content last. The system caches prompts from the beginning of the message. If you place dynamic content (like a timestamp or a unique User ID) at the start of the prompt, the prefix will differ for every request and the cache will never be triggered.
Static Content First
Place content that remains the same across multiple requests at the beginning:
- System instructions (“You are a helpful assistant…”)
- Tool definitions and schemas
- Few-shot examples
- Large context documents (e.g., a legal agreement or code base)
Dynamic Content Last
Place content that changes with each request at the end:
- User-specific questions
- Session variables
- Timestamps
- Optimized (Cache Hit): The “You are a coding assistant…” instruction block remains static, so it is cached and reused by subsequent requests; only the short timestamp and user query are processed fresh.
- Inefficient (Cache Miss): Placing a changing value such as the timestamp at the start of the prompt alters the prefix on every request, so no blocks are reused.
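As a concrete sketch of the two structures (assuming an OpenAI-style messages array; the prompt text, variable names, and timestamp format are placeholders), the only difference is where the dynamic timestamp goes:

```python
from datetime import datetime, timezone

timestamp = datetime.now(timezone.utc).isoformat()
user_question = "Why does my function return None?"

# Optimized (cache hit): the long, static instructions come first, so the
# prefix is identical across requests and its blocks can be reused.
optimized_messages = [
    {"role": "system", "content": "You are a coding assistant. Follow these style rules: ..."},
    {"role": "user", "content": f"Current time: {timestamp}\n\n{user_question}"},
]

# Inefficient (cache miss): the timestamp sits at the very start of the
# prompt, so the prefix changes on every request and nothing is reused.
inefficient_messages = [
    {"role": "system", "content": f"Current time: {timestamp}\n\nYou are a coding assistant. Follow these style rules: ..."},
    {"role": "user", "content": user_question},
]
```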
Track Cache Usage
Verify whether your requests are hitting the cache by viewing the cached_tokens field within the usage.prompt_tokens_details response object. This indicates the number of prompt tokens that were found in the cache.
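For example, here is a minimal sketch using the Cerebras Cloud Python SDK; the model name and prompts are placeholders, and the getattr guards assume the details object may be missing on some responses.

```python
import os

from cerebras.cloud.sdk import Cerebras

client = Cerebras(api_key=os.environ["CEREBRAS_API_KEY"])

response = client.chat.completions.create(
    model="llama3.1-8b",  # placeholder: any supported model
    messages=[
        {"role": "system", "content": "You are a helpful assistant. ..."},
        {"role": "user", "content": "Summarize the key terms of this agreement."},
    ],
)

# cached_tokens > 0 means part of the prompt was served from the cache.
details = getattr(response.usage, "prompt_tokens_details", None)
cached = getattr(details, "cached_tokens", None) if details else None
print(f"Cached prompt tokens: {cached or 0}")
```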
FAQs
Do cached tokens count toward rate limits?
Yes. All cached tokens contribute to your standard Tokens Per Minute (TPM) rate limits.
Calculation: cached_tokens + input_tokens (fresh) = Total TPM usage for that request. For example, a request with 1,800 cached tokens and 200 fresh input tokens counts as 2,000 tokens toward TPM.
How are cached tokens priced?
There is no additional fee for using prompt caching. Input tokens, whether served from the cache or processed fresh, are billed at the standard input token rate for the respective model.
I'm sending the same request but not seeing it being cached. Why is that?
There are three common reasons for a cache miss on identical requests:
- Block Size: We cache in “blocks” (typically 100–600 tokens). If a request or prefix is shorter than the minimum block size, it may not be cached.
- Data Center Routing: While we make a best effort to route you to the same data center, traffic profiles may occasionally route you to a different location where your cache does not exist.
- TTL Expiration: If requests are sent more than 5 minutes apart, the cache may have been evicted.
Is prompt caching enabled for all customers?
Yes, prompt caching is automatically enabled for all users on supported models.
Is prompt caching secure?
Yes, prompt caching is fully ZDR-compliant (Zero Data Retention). All cached context remains ephemeral in memory and is never persisted. Cached tokens are stored in key-value stores colocated in the same data center as the model instance serving your traffic.
How is data privacy maintained for caches?
Prompt caches are never shared between organizations. Only members of your organization can benefit from caches created by identical prompts within your team.
Does prompt caching affect output quality or speed?
Caching only affects the input processing phase (how we read your prompt); the output generation phase runs at exactly the same speed and quality. You will receive the same quality response, just with faster prompt processing.
Can I manually clear the cache?
No manual cache management is required or available. The system automatically manages cache eviction based on the TTL (5 minutes to 1 hour).
What are the TTL guarantees?
The guaranteed TTL is 5 minutes; depending on system load, caches may persist for up to 1 hour.
How can I tell when caching is working?
Check the usage.prompt_tokens_details.cached_tokens field in your API response. When it's greater than 0, caching was used for that request. Additionally, log in to cloud.cerebras.ai and click Analytics to track your cache usage.
