This feature is designed to significantly reduce Time to First Token (TTFT) and improve responsiveness for long-context workloads, such as multi-turn conversations, RAG (Retrieval-Augmented Generation), and agentic workflows.
How It Works
Unlike other providers that require manual cache breakpoints or header modifications, Cerebras Prompt Caching works automatically on all supported API requests. No code changes are required.
- Prefix Matching: When you send a request, the system analyzes the beginning of your prompt (the prefix). This includes system prompts, tool definitions, and few-shot examples.
- Block-Based Caching: The system processes prompts in 128-token blocks. If a block matches a segment stored in our ephemeral memory from a recent request within your organization, the computation is reused.
- Cache Hit: Reusing cached blocks skips the processing phase for those tokens, resulting in lower latency.
- Cache Miss: If no match is found, the prompt is processed as normal, and the prefix is stored in the cache for potential future matches.
- Automatic Expiration: Cached data is ephemeral. We guarantee a Time-To-Live (TTL) of 5 minutes, though caches may persist up to 1 hour depending on system load.
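As a rough illustration of the block granularity described above (the arithmetic below is an assumption based on the 128-token block size, not an official formula), only complete 128-token blocks of a shared prefix can be reused:

```python
# Illustrative only: caching happens server-side and automatically.
BLOCK_SIZE = 128  # tokens per cache block, per the description above

def max_reusable_tokens(shared_prefix_tokens: int) -> int:
    """Upper bound on prefix tokens that fall on complete 128-token blocks."""
    return (shared_prefix_tokens // BLOCK_SIZE) * BLOCK_SIZE

# A 1,000-token shared prefix covers 7 complete blocks, so at most 896 of those
# tokens can be served from the cache; the remaining 104 tokens (plus anything
# new after the prefix) are processed fresh.
print(max_reusable_tokens(1000))  # 896
```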
Example: Multi-Turn Conversation with Tool Calling
In this scenario, a shopping assistant helps users check order status and cancel orders using two tools: get_order_status and cancel_order. The system message and tool definitions remain constant across turns and are cached, while the conversation progresses naturally.
What gets cached across turns:
- System message: The shopping assistant instructions remain identical across all turns
- Tool definitions: Both order management tool schemas (including parameters and descriptions) stay constant
- Conversation history: Previous user messages, assistant responses, and tool results are all cached as the conversation grows
What gets processed fresh each turn:
- New user messages (the latest question)
- New tool execution results
- The model’s reasoning and decision-making for the current turn
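A minimal sketch of this flow, assuming the Cerebras Python SDK and an OpenAI-style tools array; the tool schemas, model name, and message contents are illustrative assumptions rather than the documented example:

```python
import os
from cerebras.cloud.sdk import Cerebras  # pip install cerebras_cloud_sdk

client = Cerebras(api_key=os.environ["CEREBRAS_API_KEY"])

# Static prefix: identical on every turn, so it can be served from the cache.
system_message = {
    "role": "system",
    "content": "You are a shopping assistant. Help users check and cancel orders.",
}
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_order_status",
            "description": "Look up the current status of an order.",
            "parameters": {
                "type": "object",
                "properties": {"order_id": {"type": "string"}},
                "required": ["order_id"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "cancel_order",
            "description": "Cancel an order that has not shipped yet.",
            "parameters": {
                "type": "object",
                "properties": {"order_id": {"type": "string"}},
                "required": ["order_id"],
            },
        },
    },
]

messages = [system_message]

# Turn 1: the whole prompt is new, so expect a cache miss (the prefix is then stored).
messages.append({"role": "user", "content": "What's the status of order 1234?"})
turn1 = client.chat.completions.create(
    model="llama-3.3-70b",  # assumption: any tool-calling-capable model
    messages=messages,
    tools=tools,
)

assistant_msg = turn1.choices[0].message
messages.append(assistant_msg.model_dump(exclude_none=True))
if assistant_msg.tool_calls:
    # Execute the tool and append its (stubbed) result so the history stays valid.
    messages.append({
        "role": "tool",
        "tool_call_id": assistant_msg.tool_calls[0].id,
        "content": '{"order_id": "1234", "status": "shipped"}',
    })

# Turn 2: the system message, tool definitions, and turn-1 history form an
# unchanged prefix, so those tokens can be reused; only the new user message
# (and the model's new reasoning) is processed fresh.
messages.append({"role": "user", "content": "Please cancel that order."})
turn2 = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=messages,
    tools=tools,
)

details = turn2.usage.prompt_tokens_details
print("cached prompt tokens on turn 2:", details.cached_tokens if details else 0)
```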
Structuring Prompts for Caching
To maximize cache hits and minimize latency, organize your prompts with static content first and dynamic content last. The system caches prompts from the beginning of the message. If you place dynamic content (like a timestamp or a unique User ID) at the start of the prompt, the prefix will differ for every request and the cache will never be triggered.
Static Content First
- System instructions (“You are a helpful assistant…”)
- Tool definitions and schemas
- Few-shot examples
- Large context documents (e.g., a legal agreement or code base)
The two layouts below contrast an optimized ordering (cache hit) with an inefficient one (cache miss).
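Here is a minimal sketch of that contrast; the placeholder document, user ID, and variable names are illustrative assumptions rather than an official example:

```python
import time

# Illustrative placeholders (assumptions, not from the documentation).
SYSTEM_INSTRUCTIONS = "You are a helpful assistant that answers questions about the agreement below."
LEGAL_AGREEMENT = "<full text of the legal agreement>"
user_id = "user-42"
question = "Which clauses cover early termination?"
timestamp = int(time.time())

# Optimized (cache hit): static content first, dynamic content last.
# The system instructions and the large document form a stable prefix that is
# identical across requests, so it can be served from the cache.
optimized_messages = [
    {"role": "system", "content": SYSTEM_INSTRUCTIONS + "\n\n" + LEGAL_AGREEMENT},
    {"role": "user", "content": f"[user_id={user_id}, ts={timestamp}] {question}"},
]

# Inefficient (cache miss): a per-request value at the very start of the prompt
# changes the prefix every time, so no earlier request's cache can match.
inefficient_messages = [
    {"role": "system", "content": f"[ts={timestamp}] {SYSTEM_INSTRUCTIONS}\n\n{LEGAL_AGREEMENT}"},
    {"role": "user", "content": question},
]
```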
Maximize Cache Hits with prompt_cache_key
prompt_cache_key must be enabled on your account before you can use it. Contact us or reach out to your account representative to request access. Once enabled, pass prompt_cache_key on each request. This is an optional opaque string that tells the system which requests are likely to share a common prompt prefix, so it can route those requests to the same prompt cache.
Prompt caching and prefix matching work automatically on every request, so you benefit from caching even without setting prompt_cache_key. Set it when you want to give the system an explicit hint that keeps related requests on the same prompt cache. Under load, turn 1 of a session can be routed to one prompt cache and turn 2 to another, causing a cache miss even though the prefixes match. Passing the same prompt_cache_key on every turn tells the system to route them together on a best-effort basis.
When to set it
Choose a stable identifier that represents the shared context across your related requests:
- Multi-turn chat session: Use the conversation ID so every turn in the same conversation uses the same prompt cache.
- Per-user workload over a shared system prompt: Use the user or session ID.
- RAG or agentic workload with a shared prefix: Use a hash of the static portion – system prompt, tool definitions, few-shot examples, and any shared reference document.
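For instance, here is a sketch of deriving a key for the RAG or agentic case by hashing the static portion of the prompt; the variable contents are placeholder assumptions:

```python
import hashlib
import json

# Placeholder static prefix for a RAG workload (contents are assumptions).
system_prompt = "You are a contracts analyst. Answer using only the agreement provided."
tool_definitions = []          # tool schemas shared by every request, if any
reference_doc = "<shared reference document>"

# Hash the static portion so every request built on the same system prompt,
# tools, and document maps to the same prompt_cache_key.
static_prefix = json.dumps(
    {"system": system_prompt, "tools": tool_definitions, "doc": reference_doc},
    sort_keys=True,
)
prompt_cache_key = hashlib.sha256(static_prefix.encode("utf-8")).hexdigest()
# A SHA-256 hex digest is 64 characters, comfortably under the 1024-character limit.
```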
Guidance
- Reuse the same key for related requests. A one-off unique value provides no improvement over omitting the field.
- Rotate the key when the shared prefix changes. If you change the system prompt or tool definitions, use a new prompt_cache_key. Otherwise the system will keep routing requests to a prompt cache whose contents no longer match.
- Maximum length: 1024 characters. Longer values are rejected with a 400 error.
The prompt_cache_key value is hashed before it is written to internal logs or metrics. prompt_cache_key is optional and does not change billing: cached tokens are priced the same whether the key is set or not.
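As a sketch of how the key is sent, assuming the OpenAI-compatible chat completions endpoint at api.cerebras.ai and an illustrative model name (check your SDK version for native prompt_cache_key support, or send it as a top-level body field as shown):

```python
import os
import requests

conversation_id = "conv-8f3a21"  # stable per conversation, reused on every turn

response = requests.post(
    "https://api.cerebras.ai/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['CEREBRAS_API_KEY']}"},
    json={
        "model": "llama-3.3-70b",  # illustrative model name
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Where did we leave off?"},
        ],
        # Same value on every turn so all turns target the same prompt cache.
        "prompt_cache_key": conversation_id,
    },
    timeout=60,
)
usage = response.json()["usage"]
print(usage.get("prompt_tokens_details", {}).get("cached_tokens", 0))
```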
Track Cache Usage
Verify whether your requests are hitting the cache by viewing the cached_tokens field within the usage.prompt_tokens_details response object. This indicates the number of prompt tokens that were found in the cache.
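A minimal sketch of reading that field with the Cerebras Python SDK; the model name and prompt are illustrative assumptions:

```python
import os
from cerebras.cloud.sdk import Cerebras

client = Cerebras(api_key=os.environ["CEREBRAS_API_KEY"])

response = client.chat.completions.create(
    model="llama-3.3-70b",  # illustrative model name
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
)

details = response.usage.prompt_tokens_details
cached = details.cached_tokens if details else 0
print(f"{cached} of {response.usage.prompt_tokens} prompt tokens were served from the cache")
```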
FAQs
Do cached tokens count toward rate limits?
Yes. Cached tokens count toward your rate limits: cached_tokens + input_tokens (fresh) = total TPM usage for that request.
How are cached tokens priced?
I'm sending the same request but not seeing it being cached. Why is that?
- Block Size: We cache in 128-token blocks. If a request or prefix is shorter than 128 tokens, it may not be cached.
- Data Center Routing: While we make a best effort to route you to the same data center, traffic profiles may occasionally route you to a different location where your cache does not exist.
- TTL Expiration: If requests are sent more than 5 minutes apart, the cache may have been evicted.
Is prompt caching enabled for all customers?
Which models support prompt caching?
Is prompt caching secure?
How is data privacy maintained for caches?
Do I have to set prompt_cache_key to use prompt caching?
No. Prompt caching and prefix matching work automatically on every supported request. prompt_cache_key is an optional routing hint that groups related requests so they reuse the same prompt cache, which improves the hit rate for workloads that don't already rely on natural session affinity. See Maximize Cache Hits with prompt_cache_key.
Does prompt caching affect output quality or speed?
Can I manually clear the cache?
What are the TTL guarantees?
Cached data is ephemeral. We guarantee a Time-To-Live (TTL) of 5 minutes, though caches may persist up to 1 hour depending on system load.
How can I tell when caching is working?
Check the usage.prompt_tokens_details.cached_tokens field in your API response. When it's greater than 0, caching was used for that request. Additionally, log in to cloud.cerebras.ai and click Analytics to track your cache usage.
