How It Works
Unlike other providers that require manual cache breakpoints or header modifications, Cerebras Prompt Caching works automatically on all supported API requests. No code changes are required.- Prefix Matching: When you send a request, the system analyzes the beginning of your prompt (the prefix). This includes system prompts, tool definitions, and few-shot examples.
- Block-Based Caching: The system processes prompts in 128-token blocks. If a block matches a segment stored in our ephemeral memory from a recent request within your organization, the computation is reused.
- Cache Hit: Reusing cached blocks skips the processing phase for those tokens, resulting in lower latency.
- Cache Miss: If no match is found, the prompt is processed as normal, and the prefix is stored in the cache for potential future matches.
- Automatic Expiration: Cached data is ephemeral. We guarantee a Time-To-Live (TTL) of 5 minutes, though caches may persist up to 1 hour depending on system load.
Example: Multi-Turn Conversation with Tool Calling
In this scenario, a shopping assistant helps users check order status and cancel orders using two tools:get_order_status and cancel_order. The system message and tool definitions remain constant across turns and are cached, while the conversation progresses naturally.
- System message: The shopping assistant instructions remain identical across all turns
- Tool definitions: Both order management tool schemas (including parameters and descriptions) stay constant
- Conversation history: Previous user messages, assistant responses, and tool results are all cached as the conversation grows
- New user messages (the latest question)
- New tool execution results
- The model’s reasoning and decision-making for the current turn
Structuring Prompts for Caching
To maximize cache hits and minimize latency, organize your prompts with static content first and dynamic content last. The system caches prompts from the beginning of the message. If you place dynamic content (like a timestamp or a unique User ID) at the start of the prompt, the prefix will differ for every request and the cache will never be triggered.Static Content First
- System instructions (“You are a helpful assistant…”)
- Tool definitions and schemas
- Few-shot examples
- Large context documents (e.g., a legal agreement or code base)
- Optimized (Cache Hit)
- Inefficient (Cache Miss)
Maximize Cache Hits with prompt_cache_key
prompt_cache_key must be enabled on your account before you can use it. Contact us or reach out to your account representative to request access.prompt_cache_key on each request that shares the same prompt prefix. This optional opaque string tells the system which requests are likely to share a common prompt prefix, so it can route those requests to the same prompt cache.
Prompt caching and prefix matching work automatically on every request, so you benefit from caching even without setting prompt_cache_key. Use it when you want to give the system an explicit routing hint. This keeps requests from the same conversation or workflow on the same prompt cache. Under load, turn 1 of a session can be routed to one prompt cache and turn 2 to another, causing a cache miss even though the prefixes match.
When to Set a Cache Key
Choose a stable identifier for one conversation or workflow whose requests actually share a common prompt prefix:- Multi-turn chat session: Use the conversation ID so every turn in the same conversation uses the same prompt cache.
- Agentic or RAG workflow: Use one workflow ID only for a small set of steps that reuse the same prompt prefix.
prompt_cache_key values have a maximum length of 1024 characters. Longer values are rejected with a 400 error.Track Cache Usage
Verify if your requests are hitting the cache by viewing thecached_tokens field within the usage.prompt_token_details response object. This indicates the number of prompt tokens that were found in the cache.
FAQs
Do cached tokens count toward rate limits?
Do cached tokens count toward rate limits?
cached_tokens + input_tokens (fresh) = Total TPM usage for that request.How are cached tokens priced?
How are cached tokens priced?
I'm sending the same request but not seeing it being cached. Why is that?
I'm sending the same request but not seeing it being cached. Why is that?
- Block Size: We cache in 128-token blocks. If a request or prefix is shorter than 128 tokens, it may not be cached.
- Data Center Routing: While we make a best effort to route you to the same data center, traffic profiles may occasionally route you to a different location where your cache does not exist.
- TTL Expiration: If requests are sent more than 5 minutes apart, the cache may have been evicted.
Is prompt caching enabled for all customers?
Is prompt caching enabled for all customers?
Which models support prompt caching?
Which models support prompt caching?
Is prompt caching secure?
Is prompt caching secure?
How is data privacy maintained for caches?
How is data privacy maintained for caches?
Do I have to set prompt_cache_key to use prompt caching?
Do I have to set prompt_cache_key to use prompt caching?
prompt_cache_key is an optional routing hint for requests in the same conversation or workflow that share a common prompt prefix. Do not set it just to share one large system prompt or RAG context across many users. See Maximize Cache Hits with prompt_cache_key.Does prompt caching affect output quality or speed?
Does prompt caching affect output quality or speed?
Can I manually clear the cache?
Can I manually clear the cache?
What are the TTL guarantees?
What are the TTL guarantees?
How can I tell when caching is working?
How can I tell when caching is working?
usage.prompt_tokens_details.cached_tokens field in your API response. When it’s greater than 0, caching was used for that request.Additionally, log in to cloud.cerebras.ai and click Analytics to track your cache usage.
