This feature is in Private Preview. For access or more information, contact us or reach out to your account representative.
Key characteristics:
- Aggregated at the minute level - Data represents the last complete minute
- Pull-based - Your monitoring system queries the API on demand
- Rate-limited - Up to 6 requests per minute
Set Up Metrics Collection
Prerequisites: You need a dedicated Cerebras inference endpoint, your organization ID, and a valid API key.
1. Get your organization ID
Find your organization ID from your Cerebras account dashboard’s Settings page. It will look like `org_abc123`.
2. Configure your metrics endpoint
Use this URL format, replacing `<organization-id>` with your actual organization ID. This endpoint can be directly used by monitoring tools like Prometheus, Grafana Cloud, or Datadog.
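The exact URL is listed in the API reference; as an illustrative sketch only, it follows a pattern along these lines (the path shown here is an assumption, not the documented value):

```
# Hypothetical pattern - substitute the actual metrics URL from the API reference
https://api.cerebras.ai/v1/organizations/<organization-id>/metrics
```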
3. Set up authentication
Configure your monitoring tool to include the Authorization header with your Cerebras API key:
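For example, as a raw HTTP header (the Cerebras API uses bearer tokens):

```
Authorization: Bearer <your-api-key>
```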
4. Configure scrape interval
Set your monitoring system to poll the endpoint every 60 seconds (1 minute). This matches the metric aggregation window and stays within rate limits.
Rate limit: Maximum 6 requests per minute per organization.
5. Test your integration
Verify the connection by making a test request:
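For example, with curl (using the illustrative URL pattern from step 2; substitute your actual endpoint):

```bash
# The URL below is a hypothetical pattern - use the endpoint from the API reference
curl -H "Authorization: Bearer $CEREBRAS_API_KEY" \
  "https://api.cerebras.ai/v1/organizations/<organization-id>/metrics"
```

If the integration is working, the response should contain metric samples in the Prometheus text exposition format.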
Direct Prometheus Integration
To integrate directly with Prometheus, specify the Cerebras metrics endpoint in your scrape config (`prometheus.yml`).
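A minimal sketch, assuming the hypothetical endpoint path from step 2 (adjust `metrics_path` and the target to match the documented URL):

```yaml
# prometheus.yml - illustrative scrape config; metrics_path is an assumption
scrape_configs:
  - job_name: "cerebras-dedicated-endpoint"
    scheme: https
    scrape_interval: 60s   # matches the 1-minute aggregation window and rate limit
    metrics_path: /v1/organizations/<organization-id>/metrics
    authorization:
      type: Bearer
      credentials: <your-api-key>
    static_configs:
      - targets: ["api.cerebras.ai"]
```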
Available Metrics
For a complete list of available metrics, including endpoint health, requests, tokens, and latency percentiles, see the Available Metrics section in the API reference.
Understanding Percentiles
Latency metrics include multiple percentiles to give you a complete picture:
- avg - Mean latency across all requests
- p50 - Median latency (50th percentile)
- p90 - 90% of requests complete faster than this
- p95 - 95% of requests complete faster than this
- p99 - 99% of requests complete faster than this
For example, if `ttft_seconds{statistic="p95"}` = 0.5, then 95% of your requests receive their first token within 500ms.
Example PromQL Queries
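The sketches below use metric names referenced elsewhere on this page (`ttft_seconds`, `requests_failure_total`, `cache_rate`); check the Available Metrics section for the exact names and labels:

```promql
# p95 time-to-first-token over the last 5 minutes
max_over_time(ttft_seconds{statistic="p95"}[5m])

# failure rate by HTTP status code, averaged over 5 minutes
sum by (http_code) (rate(requests_failure_total[5m]))

# current cache hit rate
cache_rate
```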
Use Cases
Performance Monitoring
Track latency percentiles (avg, p50, p90, p95, p99) across multiple dimensions:
- Time to First Token (TTFT) - Measure initial response latency
- Time Per Output Token (TPOT) - Monitor generation speed
- End-to-end latency - Track total request duration
- Queue time - Identify capacity constraints
Usage Analytics
Monitor token consumption and request patterns:
- Track input and output tokens for cost analysis
- Measure cache hit rates with `cache_rate`
- Analyze request success and failure rates
- Monitor endpoint availability
Alerting and SLAs
Set up alerts based on the conditions below; an example alerting rule follows the list:
- Endpoint health status changes
- Latency threshold breaches
- Error rate spikes
- Request volume anomalies
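For instance, a Prometheus alerting rule for a latency SLO might look like this sketch (the 1-second threshold is an example, not a recommendation; adapt it to your SLA):

```yaml
# Illustrative alerting rule using the ttft_seconds metric referenced above
groups:
  - name: cerebras-slos
    rules:
      - alert: HighP95TTFT
        expr: ttft_seconds{statistic="p95"} > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p95 time-to-first-token above 1s for 5 minutes"
```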
Troubleshooting
High Queue Times
If `queue_time_seconds` percentiles are elevated:
- Check whether you’re hitting capacity limits on your dedicated endpoint in the Analytics tab on cloud.cerebras.ai
- Review request patterns for traffic spikes
- Contact your Cerebras representative about scaling options
Elevated Error Rates
If `requests_failure_total` is increasing:
- Check the `http_code` label to identify specific error types
- Review the Error Codes documentation
- Verify authentication tokens are valid and not expired
- Check Rate Limits if seeing 429 errors
Low Cache Hit Rates
If `cache_rate` is lower than expected:
- Verify Prompt Caching is enabled
- Check that prompts have sufficient shared prefixes
- Review cache configuration with your Cerebras representative
Metrics Not Updating
If metrics appear stale:
- Verify your API key has correct permissions
- Check that metrics are enabled for your organization
- Ensure you’re querying the correct `organization_id`
- Check for error responses from the API

