This feature is in Private Preview. For access or more information, contact us or reach out to your account representative.
The Metrics API is designed for customers on dedicated endpoints who need granular observability into their inference workloads. Metrics are:
  • Aggregated at the minute level - Data represents the last complete minute
  • Pull-based - Your monitoring system queries the API on demand
  • Rate-limited - Up to 6 requests per minute
For complete API details including request format, authentication, and error codes, see the Metrics API reference.

Set Up Metrics Collection

Prerequisites: You need a dedicated Cerebras inference endpoint, your organization ID, and a valid API key.
1. Get your organization ID

Find your organization ID from your Cerebras account dashboard’s Settings page. It will look like org_abc123.
2. Configure your metrics endpoint

Use this URL format, replacing <organization-id> with your actual organization ID:
https://cloud.cerebras.ai/api/v1/metrics/organizations/<organization-id>
This endpoint can be used directly by monitoring tools such as Prometheus, Grafana Cloud, or Datadog.
3. Set up authentication

Configure your monitoring tool to include the Authorization header with your Cerebras API key:
Authorization: Bearer YOUR_API_KEY
4. Configure scrape interval

Set your monitoring system to poll the endpoint every 60 seconds (1 minute). This matches the minute-level aggregation window and stays within the rate limit of 6 requests per minute per organization.
5. Test your integration

Verify the connection by making a test request:
curl -H "Authorization: Bearer $CEREBRAS_API_KEY" \
  https://cloud.cerebras.ai/api/v1/metrics/organizations/org_abc123
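If you would rather verify from a script, the sketch below parses a response body. It assumes the endpoint returns metrics in the Prometheus text exposition format (which direct Prometheus scraping implies); the sample payload and the `parse_metrics` helper are illustrative, not part of the API.

```python
import re

def parse_metrics(text: str) -> dict:
    """Parse Prometheus text exposition format into {(name, labels): value}.

    Minimal parser for illustration: ignores HELP/TYPE comments and
    timestamps, and does not handle escaped label values.
    """
    metrics = {}
    pattern = re.compile(r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
                         r'(?P<labels>\{[^}]*\})?\s+(?P<value>\S+)')
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        m = pattern.match(line)
        if m:
            labels = m.group("labels") or ""
            metrics[(m.group("name"), labels)] = float(m.group("value"))
    return metrics

# Sample payload; the metric names follow the examples in this guide.
sample = """\
# HELP ttft_seconds Time to first token
# TYPE ttft_seconds gauge
ttft_seconds{statistic="p95"} 0.5
requests_success_total 1200
"""
parsed = parse_metrics(sample)
print(parsed[("ttft_seconds", '{statistic="p95"}')])  # prints 0.5
```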

Direct Prometheus Integration

To integrate directly with Prometheus, specify the Cerebras metrics endpoint in your scrape config:
prometheus.yml
global:
  scrape_interval: 60s
scrape_configs:
  - job_name: 'cerebras-inference'
    metrics_path: '/api/v1/metrics/organizations/<organization-id>'
    authorization:
      type: "Bearer"
      credentials: "YOUR_API_KEY"
    static_configs:
      - targets: ['cloud.cerebras.ai']
    scheme: https
For more details on Prometheus configuration, refer to the Prometheus documentation.

Available Metrics

For a complete list of available metrics including endpoint health, requests, tokens, and latency percentiles, see the Available Metrics section in the API reference.

Understanding Percentiles

Latency metrics include multiple percentiles to give you a complete picture:
  • avg - Mean latency across all requests
  • p50 - Median latency (50th percentile)
  • p90 - 90% of requests complete faster than this
  • p95 - 95% of requests complete faster than this
  • p99 - 99% of requests complete faster than this
Example: If ttft_seconds{statistic="p95"} = 0.5, then 95% of your requests receive their first token within 500ms.
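To make the percentile definitions concrete, here is a quick illustration (plain Python, not tied to the API) of how avg, p50, and p95 relate to a set of latency samples, using the nearest-rank method:

```python
import math

# Illustrative TTFT samples (seconds), as a reader might collect
# to sanity-check the API's reported statistics.
latencies = [0.12, 0.15, 0.18, 0.22, 0.25, 0.31, 0.38, 0.44, 0.52, 0.90]

def percentile(values, p):
    """Nearest-rank percentile: smallest value >= p% of the samples."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

avg = sum(latencies) / len(latencies)
print(f"avg={avg:.3f}  p50={percentile(latencies, 50)}  "
      f"p95={percentile(latencies, 95)}")
# avg=0.347  p50=0.25  p95=0.9
```

Note how one slow outlier (0.90s) barely moves the average but dominates p95, which is why latency SLAs are usually stated in percentiles rather than means.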

Example PromQL Queries

# Average end-to-end latency
avg(e2e_latency_seconds{statistic="avg"})

# Request success rate
rate(requests_success_total[5m]) / rate(requests_count_total[5m])

# Token throughput (tokens per second)
rate(output_tokens_total[5m])

# Average cache hit rate over 5 minutes
avg_over_time(cache_rate[5m])

# P95 time to first token
ttft_seconds{statistic="p95"}

Use Cases

Performance Monitoring

Track latency percentiles (avg, p50, p90, p95, p99) across multiple dimensions:
  • Time to First Token (TTFT) - Measure initial response latency
  • Time Per Output Token (TPOT) - Monitor generation speed
  • End-to-end latency - Track total request duration
  • Queue time - Identify capacity constraints
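Using the metric names that appear elsewhere in this guide (ttft_seconds, e2e_latency_seconds, queue_time_seconds), the dimensions above map to PromQL selectors such as:

```promql
# P99 time to first token
ttft_seconds{statistic="p99"}

# P95 end-to-end latency
e2e_latency_seconds{statistic="p95"}

# Elevated queue time suggests capacity pressure
queue_time_seconds{statistic="p95"}
```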

Usage Analytics

Monitor token consumption and request patterns:
  • Track input and output tokens for cost analysis
  • Measure cache hit rates with cache_rate
  • Analyze request success and failure rates
  • Monitor endpoint availability

Alerting and SLAs

Set up alerts based on:
  • Endpoint health status changes
  • Latency threshold breaches
  • Error rate spikes
  • Request volume anomalies
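The alert conditions above can be expressed as Prometheus alerting rules. The sketch below is illustrative: the metric names are taken from the query examples in this guide, and the thresholds are placeholders to tune against your own SLAs.

```yaml
# alerts.yml - illustrative rules; adjust thresholds to your SLAs
groups:
  - name: cerebras-inference
    rules:
      - alert: HighErrorRate
        expr: >
          rate(requests_failure_total[5m])
            / rate(requests_count_total[5m]) > 0.05
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Error rate above 5% for 5 minutes"
      - alert: SlowFirstToken
        expr: ttft_seconds{statistic="p95"} > 1
        for: 10m
        labels:
          severity: warn
        annotations:
          summary: "P95 time to first token above 1s"
```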

Troubleshooting

If queue_time_seconds percentiles are elevated:
  1. Check whether you’re hitting capacity limits on your dedicated endpoint in the Analytics tab on cloud.cerebras.ai
  2. Review request patterns for traffic spikes
  3. Contact your Cerebras representative about scaling options
If requests_failure_total is increasing:
  1. Check the http_code label to identify specific error types
  2. Review the Error Codes documentation
  3. Verify authentication tokens are valid and not expired
  4. Check Rate Limits if seeing 429 errors
If cache_rate is lower than expected:
  1. Verify Prompt Caching is enabled
  2. Check that prompts have sufficient shared prefixes
  3. Review cache configuration with your Cerebras representative
If metrics appear stale:
  1. Verify your API key has correct permissions
  2. Check that metrics are enabled for your organization
  3. Ensure you’re querying the correct organization_id
  4. Check for error responses from the API
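When working through the checklists above, a small helper that maps the endpoint's HTTP status to the relevant checklist item can speed up triage. This is an illustrative sketch; the status-to-cause mapping reflects the guidance in this guide, not an official error table.

```python
def diagnose(status_code: int) -> str:
    """Map an HTTP status from the metrics endpoint to a likely cause.

    Illustrative mapping based on the troubleshooting steps in this
    guide; not an official error table.
    """
    hints = {
        401: "Authentication failed: verify the API key is valid and not expired.",
        403: "Permission denied: check the key's permissions and that metrics "
             "are enabled for your organization.",
        404: "Not found: confirm the organization ID in the endpoint URL.",
        429: "Rate limited: keep scrapes at or below 6 requests per minute.",
    }
    if 200 <= status_code < 300:
        return "OK: endpoint reachable."
    return hints.get(status_code,
                     f"Unexpected status {status_code}: check the error response body.")

print(diagnose(429))
```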