This feature is in Private Preview. For access or more information, contact us or reach out to your account representative.
Key characteristics:
- Aggregated at the minute level - Data represents the last complete minute
- Pull-based - Your monitoring system queries the API on demand
- Rate-limited - Up to 6 requests per minute
Set Up Metrics Collection
Prerequisites: You need a dedicated Cerebras inference endpoint, your organization ID, and a valid API key.
1. Get your organization ID
Find your organization ID from your Cerebras account dashboard’s Settings page. It will look like `org_abc123`.
2. Configure your metrics endpoint
Use this URL format, replacing `<organization-id>` with your actual organization ID. This endpoint can be directly used by monitoring tools like Prometheus, Grafana Cloud, or Datadog.
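The exact URL is listed in the API reference; as an illustrative sketch only, it follows a pattern along these lines (the path shown here is an assumption, not the documented value):

```
# Hypothetical pattern - substitute the actual metrics URL from the API reference
https://api.cerebras.ai/v1/organizations/<organization-id>/metrics
```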
3. Set up authentication
Configure your monitoring tool to include the Authorization header with your Cerebras API key:
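For example, as a raw HTTP header (the Cerebras API uses bearer tokens):

```
Authorization: Bearer <your-api-key>
```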
4. Configure scrape interval
Set your monitoring system to poll the endpoint every 60 seconds (1 minute). This matches the metric aggregation window and stays within rate limits.
Rate limit: Maximum 6 requests per minute per organization.
5. Test your integration
Verify the connection by making a test request:
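For example, with curl (using the illustrative URL pattern from step 2; substitute your actual endpoint):

```bash
# The URL below is a hypothetical pattern - use the endpoint from the API reference
curl -H "Authorization: Bearer $CEREBRAS_API_KEY" \
  "https://api.cerebras.ai/v1/organizations/<organization-id>/metrics"
```

If the integration is working, the response should contain metric samples in the Prometheus text exposition format.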
Direct Prometheus Integration
To integrate directly with Prometheus, specify the Cerebras metrics endpoint in your scrape config (`prometheus.yml`).
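A minimal sketch, assuming the hypothetical endpoint path from step 2 (adjust `metrics_path` and the target to match the documented URL):

```yaml
# prometheus.yml - illustrative scrape config; metrics_path is an assumption
scrape_configs:
  - job_name: "cerebras-dedicated-endpoint"
    scheme: https
    scrape_interval: 60s   # matches the 1-minute aggregation window and rate limit
    metrics_path: /v1/organizations/<organization-id>/metrics
    authorization:
      type: Bearer
      credentials: <your-api-key>
    static_configs:
      - targets: ["api.cerebras.ai"]
```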
Available Metrics
For a complete list of available metrics, including endpoint health, requests, tokens, and latency percentiles, see the Available Metrics section in the API reference.
Understanding Percentiles
Latency metrics include multiple percentiles to give you a complete picture:
- avg - Mean latency across all requests
- p50 - Median latency (50th percentile)
- p90 - 90% of requests complete faster than this
- p95 - 95% of requests complete faster than this
- p99 - 99% of requests complete faster than this
For example, if `ttft_seconds{statistic="p95"}` = 0.5, then 95% of your requests receive their first token within 500ms.
Example PromQL Queries
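The sketches below use metric names referenced elsewhere on this page (`ttft_seconds`, `requests_failure_total`, `cache_rate`); check the Available Metrics section for the exact names and labels:

```promql
# p95 time-to-first-token over the last 5 minutes
max_over_time(ttft_seconds{statistic="p95"}[5m])

# failure rate by HTTP status code, averaged over 5 minutes
sum by (http_code) (rate(requests_failure_total[5m]))

# current cache hit rate
cache_rate
```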
Use Cases
Performance Monitoring
Track latency percentiles (avg, p50, p90, p95, p99) across multiple dimensions:
- Time to First Token (TTFT) - Measure initial response latency
- Time Per Output Token (TPOT) - Monitor generation speed
- End-to-end latency - Track total request duration
- Queue time - Identify capacity constraints
Usage Analytics
Monitor token consumption and request patterns:
- Track input and output tokens for cost analysis
- Measure cache hit rates with `cache_rate`
- Analyze request success and failure rates
- Monitor endpoint availability
Alerting and SLAs
Set up alerts based on the conditions below; an example alerting rule follows the list:
- Endpoint health status changes
- Latency threshold breaches
- Error rate spikes
- Request volume anomalies
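For instance, a Prometheus alerting rule for a latency SLO might look like this sketch (the 1-second threshold is an example, not a recommendation; adapt it to your SLA):

```yaml
# Illustrative alerting rule using the ttft_seconds metric referenced above
groups:
  - name: cerebras-slos
    rules:
      - alert: HighP95TTFT
        expr: ttft_seconds{statistic="p95"} > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p95 time-to-first-token above 1s for 5 minutes"
```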
Troubleshooting
High Queue Times
If `queue_time_seconds` percentiles are elevated:
- Check whether you’re hitting capacity limits on your dedicated endpoint in the Analytics tab on cloud.cerebras.ai
- Review request patterns for traffic spikes
- Contact your Cerebras representative about scaling options
Elevated Error Rates
If `requests_failure_total` is increasing:
- Check the `http_code` label to identify specific error types
- Review the Error Codes documentation
- Verify authentication tokens are valid and not expired
- Check Rate Limits if seeing 429 errors
Low Cache Hit Rates
If `cache_rate` is lower than expected:
- Verify Prompt Caching is enabled
- Check that prompts have sufficient shared prefixes
- Review cache configuration with your Cerebras representative
Metrics Not Updating
If metrics appear stale:
- Verify your API key has correct permissions
- Check that metrics are enabled for your organization
- Ensure you’re querying the correct `organization_id`
- Check for error responses from the API

