# Metrics

> Monitor your dedicated inference endpoints with Prometheus-compatible metrics for requests, tokens, latency, and endpoint health.

The Metrics API is designed for customers on dedicated endpoints who need granular observability into their inference workloads. Metrics are:

* Aggregated at the minute level - Data represents the last complete minute
* Pull-based - Your monitoring system queries the API on demand
* Rate-limited - Up to 6 requests per minute

For complete API details including request format, authentication, and error codes, see the [Metrics API reference](/api-reference/metrics/retrieve-metrics).

## Set Up Metrics Collection

<Note>
  **Prerequisites:** You need a dedicated Cerebras inference endpoint, your organization ID, and a valid API key.
</Note>

<Steps>
  <Step title="Get your organization ID">
    Find your organization ID on the Settings page of your Cerebras account dashboard. It looks like `org_abc123`.
  </Step>

  <Step title="Configure your metrics endpoint">
    Use this URL format, replacing `<organization-id>` with your actual organization ID:

    ```
    https://cloud.cerebras.ai/api/v1/metrics/organizations/<organization-id>
    ```

    Monitoring tools such as Prometheus, Grafana Cloud, or Datadog can scrape this endpoint directly.
  </Step>

  <Step title="Set up authentication">
    Configure your monitoring tool to include the Authorization header with your Cerebras API key:

    ```
    Authorization: Bearer YOUR_API_KEY
    ```
  </Step>

  <Step title="Configure scrape interval">
    Set your monitoring system to poll the endpoint every 60 seconds. This matches the one-minute aggregation window, so each scrape returns fresh data, and it stays well within the rate limit.

    **Rate limit:** Maximum of 6 requests per minute per organization.
  </Step>

  <Step title="Test your integration">
    Verify the connection by making a test request:

    <CodeGroup>
      ```bash cURL theme={null}
      curl -H "Authorization: Bearer $CEREBRAS_API_KEY" \
        https://cloud.cerebras.ai/api/v1/metrics/organizations/org_abc123
      ```

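      ```python Python theme={null}
      import os

      import requests

      # Read the key from the environment; raises KeyError if it is unset
      api_key = os.environ["CEREBRAS_API_KEY"]

      response = requests.get(
          "https://cloud.cerebras.ai/api/v1/metrics/organizations/org_abc123",
          headers={"Authorization": f"Bearer {api_key}"},
          timeout=30,
      )
      response.raise_for_status()  # fail loudly on 4xx/5xx responses

      # The body is plain text in Prometheus format
      print(response.text)
      ```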

      ```javascript Node.js theme={null}
      const response = await fetch(
        'https://cloud.cerebras.ai/api/v1/metrics/organizations/org_abc123',
        {
          headers: {
            'Authorization': `Bearer ${process.env.CEREBRAS_API_KEY}`
          }
        }
      );

      const metrics = await response.text();
      console.log(metrics);
      ```
    </CodeGroup>
  </Step>
</Steps>
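
Once the test request succeeds, you may want to read individual values out of the response. The following is a minimal sketch of a parser, assuming the body follows the standard Prometheus text format (`name{label="value"} number`). The `parse_metrics` helper and the sample payload are illustrative and not part of the Cerebras API; real output may contain metrics and labels not shown here.

```python
import re

# Matches lines such as: metric_name{label="value"} 1.5
# Simplified on purpose: lines with trailing timestamps are skipped, and
# label values are assumed to contain no escaped quotes.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from Prometheus text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and "# HELP" / "# TYPE" comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore anything this simplified pattern cannot read
        name, raw_labels, raw_value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


# Hypothetical payload for illustration only.
SAMPLE_TEXT = """\
# TYPE ttft_seconds gauge
ttft_seconds{statistic="p50"} 0.21
ttft_seconds{statistic="p95"} 0.5
requests_count_total 120
"""

for name, labels, value in parse_metrics(SAMPLE_TEXT):
    print(name, labels, value)
```

For production use, a maintained parser (for example, the one shipped with the `prometheus_client` Python package) is likely a better choice than a hand-written regular expression.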

## Direct Prometheus Integration

To integrate directly with Prometheus, specify the Cerebras metrics endpoint in your scrape config:

```yaml prometheus.yml theme={null}
global:
  scrape_interval: 60s
scrape_configs:
  - job_name: 'cerebras-inference'
    metrics_path: '/api/v1/metrics/organizations/<organization-id>'
    authorization:
      type: "Bearer"
      credentials: "YOUR_API_KEY"
    static_configs:
      - targets: ['cloud.cerebras.ai']
    scheme: https
```

For more details on Prometheus configuration, refer to the [Prometheus documentation](https://prometheus.io/docs/prometheus/latest/configuration/configuration/).
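
If you collect metrics with your own script instead of Prometheus, the same 60-second cadence applies. Below is a rough sketch of a polling loop that refuses intervals shorter than the documented rate limit allows. The `poll` function and its parameters are hypothetical names chosen for this example; `fetch` stands in for whatever function performs the HTTP request.

```python
import time

RATE_LIMIT_PER_MINUTE = 6      # documented limit per organization
SCRAPE_INTERVAL_SECONDS = 60   # matches the one-minute aggregation window


def poll(fetch, iterations, interval=SCRAPE_INTERVAL_SECONDS,
         sleep=time.sleep, clock=time.monotonic):
    """Call fetch() `iterations` times, spaced `interval` seconds apart."""
    min_interval = 60 / RATE_LIMIT_PER_MINUTE
    if interval < min_interval:
        raise ValueError(f"interval must be at least {min_interval:.0f}s")
    results = []
    for i in range(iterations):
        started = clock()
        results.append(fetch())
        if i < iterations - 1:
            # Sleep for the rest of the interval, allowing for fetch time.
            elapsed = clock() - started
            sleep(max(0.0, interval - elapsed))
    return results


# Stub fetch and no-op sleep so the example finishes immediately;
# replace both with a real HTTP call and the default time.sleep.
print(poll(lambda: "metrics payload", iterations=2, sleep=lambda seconds: None))
```

Polling faster than once per minute is unlikely to help, since the data only changes when a new minute completes.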

## Available Metrics

For a complete list of available metrics including endpoint health, requests, tokens, and latency percentiles, see the [Available Metrics](/api-reference/metrics/retrieve-metrics#available-metrics) section in the API reference.

### Understanding Percentiles

Latency metrics are reported as an average plus several percentiles, selected with the `statistic` label:

* **avg** - Mean latency across all requests
* **p50** - Median latency (50th percentile)
* **p90** - 90% of requests complete faster than this
* **p95** - 95% of requests complete faster than this
* **p99** - 99% of requests complete faster than this

**Example:** If `ttft_seconds{statistic="p95"}` is `0.5`, then 95% of your requests receive their first token within 500 ms.
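
To see why the average and the percentiles can tell different stories, here is a small worked example. It uses the nearest-rank method on made-up samples; this is an assumption for illustration, and the method Cerebras uses to compute its percentiles may differ.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical time-to-first-token samples, in seconds.
ttft = [0.12, 0.15, 0.18, 0.20, 0.22, 0.25, 0.28, 0.30, 0.45, 1.50]

print("avg:", round(statistics.mean(ttft), 3))
print("p50:", percentile(ttft, 50))
print("p90:", percentile(ttft, 90))
print("p99:", percentile(ttft, 99))
```

In this sample, a single slow request (1.50 s) pulls the average well above the median, which is why tail percentiles such as p95 and p99 are usually more useful than `avg` for spotting outliers.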

## Example PromQL Queries

```promql theme={null}
# Average end-to-end latency
avg(e2e_latency_seconds{statistic="avg"})

# Request success rate
rate(requests_success_total[5m]) / rate(requests_count_total[5m])

# Token throughput (tokens per second)
rate(output_tokens_total[5m])

# Cache hit rate (assumes cache_rate is a gauge, so average it rather than applying rate())
avg_over_time(cache_rate[5m])

# P95 time to first token
ttft_seconds{statistic="p95"}
```
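
As a rough illustration of what the success-rate query above computes, the sketch below applies a simplified version of `rate()` to two counter readings. The readings are invented, and the calculation ignores counter resets and the window extrapolation that Prometheus performs.

```python
def per_second_rate(first, last):
    """Approximate PromQL rate(): counter increase divided by elapsed seconds.

    Each argument is a (timestamp_seconds, counter_value) pair.
    """
    (t0, v0), (t1, v1) = first, last
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    return (v1 - v0) / (t1 - t0)


# Hypothetical counter readings taken five minutes (300 s) apart.
success = [(0, 1000), (300, 1570)]
total = [(0, 1040), (300, 1640)]

success_rate = per_second_rate(*success) / per_second_rate(*total)
print(f"requests/s: {per_second_rate(*total):.2f}")
print(f"success rate: {success_rate:.1%}")
```

Because both rates cover the same window, the ratio is simply successful requests divided by all requests in that window.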

## Use Cases

### Performance Monitoring

Track latency statistics (avg, p50, p90, p95, p99) for each of the following:

* **Time to First Token (TTFT)** - Measure initial response latency
* **Time Per Output Token (TPOT)** - Monitor generation speed
* **End-to-end latency** - Track total request duration
* **Queue time** - Identify capacity constraints

### Usage Analytics

Monitor token consumption and request patterns:

* Track input and output tokens for cost analysis
* Measure cache hit rates with `cache_rate`
* Analyze request success and failure rates
* Monitor endpoint availability

### Alerting and SLAs

Set up alerts based on:

* Endpoint health status changes
* Latency threshold breaches
* Error rate spikes
* Request volume anomalies
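
In a Prometheus setup, alerting rules normally handle these checks. If you evaluate metrics in your own code, a latency threshold check might look like the sketch below. The `Threshold` class, the `find_breaches` function, and every limit shown are hypothetical; choose values that reflect your own SLAs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str      # metric name, e.g. "ttft_seconds"
    statistic: str   # value of the `statistic` label, e.g. "p95"
    limit: float     # alert when the observed value exceeds this


def find_breaches(observed, thresholds):
    """Return (threshold, value) pairs where the observed value exceeds the limit."""
    found = []
    for threshold in thresholds:
        value = observed.get((threshold.metric, threshold.statistic))
        if value is not None and value > threshold.limit:
            found.append((threshold, value))
    return found


# Hypothetical observations keyed by (metric, statistic).
observed = {
    ("ttft_seconds", "p95"): 0.62,
    ("e2e_latency_seconds", "p99"): 3.1,
}
thresholds = [
    Threshold("ttft_seconds", "p95", 0.5),
    Threshold("e2e_latency_seconds", "p99", 5.0),
]

for threshold, value in find_breaches(observed, thresholds):
    print(f"ALERT {threshold.metric} {threshold.statistic}={value} exceeds {threshold.limit}")
```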

## Troubleshooting

<AccordionGroup>
  <Accordion title="High Queue Times">
    If `queue_time_seconds` percentiles are elevated:

    1. Open the Analytics tab on cloud.cerebras.ai to see whether your dedicated endpoint is hitting its capacity limits
    2. Review request patterns for traffic spikes
    3. Contact your Cerebras representative about scaling options
  </Accordion>

  <Accordion title="Elevated Error Rates">
    If `requests_failure_total` is increasing:

    1. Check the `http_code` label to identify specific error types
    2. Review the [Error Codes](/support/error) documentation
    3. Verify authentication tokens are valid and not expired
    4. Check [Rate Limits](/support/rate-limits) if you are seeing 429 errors
  </Accordion>

  <Accordion title="Low Cache Hit Rates">
    If `cache_rate` is lower than expected:

    1. Verify [Prompt Caching](/capabilities/prompt-caching) is enabled
    2. Check that prompts have sufficient shared prefixes
    3. Review cache configuration with your Cerebras representative
  </Accordion>

  <Accordion title="Metrics Not Updating">
    If metrics appear stale:

    1. Verify your API key has correct permissions
    2. Check that metrics are enabled for your organization
    3. Ensure you're querying the correct `organization_id`
    4. Check for [error responses](/support/error) from the API
  </Accordion>
</AccordionGroup>
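
When investigating elevated error rates, it can help to total failures per `http_code`. The sketch below does this for samples that have already been parsed into `(name, labels, value)` tuples. The tuple shape, the `failures_by_code` function, and the sample values are assumptions made for this example.

```python
from collections import Counter


def failures_by_code(samples):
    """Sum requests_failure_total values per http_code label."""
    totals = Counter()
    for name, labels, value in samples:
        if name == "requests_failure_total":
            totals[labels.get("http_code", "unknown")] += value
    return totals


# Hypothetical parsed samples: (name, labels, value).
samples = [
    ("requests_failure_total", {"http_code": "429"}, 12.0),
    ("requests_failure_total", {"http_code": "500"}, 3.0),
    ("requests_failure_total", {"http_code": "429"}, 5.0),
    ("requests_success_total", {}, 980.0),
]

# Print the most frequent failure codes first.
for code, count in failures_by_code(samples).most_common():
    print(code, count)
```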
