# Retrieve metrics

> Retrieve operational metrics for your organization's inference endpoints in Prometheus format.

See the [Metrics](/capabilities/metrics) guide for more information.

## Path Parameters

<ParamField path="organization_id" type="string" required>
  The unique identifier for your organization (e.g., `org_abc123`)
</ParamField>

## Response

Returns metrics in the Prometheus text-based exposition format.

<RequestExample>
  ```bash cURL theme={null}
  curl -H "Authorization: Bearer $CEREBRAS_API_KEY" \
    https://cloud.cerebras.ai/api/v1/metrics/organizations/org_abc123
  ```

  ```python Python theme={null}
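  import os

  import requests

  # Fail early with a KeyError if the API key is not set,
  # rather than sending "Bearer None".
  api_key = os.environ["CEREBRAS_API_KEY"]

  response = requests.get(
      "https://cloud.cerebras.ai/api/v1/metrics/organizations/org_abc123",
      headers={"Authorization": f"Bearer {api_key}"},
      timeout=30,
  )
  # Raise on 4xx/5xx responses instead of printing an error body.
  response.raise_for_status()

  # The body is Prometheus text exposition format, not JSON.
  print(response.text)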
  ```

  ```javascript Node.js theme={null}
  const response = await fetch(
    'https://cloud.cerebras.ai/api/v1/metrics/organizations/org_abc123',
    {
      headers: {
        'Authorization': `Bearer ${process.env.CEREBRAS_API_KEY}`
      }
    }
  );

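  // fetch() resolves on HTTP errors too, so check the status explicitly.
  if (!response.ok) {
    throw new Error(`Metrics request failed with status ${response.status}`);
  }

  // The body is Prometheus text exposition format, not JSON.
  const metrics = await response.text();
  console.log(metrics);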
  ```
</RequestExample>

<ResponseExample>
  ```prometheus Response theme={null}
  # HELP inference_endpoint_status Status of inference endpoint (-1=error calculating status, 0=down, 1=up)
  # TYPE inference_endpoint_status gauge
  inference_endpoint_status{endpoint="model",organization_id="org_abc123"} 1.0

  # HELP requests_count_total Total request count (all HTTP codes) in the last complete minute
  # TYPE requests_count_total gauge
  requests_count_total{endpoint="model",organization_id="org_abc123"} 1.0

  # HELP requests_success_total Total successful requests (HTTP 200) in the last complete minute
  # TYPE requests_success_total gauge
  requests_success_total{endpoint="model",organization_id="org_abc123"} 1.0

  # HELP requests_failure_total Total failed requests by HTTP code in the last complete minute
  # TYPE requests_failure_total gauge
  requests_failure_total{endpoint="model",organization_id="org_abc123"} 0.0

  # HELP input_tokens_total Total input tokens for successful requests in the last complete minute
  # TYPE input_tokens_total gauge
  input_tokens_total{endpoint="model",organization_id="org_abc123"} 123456.0

  # HELP output_tokens_total Total output tokens for successful requests in the last complete minute
  # TYPE output_tokens_total gauge
  output_tokens_total{endpoint="model",organization_id="org_abc123"} 12345.0

  # HELP queue_time_seconds Queue time percentiles in seconds
  # TYPE queue_time_seconds gauge
  queue_time_seconds{endpoint="model",organization_id="org_abc123",percentile="avg"} 0.55
  queue_time_seconds{endpoint="model",organization_id="org_abc123",percentile="p50"} 0.55
  queue_time_seconds{endpoint="model",organization_id="org_abc123",percentile="p90"} 0.55
  queue_time_seconds{endpoint="model",organization_id="org_abc123",percentile="p95"} 0.55
  queue_time_seconds{endpoint="model",organization_id="org_abc123",percentile="p99"} 0.55

  # HELP e2e_latency_seconds End-to-end API latency percentiles in seconds
  # TYPE e2e_latency_seconds gauge
  e2e_latency_seconds{endpoint="model",organization_id="org_abc123",statistic="avg"} 0.9
  e2e_latency_seconds{endpoint="model",organization_id="org_abc123",statistic="p50"} 0.9
  e2e_latency_seconds{endpoint="model",organization_id="org_abc123",statistic="p90"} 0.9
  e2e_latency_seconds{endpoint="model",organization_id="org_abc123",statistic="p95"} 0.9
  e2e_latency_seconds{endpoint="model",organization_id="org_abc123",statistic="p99"} 0.9

  # HELP ttft_seconds Time To First Token percentiles in seconds
  # TYPE ttft_seconds gauge
  ttft_seconds{endpoint="model",organization_id="org_abc123",statistic="avg"} 0.9
  ttft_seconds{endpoint="model",organization_id="org_abc123",statistic="p50"} 0.9
  ttft_seconds{endpoint="model",organization_id="org_abc123",statistic="p90"} 0.9
  ttft_seconds{endpoint="model",organization_id="org_abc123",statistic="p95"} 0.9
  ttft_seconds{endpoint="model",organization_id="org_abc123",statistic="p99"} 0.9

  # HELP cache_reads_total Total input tokens read from cache for successful requests in the last complete minute
  # TYPE cache_reads_total gauge
  cache_reads_total{endpoint="model",organization_id="org_abc123"} 1234.0

  # HELP cache_rate Ratio of input tokens read from cache, to total input tokens, for successful requests in the last complete minute
  # TYPE cache_rate gauge
  cache_rate{endpoint="model",organization_id="org_abc123"} 0.01

  # HELP tpot Time Per Output Token (TPOT) percentiles
  # TYPE tpot gauge
  tpot{endpoint="model",organization_id="org_abc123",statistic="avg"} 0.0001
  tpot{endpoint="model",organization_id="org_abc123",statistic="p50"} 0.0001
  tpot{endpoint="model",organization_id="org_abc123",statistic="p90"} 0.0001
  tpot{endpoint="model",organization_id="org_abc123",statistic="p95"} 0.0001
  tpot{endpoint="model",organization_id="org_abc123",statistic="p99"} 0.0001

  # HELP latency_generation_seconds Completion time percentiles in seconds
  # TYPE latency_generation_seconds gauge
  latency_generation_seconds{endpoint="model",organization_id="org_abc123",statistic="avg"} 1.1
  latency_generation_seconds{endpoint="model",organization_id="org_abc123",statistic="p50"} 1.1
  latency_generation_seconds{endpoint="model",organization_id="org_abc123",statistic="p90"} 1.1
  latency_generation_seconds{endpoint="model",organization_id="org_abc123",statistic="p95"} 1.1
  latency_generation_seconds{endpoint="model",organization_id="org_abc123",statistic="p99"} 1.1
  ```
</ResponseExample>
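The exposition format is line-oriented: `# HELP` and `# TYPE` comment lines, followed by samples of the form `name{label="value",...} number`. As a rough sketch of how a client might turn the response body into structured samples, the following uses only the Python standard library. The `Sample` type and `parse_metrics` helper are hypothetical names chosen for illustration, not part of any Cerebras SDK. The parser is deliberately minimal: it ignores optional timestamps and does not unescape label values.

```python
import re
from typing import NamedTuple


class Sample(NamedTuple):
    """One metric sample: name, label set, and numeric value (hypothetical type)."""
    name: str
    labels: dict[str, str]
    value: float


# Matches `name{label="value",...} number`; the label block is optional.
_SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# Matches a single `key="value"` pair inside the label block.
_LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text: str) -> list[Sample]:
    """Parse Prometheus text exposition format into a list of samples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and # HELP / # TYPE comments.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue  # Not a sample line this sketch understands.
        name, label_block, value = match.groups()
        labels = dict(_LABEL_RE.findall(label_block or ""))
        samples.append(Sample(name, labels, float(value)))
    return samples


# A few lines taken from the sample response above.
example = """\
# HELP inference_endpoint_status Status of inference endpoint (-1=error calculating status, 0=down, 1=up)
# TYPE inference_endpoint_status gauge
inference_endpoint_status{endpoint="model",organization_id="org_abc123"} 1.0
ttft_seconds{endpoint="model",organization_id="org_abc123",statistic="p99"} 0.9
"""

for sample in parse_metrics(example):
    print(sample.name, sample.labels, sample.value)
```

For production use, a maintained parser or a Prometheus scrape job is likely a better fit than a hand-written regular expression.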

## Available Metrics

The following metrics are available on an opt-in basis. Contact your Cerebras account representative to enable specific metrics for your organization.

### Endpoint Health

<ResponseField name="inference_endpoint_status" type="gauge">
  Status of inference endpoint

  **Values:**

  * `-1` = Error calculating status
  * `0` = Down
  * `1` = Up
</ResponseField>
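A small sketch of how a client might translate the gauge value into a readable state, using the three values listed above. The `describe_status` helper is a hypothetical name, and treating any other value as unknown is an assumption of this example rather than documented behavior.

```python
# Values documented for inference_endpoint_status.
STATUS_LABELS = {-1: "error calculating status", 0: "down", 1: "up"}


def describe_status(value: float) -> str:
    """Map an inference_endpoint_status gauge value to a readable label."""
    # Float keys such as 1.0 compare and hash equal to the int keys above,
    # so the value from the exposition format can be looked up directly.
    return STATUS_LABELS.get(value, f"unknown ({value})")


print(describe_status(1.0))
print(describe_status(-1.0))
```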

### Request Metrics

<ResponseField name="requests_count_total" type="gauge">
  Total request count (all HTTP codes) in the last complete minute
</ResponseField>

<ResponseField name="requests_success_total" type="gauge">
  Total successful requests (HTTP 200) in the last complete minute
</ResponseField>

<ResponseField name="requests_failure_total" type="gauge">
  Total failed requests by HTTP code in the last complete minute
</ResponseField>
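Because all three gauges cover the same one-minute window, a per-minute success rate can be derived by dividing `requests_success_total` by `requests_count_total`. The sketch below assumes the values have already been extracted from the response; `success_rate` is a hypothetical helper, and returning `None` for a window with no traffic is a design choice of this example.

```python
from typing import Optional


def success_rate(success_total: float, count_total: float) -> Optional[float]:
    """Fraction of requests in the last complete minute that returned HTTP 200."""
    if count_total <= 0:
        return None  # No traffic in the window, so the ratio is undefined.
    return success_total / count_total


# Values from the sample response: one request, one success.
print(success_rate(1.0, 1.0))
```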

### Token Metrics

<ResponseField name="input_tokens_total" type="gauge">
  Total input tokens for successful requests in the last complete minute
</ResponseField>

<ResponseField name="output_tokens_total" type="gauge">
  Total output tokens for successful requests in the last complete minute
</ResponseField>

<ResponseField name="cache_reads_total" type="gauge">
  Total input tokens read from cache for successful requests in the last complete minute
</ResponseField>

<ResponseField name="cache_rate" type="gauge">
  Ratio of input tokens read from cache, to total input tokens, for successful requests in the last complete minute
</ResponseField>
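As a worked example of how these gauges relate, the sketch below recomputes the cache ratio from `cache_reads_total` and `input_tokens_total` using the values in the sample response, and derives the number of input tokens that were not served from cache. The `cache_rate` function is a hypothetical helper that mirrors the metric's description; it is not a statement of how the service computes the value internally.

```python
from typing import Optional


def cache_rate(cache_reads_total: float, input_tokens_total: float) -> Optional[float]:
    """Ratio of input tokens read from cache to total input tokens."""
    if input_tokens_total <= 0:
        return None  # No input tokens in the window, so the ratio is undefined.
    return cache_reads_total / input_tokens_total


# Values from the sample response.
cache_reads = 1234.0
input_tokens = 123456.0

rate = cache_rate(cache_reads, input_tokens)
uncached = input_tokens - cache_reads

# 1234 / 123456 is roughly 0.009995, which rounds to the 0.01 shown in the sample.
print(f"cache_rate={rate:.2f} uncached_input_tokens={uncached:.0f}")
```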

### Latency Metrics

<ResponseField name="queue_time_seconds" type="gauge">
  Queue time in seconds for successful requests (avg/p50/p90/p95/p99), that is, the time a request spends waiting for resources at runtime
</ResponseField>

<ResponseField name="e2e_latency_seconds" type="gauge">
  End-to-end API latency in seconds for successful requests (avg/p50/p90/p95/p99). Measured from when the API gateway receives the request to when the API gateway returns the response, inclusive of `latency_generation_seconds`.
</ResponseField>

<ResponseField name="ttft_seconds" type="gauge">
  Time To First Token percentiles in seconds for successful requests (avg/p50/p90/p95/p99)
</ResponseField>

<ResponseField name="tpot" type="gauge">
  Time per output token (TPOT) percentiles (avg/p50/p90/p95/p99), excluding time to first token, averaged across successful requests
</ResponseField>

<ResponseField name="latency_generation_seconds" type="gauge">
  Time to generate all output tokens, that is, the time from the last prompt to the last output token, in seconds for successful requests (avg/p50/p90/p95/p99)
</ResponseField>
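Each latency metric is emitted as one series per statistic. In the sample response, `queue_time_seconds` carries the statistic in a `percentile` label, while the other latency metrics use a `statistic` label. The sketch below shows one way a client might select a single statistic while tolerating either label name. `pick_statistic` is a hypothetical helper, and the input is assumed to be a list of `(labels, value)` pairs already parsed from the response.

```python
from typing import Optional


def pick_statistic(
    series: list[tuple[dict[str, str], float]], wanted: str
) -> Optional[float]:
    """Return the value whose statistic label equals `wanted`, or None if absent."""
    for labels, value in series:
        # Accept either label name seen in the sample response.
        stat = labels.get("statistic") or labels.get("percentile")
        if stat == wanted:
            return value
    return None


# Series shaped like the sample response (values copied from it).
queue_time_seconds = [
    ({"endpoint": "model", "percentile": "p50"}, 0.55),
    ({"endpoint": "model", "percentile": "p99"}, 0.55),
]
e2e_latency_seconds = [
    ({"endpoint": "model", "statistic": "p50"}, 0.9),
    ({"endpoint": "model", "statistic": "p99"}, 0.9),
]

print(pick_statistic(queue_time_seconds, "p99"))
print(pick_statistic(e2e_latency_seconds, "p99"))
```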

## Error Codes

For information about possible error responses, see the [Error Codes](/support/error) documentation.
