GET https://cloud.cerebras.ai/api/v1/metrics/organizations/{organization_id}
curl -H "Authorization: Bearer $CEREBRAS_API_KEY" \
  https://cloud.cerebras.ai/api/v1/metrics/organizations/org_abc123
# HELP inference_endpoint_status Status of inference endpoint (-1=error calculating status, 0=down, 1=up)
# TYPE inference_endpoint_status gauge
inference_endpoint_status{endpoint="model",organization_id="org_abc123"} 1.0

# HELP requests_count_total Total request count (all HTTP codes) in the last complete minute
# TYPE requests_count_total gauge
requests_count_total{endpoint="model",organization_id="org_abc123"} 1.0

# HELP requests_success_total Total successful requests (HTTP 200) in the last complete minute
# TYPE requests_success_total gauge
requests_success_total{endpoint="model",organization_id="org_abc123"} 1.0

# HELP requests_failure_total Total failed requests by HTTP code in the last complete minute
# TYPE requests_failure_total gauge
requests_failure_total{endpoint="model",organization_id="org_abc123"} 0.0

# HELP input_tokens_total Total input tokens for successful requests in the last complete minute
# TYPE input_tokens_total gauge
input_tokens_total{endpoint="model",organization_id="org_abc123"} 123456.0

# HELP output_tokens_total Total output tokens for successful requests in the last complete minute
# TYPE output_tokens_total gauge
output_tokens_total{endpoint="model",organization_id="org_abc123"} 12345.0

# HELP queue_time_seconds Queue time percentiles in seconds
# TYPE queue_time_seconds gauge
queue_time_seconds{endpoint="model",organization_id="org_abc123",percentile="avg"} 0.55
queue_time_seconds{endpoint="model",organization_id="org_abc123",percentile="p50"} 0.55
queue_time_seconds{endpoint="model",organization_id="org_abc123",percentile="p90"} 0.55
queue_time_seconds{endpoint="model",organization_id="org_abc123",percentile="p95"} 0.55
queue_time_seconds{endpoint="model",organization_id="org_abc123",percentile="p99"} 0.55

# HELP e2e_latency_seconds End-to-end API latency percentiles in seconds
# TYPE e2e_latency_seconds gauge
e2e_latency_seconds{endpoint="model",organization_id="org_abc123",statistic="avg"} 0.9
e2e_latency_seconds{endpoint="model",organization_id="org_abc123",statistic="p50"} 0.9
e2e_latency_seconds{endpoint="model",organization_id="org_abc123",statistic="p90"} 0.9
e2e_latency_seconds{endpoint="model",organization_id="org_abc123",statistic="p95"} 0.9
e2e_latency_seconds{endpoint="model",organization_id="org_abc123",statistic="p99"} 0.9

# HELP ttft_seconds Time To First Token percentiles in seconds
# TYPE ttft_seconds gauge
ttft_seconds{endpoint="model",organization_id="org_abc123",statistic="avg"} 0.9
ttft_seconds{endpoint="model",organization_id="org_abc123",statistic="p50"} 0.9
ttft_seconds{endpoint="model",organization_id="org_abc123",statistic="p90"} 0.9
ttft_seconds{endpoint="model",organization_id="org_abc123",statistic="p95"} 0.9
ttft_seconds{endpoint="model",organization_id="org_abc123",statistic="p99"} 0.9

# HELP cache_reads_total Total input tokens read from cache for successful requests in the last complete minute
# TYPE cache_reads_total gauge
cache_reads_total{endpoint="model",organization_id="org_abc123"} 1234.0

# HELP cache_rate Ratio of input tokens read from cache, to total input tokens, for successful requests in the last complete minute
# TYPE cache_rate gauge
cache_rate{endpoint="model",organization_id="org_abc123"} 0.01

# HELP tpot Time Per Output Token (TPOT) percentiles
# TYPE tpot gauge
tpot{endpoint="model",organization_id="org_abc123",statistic="avg"} 0.0001
tpot{endpoint="model",organization_id="org_abc123",statistic="p50"} 0.0001
tpot{endpoint="model",organization_id="org_abc123",statistic="p90"} 0.0001
tpot{endpoint="model",organization_id="org_abc123",statistic="p95"} 0.0001
tpot{endpoint="model",organization_id="org_abc123",statistic="p99"} 0.0001

# HELP latency_generation_seconds Completion time percentiles in seconds
# TYPE latency_generation_seconds gauge
latency_generation_seconds{endpoint="model",organization_id="org_abc123",statistic="avg"} 1.1
latency_generation_seconds{endpoint="model",organization_id="org_abc123",statistic="p50"} 1.1
latency_generation_seconds{endpoint="model",organization_id="org_abc123",statistic="p90"} 1.1
latency_generation_seconds{endpoint="model",organization_id="org_abc123",statistic="p95"} 1.1
latency_generation_seconds{endpoint="model",organization_id="org_abc123",statistic="p99"} 1.1
This feature is in Private Preview. For access or more information, contact us or reach out to your account representative.
See the Metrics guide for more info.

Path Parameters

organization_id
string
required
The unique identifier for your organization (e.g., org_abc123)

Response

Returns metrics in Prometheus text-based exposition format.
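The exposition format is line-oriented and easy to consume without a full Prometheus deployment. The sketch below (not official client code) fetches the endpoint with the standard library and parses samples into a `{metric_name: [(labels, value), ...]}` mapping; `org_abc123` is a placeholder organization ID and the regexes cover only the subset of the format shown in the response above.

```python
# Sketch: fetch and parse the metrics endpoint using only the stdlib.
# Assumes CEREBRAS_API_KEY is set; org_abc123 is a placeholder org ID.
import os
import re
import urllib.request

# One sample line: metric_name{label="v",...} value
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

def parse_metrics(text: str) -> dict:
    """Parse Prometheus exposition text into {name: [(labels, value), ...]}."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank/HELP/TYPE lines
            continue
        m = SAMPLE_RE.match(line)
        if not m:
            continue
        labels = dict(LABEL_RE.findall(m.group("labels") or ""))
        metrics.setdefault(m.group("name"), []).append(
            (labels, float(m.group("value"))))
    return metrics

def fetch_org_metrics(org_id: str) -> dict:
    """GET the per-organization metrics and return parsed samples."""
    req = urllib.request.Request(
        f"https://cloud.cerebras.ai/api/v1/metrics/organizations/{org_id}",
        headers={"Authorization": f"Bearer {os.environ['CEREBRAS_API_KEY']}"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return parse_metrics(resp.read().decode())
```

For production scraping, pointing a Prometheus server or the `prometheus_client` parser at the endpoint would be more robust than this hand-rolled regex.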

Available Metrics

The following metrics are available on an opt-in basis. Contact your Cerebras account representative to enable specific metrics for your organization.

Endpoint Health

inference_endpoint_status
gauge
Status of inference endpoint. Values:
  • -1 = Error calculating status
  • 0 = Down
  • 1 = Up

Request Metrics

requests_count_total
gauge
Total request count (all HTTP codes) in the last complete minute
requests_success_total
gauge
Total successful requests (HTTP 200) in the last complete minute
requests_failure_total
gauge
Total failed requests by HTTP code in the last complete minute
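Because the three request counters are reported per complete minute, a per-minute success rate can be derived directly from them. A minimal sketch (the helper name is ours; the guard for zero traffic is a design choice, not API behavior):

```python
def success_rate(requests_success_total: float,
                 requests_count_total: float) -> float:
    """Fraction of last-minute requests that returned HTTP 200."""
    if requests_count_total == 0:
        # No traffic in the window: report 1.0 rather than divide by zero.
        return 1.0
    return requests_success_total / requests_count_total
```

With the sample values above (1 request, 1 success, 0 failures), this yields 1.0.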

Token Metrics

input_tokens_total
gauge
Total input tokens for successful requests in the last complete minute
output_tokens_total
gauge
Total output tokens for successful requests in the last complete minute
cache_reads_total
gauge
Total input tokens read from cache for successful requests in the last complete minute
cache_rate
gauge
Ratio of input tokens read from cache, to total input tokens, for successful requests in the last complete minute
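The reported `cache_rate` should agree with the ratio of the two token counters. A quick consistency check using the sample values from the response above (`cache_reads_total` = 1234, `input_tokens_total` = 123456):

```python
def cache_rate_from_counters(cache_reads_total: float,
                             input_tokens_total: float) -> float:
    """Ratio of cached input tokens to total input tokens (0 when no input)."""
    if input_tokens_total == 0:
        return 0.0
    return cache_reads_total / input_tokens_total

# Sample values from the response body above:
rate = cache_rate_from_counters(1234.0, 123456.0)  # ~0.00999, shown as 0.01
```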

Latency Metrics

queue_time_seconds
gauge
Queue time percentiles in seconds for successful requests (avg/p50/p90/p95/p99), i.e. the time a request spends waiting for resources at runtime
e2e_latency_seconds
gauge
End-to-end API latency percentiles in seconds for successful requests (avg/p50/p90/p95/p99). Measures overall latency from when a request is received at the API gateway to when the response is returned by the gateway, inclusive of latency_generation_seconds.
ttft_seconds
gauge
Time To First Token percentiles in seconds for successful requests (avg/p50/p90/p95/p99)
tpot
gauge
Time Per Output Token (TPOT) percentiles (avg/p50/p90/p95/p99), excluding time to first token, averaged across successful requests
latency_generation_seconds
gauge
Time to generate all output tokens (i.e. time from the last prompt token to the last output token); percentiles in seconds for successful requests (avg/p50/p90/p95/p99)
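These latency metrics relate to each other in useful back-of-the-envelope ways: since `tpot` excludes the first token, the implied per-request decode throughput is roughly `1 / tpot`, and streaming `n` output tokens takes roughly `ttft + (n - 1) * tpot`. The sketch below illustrates this with the sample values from the response above; these are approximations we assume for illustration, not identities guaranteed by the API.

```python
# Sample values from the response body above (seconds):
TPOT_AVG = 0.0001  # time per output token, after the first
TTFT_AVG = 0.9     # time to first token

def implied_throughput(tpot: float) -> float:
    """Approximate output tokens per second implied by TPOT."""
    return 1.0 / tpot

def est_stream_time(n_tokens: int, ttft: float = TTFT_AVG,
                    tpot: float = TPOT_AVG) -> float:
    """Rough wall time to stream n output tokens for one request."""
    return ttft + (n_tokens - 1) * tpot
```

At the sample TPOT of 0.0001 s, the implied decode rate is about 10,000 tokens/second per request.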

Error Codes

For information about possible error responses, see the Error Codes documentation.