The Cerebras Inference API gives developers low-latency AI model inference powered by Cerebras Wafer-Scale Engines and CS-3 systems. We invite you to explore the new possibilities that our high-speed inference solution unlocks.

To get started with a free API key, click here.

The Cerebras Inference API currently provides access to the following models:

| Model Name | Model ID | Parameters | Speed (tokens/s) |
| --- | --- | --- | --- |
| Llama 4 Scout | `llama-4-scout-17b-16e-instruct` | 109 billion | ~2600 |
| Llama 3.1 8B | `llama3.1-8b` | 8 billion | ~2200 |
| Llama 3.3 70B | `llama-3.3-70b` | 70 billion | ~2100 |
| Qwen 3 32B | `qwen-3-32b` | 32 billion | ~2100 |
| DeepSeek R1 Distill Llama 70B* | `deepseek-r1-distill-llama-70b` | 70 billion | ~1700 |
* DeepSeek R1 Distill Llama 70B is available in private preview. Please contact us to request access.
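
To confirm at runtime which model IDs your key can access, you can query the models endpoint. The sketch below assumes the Python SDK exposes an OpenAI-compatible `client.models.list()` method returning a response with a `.data` list, and that `CEREBRAS_API_KEY` is set in your environment; check the SDK reference if your version differs.

```python
import os

from cerebras.cloud.sdk import Cerebras

client = Cerebras(api_key=os.environ.get("CEREBRAS_API_KEY"))

# List every model ID the API key can access.
# Assumes an OpenAI-compatible models endpoint on the SDK client.
for model in client.models.list().data:
    print(model.id)
```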

QuickStart Guide

Get started by building your first application using our QuickStart guide. The example below reads your API key from the `CEREBRAS_API_KEY` environment variable and sends a single chat completion request.

```python
import os

from cerebras.cloud.sdk import Cerebras

client = Cerebras(
    # Reads the key set in your environment.
    api_key=os.environ.get("CEREBRAS_API_KEY"),
)

chat_completion = client.chat.completions.create(
    messages=[
        {"role": "user", "content": "Why is fast inference important?"}
    ],
    model="llama-4-scout-17b-16e-instruct",
)

# Print the model's reply from the first (and only) choice.
print(chat_completion.choices[0].message.content)
```
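
Low-latency serving is most visible when streaming, since tokens can be rendered as they arrive. The sketch below assumes the chat completions endpoint accepts the OpenAI-style `stream=True` parameter and yields chunks exposing `choices[0].delta.content`; consult the API reference to confirm.

```python
import os

from cerebras.cloud.sdk import Cerebras

client = Cerebras(api_key=os.environ.get("CEREBRAS_API_KEY"))

# Request a streamed completion so tokens print as they are generated.
stream = client.chat.completions.create(
    messages=[{"role": "user", "content": "Why is fast inference important?"}],
    model="llama-4-scout-17b-16e-instruct",
    stream=True,  # assumed OpenAI-compatible streaming flag
)

# Each chunk carries an incremental delta; empty deltas are skipped.
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")
```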