The Cerebras Inference API offers developers a low-latency solution for AI model inference, powered by Cerebras Wafer-Scale Engines and CS-3 systems. We invite developers to explore the new possibilities that our high-speed inference solution unlocks.

The Cerebras Inference API currently provides access to the following models:

| Model Name | Model ID | Parameters | Speed (tokens/s) |
| --- | --- | --- | --- |
| Llama 4 Scout | llama-4-scout-17b-16e-instruct | 109 billion | ~2600 |
| Llama 3.1 8B | llama3.1-8b | 8 billion | ~2200 |
| Llama 3.3 70B | llama-3.3-70b | 70 billion | ~2100 |
| Qwen 3 32B* | qwen-3-32b | 32 billion | ~2100 |
| DeepSeek R1 Distill Llama 70B* | deepseek-r1-distill-llama-70b | 70 billion | ~1700 |
* Qwen 3 32B is a hybrid reasoning model that can operate in two modes: with or without thinking tokens. Currently, only the default reasoning mode is supported. For queries where you do not want reasoning, you can suggest that the model skip it by appending /no_think to the prompt, for example: "Write a python script to calculate the area of a circle /no_think". See the sketch after these notes.
* DeepSeek R1 Distill Llama 70B is available in private preview. Please contact us to request access.
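As a minimal sketch of the /no_think suggestion, here is a request against qwen-3-32b using the same Python SDK shown in the QuickStart below. It assumes your API key is set in the CEREBRAS_API_KEY environment variable; note that /no_think is a hint to the model, not a guarantee.

import os

from cerebras.cloud.sdk import Cerebras

client = Cerebras(api_key=os.environ.get("CEREBRAS_API_KEY"))

# Appending /no_think to the prompt suggests that Qwen 3 answer
# directly, without emitting thinking tokens first.
response = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Write a python script to calculate the area of a circle /no_think",
        },
    ],
    model="qwen-3-32b",
)

print(response.choices[0].message.content)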

QuickStart Guide

Get started by building your first application using our QuickStart guide.

import os

from cerebras.cloud.sdk import Cerebras

# The client reads your API key from the CEREBRAS_API_KEY environment variable.
client = Cerebras(
    api_key=os.environ.get("CEREBRAS_API_KEY"),
)

chat_completion = client.chat.completions.create(
    messages=[
        {"role": "user", "content": "Why is fast inference important?"},
    ],
    model="llama-4-scout-17b-16e-instruct",
)

# The assistant's reply is in the first choice of the response.
print(chat_completion.choices[0].message.content)