Release Notes
- Support for the `completions` endpoint (a brief usage sketch follows this list).
- Various bug fixes.
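As a quick illustration, here is a minimal sketch of calling the completions endpoint through an OpenAI-compatible client. The base URL and model name shown are assumptions to verify against the API reference.

```python
# Minimal sketch: calling the completions endpoint via an OpenAI-compatible
# client. Base URL and model name are assumptions; check the Cerebras docs.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.cerebras.ai/v1",  # assumed Cerebras endpoint
    api_key=os.environ["CEREBRAS_API_KEY"],
)

response = client.completions.create(
    model="llama3.1-8b",       # assumed model identifier
    prompt="Once upon a time,",
    max_tokens=64,
)
print(response.choices[0].text)
```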
- Performance Upgrade: This release introduces speculative decoding, a technique that pairs a small draft model with a large target model to generate responses more quickly (a simplified sketch follows this list). Llama 3.1 70B now achieves an average output speed of 2,100 tokens/sec. Note that with speculative decoding, output speeds may fluctuate by up to 20% around this average.
- Various bug fixes.
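To make the technique concrete, here is a simplified sketch of the draft-and-verify loop under greedy decoding. This is an illustration, not Cerebras' implementation: `draft_model` and `target_model` are hypothetical callables that each return one greedy next-token prediction, and a real system would verify all draft positions in a single batched forward pass rather than one call per position.

```python
def speculative_decode(prompt, draft_model, target_model, k=4, max_new=64):
    """Illustrative speculative decoding loop (greedy acceptance rule)."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        # 1. The cheap draft model proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_model(tokens + draft))
        # 2. The large target model predicts the token at each draft position
        #    (done here as k separate calls; real systems batch this step).
        verified = [target_model(tokens + draft[:i]) for i in range(k)]
        # 3. Accept the longest prefix on which draft and target agree.
        n_accepted = 0
        for d, v in zip(draft, verified):
            if d != v:
                break
            n_accepted += 1
        tokens += draft[:n_accepted]
        # 4. On a mismatch, keep the target model's corrected token, so every
        #    iteration makes progress even if no draft tokens were accepted.
        if n_accepted < k:
            tokens.append(verified[n_accepted])
    return tokens[:len(prompt) + max_new]
```

Because the accepted output is always what the target model would have produced on its own, the speedup depends on how often the draft model's guesses are accepted, which is why throughput varies from request to request.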
Continued Performance Improvements
We currently serve Llama 3.1 8B at ~2,000 tokens/sec and Llama 3.1 70B at ~560 tokens/sec.
Integration with AutoGen
Developers can now use the Cerebras Inference API with Microsoft AutoGen, an open-source framework for building AI agents. AutoGen streamlines the creation of advanced LLM applications by managing multi-agent conversations and optimizing workflows. With this integration, users can leverage features like tool use and parallel tool calling while benefiting from Cerebras' fast inference with the Llama 3.1 8B and 70B models. For an example, see the documentation.
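For illustration, below is a minimal sketch of pointing AutoGen at the Cerebras API through its OpenAI-compatible configuration. The config keys follow pyautogen conventions, and the model name and base URL are assumptions to check against the documentation.

```python
# Sketch: wiring an AutoGen agent to the Cerebras Inference API using an
# OpenAI-compatible configuration. Model name and base URL are assumptions.
import os

from autogen import ConversableAgent

llm_config = {
    "config_list": [
        {
            "model": "llama3.1-70b",                   # assumed model identifier
            "base_url": "https://api.cerebras.ai/v1",  # assumed endpoint
            "api_key": os.environ["CEREBRAS_API_KEY"],
        }
    ]
}

agent = ConversableAgent(name="assistant", llm_config=llm_config)
reply = agent.generate_reply(
    messages=[{"role": "user", "content": "Briefly explain speculative decoding."}]
)
print(reply)
```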
Other Updates:
- Users can now sign in to the developer playground using a magic link, without needing to set up and remember a password.
- The `max_tokens` parameter has been renamed to `max_completion_tokens` to maintain consistency with OpenAI's syntax (an example follows this list).
- We have updated our documentation to include a list of the available integrations for the Cerebras Inference SDK.
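For reference, a request using the renamed parameter might look like the sketch below; the SDK import path and model name are assumptions to verify against the SDK documentation.

```python
# Sketch: the renamed parameter in a chat completion request. The model name
# is an assumption; this field was previously called `max_tokens`.
from cerebras.cloud.sdk import Cerebras

client = Cerebras()  # assumed to read CEREBRAS_API_KEY from the environment

chat = client.chat.completions.create(
    model="llama3.1-8b",        # assumed model identifier
    messages=[{"role": "user", "content": "Hello!"}],
    max_completion_tokens=100,  # formerly max_tokens
)
print(chat.choices[0].message.content)
```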