The Cerebras Inference API offers developers a low-latency solution for AI model inference, powered by Cerebras Wafer-Scale Engines and CS-3 systems. We invite developers to explore the new possibilities that our high-speed inference solution unlocks.

The Cerebras Inference API currently provides access to models from Meta’s Llama family, including Llama 4 Scout and Llama 3.3 70B, as well as DeepSeek R1 Distill Llama 70B (available upon request).

| Model Name | Model ID | Parameters | Knowledge Cutoff |
| --- | --- | --- | --- |
| Llama 4 Scout | llama-4-scout-17b-16e-instruct | 109 billion | August 2024 |
| Llama 4 Maverick | Coming soon | 400 billion | August 2024 |
| Llama 3.1 8B | llama3.1-8b | 8 billion | March 2023 |
| Llama 3.3 70B | llama-3.3-70b | 70 billion | December 2023 |
| DeepSeek R1 Distill Llama 70B* | deepseek-r1-distill-llama-70b | 70 billion | December 2023 |
* DeepSeek R1 Distill Llama 70B is available in private preview. Please contact us to request access.
Our free tier supports a context length of 8,192 tokens. For all supported models, we also offer context lengths of up to 128K tokens upon request. To gain access, please contact us.
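To check which model IDs are available to your API key (for example, whether a private-preview request for DeepSeek R1 Distill Llama 70B has been granted), you can query the models endpoint. The following is a minimal sketch, assuming the SDK's OpenAI-compatible client.models.list() call and a CEREBRAS_API_KEY environment variable:

import os
from cerebras.cloud.sdk import Cerebras

# Assumes CEREBRAS_API_KEY is set in your environment.
client = Cerebras(api_key=os.environ.get("CEREBRAS_API_KEY"))

# List the models reachable with this API key; the IDs printed here
# correspond to the "Model ID" column in the table above.
# (client.models.list() is assumed to follow the OpenAI-compatible shape.)
for model in client.models.list().data:
    print(model.id)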

QuickStart Guide

Get started by building your first application using our QuickStart guide. The snippet below reads your API key from the CEREBRAS_API_KEY environment variable and sends a single chat completion request.

import os

from cerebras.cloud.sdk import Cerebras

# The client reads your API key from the CEREBRAS_API_KEY environment variable.
client = Cerebras(
    api_key=os.environ.get("CEREBRAS_API_KEY"),
)

# Send a chat completion request to Llama 4 Scout.
chat_completion = client.chat.completions.create(
    messages=[
        {"role": "user", "content": "Why is fast inference important?"},
    ],
    model="llama-4-scout-17b-16e-instruct",
)

# Print the assistant's reply.
print(chat_completion.choices[0].message.content)
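For latency-sensitive applications, you will usually want to display tokens as they arrive rather than waiting for the full reply. The sketch below reuses the client from the QuickStart and assumes the SDK supports the OpenAI-style stream=True flag and delta chunks:

# Stream the reply token by token instead of waiting for the full response.
stream = client.chat.completions.create(
    messages=[{"role": "user", "content": "Why is fast inference important?"}],
    model="llama-4-scout-17b-16e-instruct",
    stream=True,  # assumed OpenAI-compatible streaming flag
)

for chunk in stream:
    # Each chunk carries an incremental piece of the assistant's reply.
    print(chunk.choices[0].delta.content or "", end="")
print()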
