The Cerebras API offers developers a low-latency solution for AI model inference powered by Cerebras Wafer-Scale Engines and CS-3 systems. We invite developers to explore the new possibilities that our high-speed inference solution unlocks.

Currently, the Cerebras API provides access to Meta’s Llama 3.1 8B and Llama 3.3 70B models, as well as DeepSeek R1 Distill Llama 70B (available upon request). All models are instruction-tuned and can be used for conversational applications.

| Model Name | Model ID | Parameters | Knowledge Cutoff | Context Window (tokens) |
| --- | --- | --- | --- | --- |
| Llama 3.1 8B | llama3.1-8b | 8 billion | March 2023 | 8192 |
| Llama 3.3 70B | llama-3.3-70b | 70 billion | December 2023 | 8192 |
| DeepSeek R1 Distill Llama 70B* | deepseek-r1-distill-llama-70b | 70 billion | December 2023 | 8192 |
*DeepSeek R1 Distill Llama 70B is available upon request. Please contact us to request access.
Due to high demand in our early launch phase, we are temporarily limiting Llama 3.1 and 3.3 models to a context window of 8192 tokens in our Free Tier. If your use case or application would benefit from longer context windows, please let us know!
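If you are building a multi-turn chat on the Free Tier, one way to stay inside the 8192-token window is to trim the oldest turns from the conversation history before each request. The sketch below is illustrative only: estimate_tokens and trim_history are hypothetical helpers, not part of the Cerebras SDK, and the 4-characters-per-token estimate is a rough stand-in for the model's actual tokenizer.

```python
# Hypothetical helpers for keeping chat history inside the Free Tier's
# 8192-token context window. Not part of the Cerebras SDK.
MAX_CONTEXT_TOKENS = 8192

def estimate_tokens(text: str) -> int:
    # Crude approximation (~4 characters per token); use a real tokenizer
    # in production.
    return len(text) // 4

def trim_history(messages: list[dict], budget: int = MAX_CONTEXT_TOKENS) -> list[dict]:
    # Drop the oldest turns until the estimated total fits the budget.
    # A real application would likely pin the system prompt in place.
    trimmed = list(messages)
    while len(trimmed) > 1 and sum(estimate_tokens(m["content"]) for m in trimmed) > budget:
        trimmed.pop(0)
    return trimmed
```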

## QuickStart Guide

Get started by building your first application using our QuickStart guide.

```python
import os
from cerebras.cloud.sdk import Cerebras

# The client reads your API key from the CEREBRAS_API_KEY environment variable.
client = Cerebras(
    api_key=os.environ.get("CEREBRAS_API_KEY"),
)

chat_completion = client.chat.completions.create(
    messages=[
        {"role": "user", "content": "Why is fast inference important?"}
    ],
    model="llama3.1-8b",
)

# Print the model's reply.
print(chat_completion.choices[0].message.content)
```
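Once the basic request works, a natural next step is streaming, which lets you render tokens as they arrive rather than waiting for the full reply. This sketch assumes the SDK follows the OpenAI-style stream=True pattern with incremental delta chunks:

```python
import os
from cerebras.cloud.sdk import Cerebras

client = Cerebras(api_key=os.environ.get("CEREBRAS_API_KEY"))

# Request a streamed response instead of a single completed message.
stream = client.chat.completions.create(
    messages=[{"role": "user", "content": "Why is fast inference important?"}],
    model="llama3.1-8b",
    stream=True,
)

for chunk in stream:
    # Each chunk carries an incremental slice of the reply; the content
    # field may be None on the final chunk, so fall back to "".
    print(chunk.choices[0].delta.content or "", end="")
print()
```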
