The Cerebras API offers developers a low-latency solution for AI model inference, powered by Cerebras Wafer-Scale Engines and CS-3 systems. We invite developers to explore the new possibilities that our high-speed inference solution unlocks.

Currently, the Cerebras API provides access to two models: Meta's Llama 3.1 8B and 70B. Both are instruction-tuned and well suited to conversational applications.

Llama 3.1 8B

  • Model ID: llama3.1-8b

  • Parameters: 8 billion

  • Knowledge cutoff: March 2023

  • Context Length: 8192 tokens

  • Training Tokens: 15 trillion

Llama 3.1 70B

  • Model ID: llama3.1-70b

  • Parameters: 70 billion

  • Knowledge cutoff: December 2023

  • Context Length: 8192 tokens

  • Training Tokens: 15 trillion

Due to high demand in our early launch phase, we are temporarily limiting Llama 3.1 models to a context window of 8192 tokens in our Free Tier. If your use case or application would benefit from longer context windows, please let us know!
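If a long conversation approaches the 8192-token window, one common workaround is to drop the oldest turns before sending the request. Below is a minimal sketch of that idea; the `trim_history` helper and the rough four-characters-per-token estimate are illustrative assumptions (a real tokenizer would be more accurate), not part of the Cerebras SDK.

```python
def trim_history(messages, max_tokens=8192, chars_per_token=4):
    """Drop the oldest messages until a rough token estimate fits the window.

    Assumes ~4 characters per token, which is a crude heuristic; always
    keeps at least the most recent message so the request stays valid.
    """
    def estimate(msgs):
        return sum(len(m["content"]) // chars_per_token + 1 for m in msgs)

    trimmed = list(messages)
    while len(trimmed) > 1 and estimate(trimmed) > max_tokens:
        trimmed.pop(0)  # discard the oldest turn first
    return trimmed
```

The trimmed list can then be passed as the `messages` argument of a chat completion request.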

QuickStart Guide

Get started by building your first application using our QuickStart guide.

import os

from cerebras.cloud.sdk import Cerebras

client = Cerebras(
    # Reads your key from the CEREBRAS_API_KEY environment variable
    api_key=os.environ.get("CEREBRAS_API_KEY"),
)

chat_completion = client.chat.completions.create(
    messages=[
        {"role": "user", "content": "Why is fast inference important?"},
    ],
    model="llama3.1-8b",
)

print(chat_completion.choices[0].message.content)
