Currently, we support the Chat Completion endpoint via the Hugging Face Python client. To get started, follow the steps below. Learn more on the Hugging Face website.
Step 1: Install the Hugging Face Hub client

pip install huggingface_hub --upgrade
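
If you want to confirm the installation succeeded, you can print the installed version. This is an optional check, not part of the official steps; the provider argument used below requires a reasonably recent release of huggingface_hub.

python -c "import huggingface_hub; print(huggingface_hub.__version__)"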
Step 2: Create a new Hugging Face API key

Next, you’ll need to create a new Hugging Face API key. You’ll use this key to authenticate with Hugging Face and access the Cerebras provider.
  1. Go to hf.co/settings/tokens
  2. Click “New token”
  3. Give it a name and copy your API key
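
Rather than hard-coding the key in your script, you may prefer to export it as an environment variable and read it at runtime. The sketch below assumes the key is stored in a variable named HF_TOKEN (a common convention for Hugging Face tooling; adapt the name to your setup):

import os

from huggingface_hub import InferenceClient

# Assumes you previously ran: export HF_TOKEN=hf_your_api_key_here
client = InferenceClient(
    provider="cerebras",
    api_key=os.environ["HF_TOKEN"],
)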
Step 3: Make an API call

Here’s an example using InferenceClient to query Llama 3.3 70B Instruct on Cerebras. Be sure to replace "hf_your_api_key_here" with your actual API key.
from huggingface_hub import InferenceClient

client = InferenceClient(
    provider="cerebras",
    api_key="hf_your_api_key_here",
)

completion = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ],
    max_tokens=500,
)

print(completion.choices[0].message.content)
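
If you’d rather receive tokens as they are generated instead of waiting for the full response, chat.completions.create also accepts a stream flag, mirroring the OpenAI-style client. The snippet below is a minimal sketch of that pattern, reusing the same model and provider as Step 3:

from huggingface_hub import InferenceClient

client = InferenceClient(
    provider="cerebras",
    api_key="hf_your_api_key_here",
)

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ],
    max_tokens=500,
    stream=True,  # yield incremental chunks instead of one final message
)

for chunk in stream:
    # Each chunk carries the newly generated text in choices[0].delta.content
    print(chunk.choices[0].delta.content or "", end="")
print()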

FAQ

Why is latency higher than when calling Cerebras Cloud directly?
Since inference is routed through Hugging Face’s proxy, you may experience slightly higher latency compared to calling Cerebras Cloud directly.

Why does the official Hugging Face inference example return an error?
The official Hugging Face inference example uses a multimodal input, which Cerebras does not currently support. To avoid the error, use the code provided in Step 3 of the tutorial above.