This guide walks you step-by-step through using the Hugging Face InferenceClient to run inference on Cerebras hardware. Hugging Face acts as our “pay-as-you-go” provider; we currently offer the Llama 3.3 models and Llama 4 Scout through it.

Currently, we support the Chat Completion endpoint via the Hugging Face Python client. To get started, follow the steps below.

1. Install the Hugging Face Hub client

pip install huggingface_hub --upgrade
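
To confirm the installation, you can print the installed package version; the chat completion interface used below requires a reasonably recent release of huggingface_hub:

python -c "import huggingface_hub; print(huggingface_hub.__version__)"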
2. Create a new Hugging Face API key

Next, you’ll need to create a new Hugging Face API key. You’ll use this key to authenticate with Hugging Face and access the Cerebras provider.

  1. Go to hf.co/settings/tokens
  2. Click “New token”
  3. Give it a name and copy your API key
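
Rather than pasting the key directly into your source code, you can keep it in an environment variable and read it at runtime. The snippet below is a minimal sketch that assumes you have exported the key as HF_TOKEN (e.g. export HF_TOKEN="hf_your_api_key_here"):

import os
from huggingface_hub import InferenceClient

# Read the key from the environment instead of hardcoding it.
# Assumes: export HF_TOKEN="hf_your_api_key_here"
client = InferenceClient(
    provider="cerebras",
    api_key=os.environ["HF_TOKEN"],
)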
3. Make an API call

Here’s an example using InferenceClient to query Llama 3.3 70B on Cerebras.

Be sure to replace "hf_your_api_key_here" with your actual API key.

from huggingface_hub import InferenceClient

# Route requests through Hugging Face to the Cerebras provider.
client = InferenceClient(
    provider="cerebras",
    api_key="hf_your_api_key_here",  # replace with your Hugging Face API key
)

# Send a chat completion request to Llama 3.3 70B Instruct.
completion = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ],
    max_tokens=500,  # cap the length of the generated reply
)

# The response follows the OpenAI-style schema; print the model's reply.
print(completion.choices[0].message.content)
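
The same endpoint also supports token streaming, which is useful for displaying output as it is generated. A minimal sketch, reusing the client from the example above:

# Stream the reply incrementally instead of waiting for the full response.
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ],
    max_tokens=500,
    stream=True,
)

for chunk in stream:
    # Each chunk carries an incremental piece of the reply in choices[0].delta.
    print(chunk.choices[0].delta.content or "", end="", flush=True)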

Differences Between Cerebras Cloud and Hugging Face

Cerebras Cloud is primarily intended for free-tier users and for high-throughput startups that need a dedicated plan to handle their inference. Hugging Face, by contrast, acts as our “pay-as-you-go” provider, currently offering the Llama 3.3 models and Llama 4 Scout.

DeepSeek R1-distilled-70B can only be accessed on Cerebras Cloud with a paid plan.
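
For reference, a request against Cerebras Cloud goes through the Cerebras SDK rather than the Hugging Face client. The sketch below assumes the cerebras_cloud_sdk package, a CEREBRAS_API_KEY environment variable, and a deepseek-r1-distill-llama-70b model identifier; check the Cerebras Cloud docs for the exact model name available on your plan.

import os
from cerebras.cloud.sdk import Cerebras

# Cerebras Cloud uses its own SDK and API key, separate from Hugging Face.
client = Cerebras(api_key=os.environ["CEREBRAS_API_KEY"])

completion = client.chat.completions.create(
    # Assumed model identifier; confirm the exact name in the Cerebras Cloud docs.
    model="deepseek-r1-distill-llama-70b",
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ],
)

print(completion.choices[0].message.content)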

FAQ