Cerebras Inference on Hugging Face
Learn how to use Cerebras Inference on Hugging Face.
This guide walks you step by step through using the Hugging Face InferenceClient to run inference on Cerebras hardware. We currently offer Llama 3.3 models and Llama 4 Scout through Hugging Face.
Currently, we support the Chat Completion endpoint via the Hugging Face Python client. To get started, follow the steps below.
Install the Hugging Face Hub client
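The client ships in the huggingface_hub package; a typical install looks like this (pip shown here, though any Python package manager works):

```shell
pip install --upgrade huggingface_hub
```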
Create a new Hugging Face API key
Next, you’ll need to create a new Hugging Face API key. You’ll use this key to authenticate with Hugging Face and access the Cerebras provider.
- Go to hf.co/settings/tokens
- Click “New token”
- Give it a name and copy your API key
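Once you have the key, a common pattern (optional, not part of the steps above) is to read it from an environment variable instead of hard-coding it; the `HF_TOKEN` name below is a convention, not a requirement:

```python
import os

# Assumes the key was exported in your shell beforehand, e.g.:
#   export HF_TOKEN="hf_your_api_key_here"
# Falls back to the placeholder so the snippet runs either way.
api_key = os.environ.get("HF_TOKEN", "hf_your_api_key_here")
```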
Make an API call
Here’s an example using InferenceClient to query Llama 3.3 70B on Cerebras. Be sure to replace "hf_your_api_key_here" with your actual API key.
Differences Between Cerebras Cloud and Hugging Face
Cerebras Cloud is primarily intended for free tier users and high-throughput startups that need a dedicated plan to handle their inference. Hugging Face acts as our "pay-as-you-go" provider. We currently offer Llama 3.3 models and Llama 4 Scout through Hugging Face.
DeepSeek R1 Distill Llama 70B can only be accessed on Cerebras Cloud with a paid plan.
FAQ
What context length can I run?
Although the Hugging Face model card may mention support for up to 10 million tokens, Cerebras currently supports a maximum context length of 16K tokens.
What additional latency can I expect when using Cerebras through Hugging Face?
Since inference is routed through Hugging Face’s proxy, users may experience slightly higher latency compared to calling Cerebras Cloud directly.
Why do I see “Wrong API Format” when running the Hugging Face test code?
The official Hugging Face inference example uses a multimodal input call, which is not currently supported by Cerebras. To avoid this error, use the code provided in Step 3 of the tutorial above.
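To illustrate the difference, here is a sketch of the two message shapes; the multimodal form below mirrors the OpenAI-style content-parts format used in Hugging Face's default example:

```python
# Supported: "content" is a plain text string.
text_only = [
    {"role": "user", "content": "Describe Cerebras hardware in one sentence."},
]

# Not currently supported through Cerebras: "content" as a list of
# typed parts (e.g. image_url blocks), as in Hugging Face's default example.
multimodal = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe this image in one sentence."},
        {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
    ],
}]

print(type(text_only[0]["content"]).__name__)   # str
print(type(multimodal[0]["content"]).__name__)  # list
```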