What is Hugging Face Inference Providers?

Hugging Face Inference Providers is a unified API that gives you access to multiple AI inference providers, including Cerebras, through a single interface. This means you can use familiar Hugging Face tools and SDKs to access Cerebras’s world-class inference speed without changing your existing code structure. Key features include:
  • Unified API - Use the same code structure across multiple providers
  • Simple Integration - Works with OpenAI SDK, Hugging Face Hub client, and standard HTTP requests
  • Model Discovery - Browse all available Cerebras models through Hugging Face’s model hub
  • Flexible Authentication - Use your Hugging Face token to access Cerebras inference
Learn more in the Hugging Face Inference Providers documentation.

Prerequisites

Before you begin, ensure you have:
  • Hugging Face Account - Create a free account at huggingface.co
  • Hugging Face API Token - Generate a token at hf.co/settings/tokens
  • Python 3.7 or higher - Required for running the Python examples
You can authenticate with a Hugging Face token, or use your Cerebras API key directly. Get one here.

Getting Started

Step 1: Install the required dependencies

You can use either the Hugging Face Hub client or the OpenAI SDK to access Cerebras through Hugging Face. Install your preferred client:
pip install huggingface_hub --upgrade
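If you prefer the OpenAI SDK, install the openai package instead; either client works with the Inference Providers API:
pip install openai --upgrade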
Step 2: Set up your API token

Create a .env file in your project directory to store your Hugging Face token securely:
HF_TOKEN=hf_your_token_here
Alternatively, you can set it as an environment variable:
export HF_TOKEN=hf_your_token_here
Your Hugging Face token authenticates you with the Inference Providers API, which then routes your requests to Cerebras’s infrastructure.
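Before making inference calls, you can verify the token with the Hub client's whoami helper, which returns your account details if the token is valid:
import os
from huggingface_hub import whoami

# Raises an error if the token is missing, expired, or lacks read access
print(whoami(token=os.getenv("HF_TOKEN")))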
Step 3: Make your first inference request

Now you’re ready to make your first request to Cerebras through Hugging Face. Here’s how to use the chat completion endpoint with different clients:
import os
from huggingface_hub import InferenceClient

# Initialize the client with Cerebras provider
client = InferenceClient(
    provider="cerebras",
    api_key=os.getenv("HF_TOKEN"),
)

# Make a chat completion request
completion = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ],
    max_tokens=500,
)

print(completion.choices[0].message.content)
When using the OpenAI SDK, append :cerebras to the model name to specify Cerebras as the provider. With the Hugging Face Hub client, set provider="cerebras" instead.
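For comparison, here is a minimal sketch of the same request with the OpenAI SDK routed through Hugging Face. It assumes the router's OpenAI-compatible base URL, https://router.huggingface.co/v1; check the Inference Providers documentation if your setup differs:
import os
from openai import OpenAI

# Point the OpenAI client at the Hugging Face router (assumed base URL)
client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=os.getenv("HF_TOKEN"),
)

# Append :cerebras to the model name to route the request to Cerebras
completion = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct:cerebras",
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ],
    max_tokens=500,
)

print(completion.choices[0].message.content)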
Step 4: Try streaming responses

For real-time applications, you can stream responses token-by-token as they’re generated. This is especially useful for chat interfaces and interactive applications where you want to display results as they arrive:
import os
from huggingface_hub import InferenceClient

client = InferenceClient(
    provider="cerebras",
    api_key=os.getenv("HF_TOKEN"),
)

# Enable streaming with stream=True
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[
        {"role": "user", "content": "Write a short poem about AI."}
    ],
    max_tokens=500,
    stream=True,
)

# Process the stream
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
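If you also want the full text after streaming finishes, a common variation of the loop above accumulates the chunks into a string (illustrative sketch):
# Alternative loop: print tokens as they arrive and keep the full response
full_response = ""
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
        full_response += delta
print()  # newline once the stream ends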

Available Cerebras Models

View all available models on the Hugging Face model hub.
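You can also query the model list programmatically. The sketch below assumes the hub's /api/models endpoint accepts the same inference_provider filter as the web page linked above:
import requests

# List models currently served by the Cerebras provider (assumed API filter)
response = requests.get(
    "https://huggingface.co/api/models",
    params={"inference_provider": "cerebras", "limit": 50},
)
response.raise_for_status()
for model in response.json():
    print(model["id"])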

Advanced Usage

Using Custom Parameters

You can customize your requests with additional parameters supported by the Cerebras API to control response generation:
import os
from huggingface_hub import InferenceClient

client = InferenceClient(
    provider="cerebras",
    api_key=os.getenv("HF_TOKEN"),
)

completion = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[
        {"role": "user", "content": "Explain quantum computing."}
    ],
    max_tokens=1000,
    temperature=0.7,  # Controls randomness; lower values are more deterministic
    top_p=0.9,        # Nucleus sampling threshold
    seed=42,          # For reproducible outputs
)

print(completion.choices[0].message.content)

Error Handling

Implement proper error handling to manage API errors gracefully in production applications:
import os
from huggingface_hub import InferenceClient
from huggingface_hub.errors import HfHubHTTPError

client = InferenceClient(
    provider="cerebras",
    api_key=os.getenv("HF_TOKEN"),
)

try:
    completion = client.chat.completions.create(
        model="meta-llama/Llama-3.3-70B-Instruct",
        messages=[
            {"role": "user", "content": "Hello!"}
        ],
        max_tokens=500,
    )
    print(completion.choices[0].message.content)
except HfHubHTTPError as e:
    print(f"API Error: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")

Using Environment Variables

For better security and configuration management, load environment variables using python-dotenv:
import os
from dotenv import load_dotenv
from huggingface_hub import InferenceClient

# Load environment variables from .env file
load_dotenv()

# Verify token is loaded
token = os.getenv("HF_TOKEN")
if not token:
    raise ValueError("HF_TOKEN not found in environment variables")

client = InferenceClient(
    provider="cerebras",
    api_key=token,
)

completion = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[
        {"role": "user", "content": "Hello!"}
    ],
    max_tokens=500,
)

print(completion.choices[0].message.content)

FAQ

Does routing through Hugging Face add latency?

Since inference is routed through Hugging Face’s proxy, you may see slightly higher latency than calling Cerebras Cloud directly. The overhead is typically minimal (10-50 ms), but for applications that require the absolute lowest latency, consider using the Cerebras API directly. However, Hugging Face Inference Providers offers benefits like:
  • Unified API across multiple providers
  • Simplified authentication with Hugging Face tokens
  • Integration with Hugging Face’s ecosystem and tools
  • Easy provider switching without code changes

Does Cerebras support multimodal inputs through Hugging Face?

The official Hugging Face inference example uses a multimodal input call, which is not currently supported by Cerebras. Cerebras currently supports:
  • Text-based chat completions
  • Standard message formats with role and content
  • Streaming responses
  • Common parameters (temperature, top_p, max_tokens, etc.)
Multimodal inputs (images, audio, etc.) are not yet supported.

How do I specify the model name?

When using the Hugging Face Hub client, use the full model name from Hugging Face:
model="meta-llama/Llama-3.3-70B-Instruct"
When using the OpenAI SDK through Hugging Face router, append :cerebras to specify the provider:
model="meta-llama/Llama-3.3-70B-Instruct:cerebras"
You can find all available models at huggingface.co/models?inference_provider=cerebras.

Why am I getting authentication errors?

Make sure you’re using a valid Hugging Face token with the correct permissions. You can generate a new token at hf.co/settings/tokens. The token should have at least read access. If you’re using environment variables, ensure they’re properly loaded:
import os
from dotenv import load_dotenv

load_dotenv()  # Load .env file
token = os.getenv("HF_TOKEN")
if not token:
    raise ValueError("HF_TOKEN not found in environment variables")
Common issues:
  • Token not set in environment variables
  • Token has expired or been revoked
  • Token doesn’t have necessary permissions
  • Typo in token value

Can I use Hugging Face Inference Providers in production?

Yes! Hugging Face Inference Providers is production-ready and used by many applications. However, consider these factors:
  • Latency: The routing layer adds minimal overhead, but direct API calls to Cerebras will be slightly faster
  • Rate Limits: Check Hugging Face’s rate limits for your account tier
  • Monitoring: Implement proper logging and error handling for production use
  • Reliability: Both Hugging Face and Cerebras maintain high uptime SLAs
  • Costs: Review pricing for both Hugging Face and Cerebras services
For mission-critical applications requiring the absolute lowest latency, consider using the Cerebras API directly. For applications that benefit from a unified API across multiple providers, Hugging Face Inference Providers is an excellent choice.

Should I use the Hugging Face Hub client or the OpenAI SDK?

Both clients work well with Cerebras through Hugging Face, but there are some differences:

Hugging Face Hub Client:
  • Native integration with Hugging Face ecosystem
  • Set provider explicitly with provider="cerebras"
  • Use standard Hugging Face model names
  • Better integration with Hugging Face datasets and tools
OpenAI SDK:
  • Familiar interface if you’re already using OpenAI
  • Append :cerebras to model names
  • Easy migration from OpenAI to Cerebras
  • Compatible with OpenAI-style tooling
Choose based on your existing codebase and preferences. Both provide the same underlying functionality and performance.