What is Hugging Face Inference Providers?

Hugging Face Inference Providers is a unified API that gives you access to multiple AI inference providers, including Cerebras, through a single interface. This means you can use familiar Hugging Face tools and SDKs to access Cerebras’s world-class inference speed without changing your existing code structure. Key features include:
  • Unified API - Use the same code structure across multiple providers
  • Simple Integration - Works with OpenAI SDK, Hugging Face Hub client, and standard HTTP requests
  • Model Discovery - Browse all available Cerebras models through Hugging Face’s model hub
  • Flexible Authentication - Use your Hugging Face token to access Cerebras inference
Learn more in the Hugging Face Inference Providers documentation.

Prerequisites

Before you begin, ensure you have:
  • Hugging Face Account - Create a free account at huggingface.co
  • Hugging Face API Token - Generate a token at hf.co/settings/tokens
  • Python 3.7 or higher - Required for running the Python examples
You can authenticate with a Hugging Face token, or use your Cerebras API key directly. Get one here.

Getting Started

Step 1: Install the required dependencies

You can use either the Hugging Face Hub client or the OpenAI SDK to access Cerebras through Hugging Face. Install your preferred client:
pip install huggingface_hub --upgrade
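If you prefer the OpenAI SDK, install the openai package instead; either client works with the Inference Providers API:
pip install openai --upgrade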
Step 2: Set up your API token

Create a .env file in your project directory to store your Hugging Face token securely:
HF_TOKEN=hf_your_token_here
Alternatively, you can set it as an environment variable:
export HF_TOKEN=hf_your_token_here
Your Hugging Face token authenticates you with the Inference Providers API, which then routes your requests to Cerebras’s infrastructure.
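Before making inference calls, you can verify the token with the Hub client's whoami helper, which returns your account details if the token is valid:
import os
from huggingface_hub import whoami

# Raises an error if the token is missing, expired, or lacks read access
print(whoami(token=os.getenv("HF_TOKEN")))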
Step 3: Make your first inference request

Now you’re ready to make your first request to Cerebras through Hugging Face. Here’s how to use the chat completion endpoint with different clients:
import os
from huggingface_hub import InferenceClient

# Initialize the client with Cerebras provider
client = InferenceClient(
    provider="cerebras",
    api_key=os.getenv("HF_TOKEN"),
)

# Make a chat completion request
completion = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ],
    max_tokens=500,
)

print(completion.choices[0].message.content)
When using the OpenAI SDK, append :cerebras to the model name to specify Cerebras as the provider. With the Hugging Face Hub client, set provider="cerebras" instead.
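For comparison, here is a minimal sketch of the same request with the OpenAI SDK routed through Hugging Face. It assumes the router's OpenAI-compatible base URL, https://router.huggingface.co/v1; check the Inference Providers documentation if your setup differs:
import os
from openai import OpenAI

# Point the OpenAI client at the Hugging Face router (assumed base URL)
client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=os.getenv("HF_TOKEN"),
)

# Append :cerebras to the model name to route the request to Cerebras
completion = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct:cerebras",
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ],
    max_tokens=500,
)

print(completion.choices[0].message.content)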
Step 4: Try streaming responses

For real-time applications, you can stream responses token-by-token as they’re generated. This is especially useful for chat interfaces and interactive applications where you want to display results as they arrive:
import os
from huggingface_hub import InferenceClient

client = InferenceClient(
    provider="cerebras",
    api_key=os.getenv("HF_TOKEN"),
)

# Enable streaming with stream=True
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[
        {"role": "user", "content": "Write a short poem about AI."}
    ],
    max_tokens=500,
    stream=True,
)

# Process the stream
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
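If you also want the full text after streaming finishes, a common variation of the loop above accumulates the chunks into a string (illustrative sketch):
# Alternative loop: print tokens as they arrive and keep the full response
full_response = ""
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
        full_response += delta
print()  # newline once the stream ends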

Available Cerebras Models

View all available models on the Hugging Face model hub.
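You can also query the model list programmatically. The sketch below assumes the hub's /api/models endpoint accepts the same inference_provider filter as the web page linked above:
import requests

# List models currently served by the Cerebras provider (assumed API filter)
response = requests.get(
    "https://huggingface.co/api/models",
    params={"inference_provider": "cerebras", "limit": 50},
)
response.raise_for_status()
for model in response.json():
    print(model["id"])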

Advanced Usage

Using Custom Parameters

You can customize your requests with additional parameters supported by the Cerebras API to control response generation:
import os
from huggingface_hub import InferenceClient

client = InferenceClient(
    provider="cerebras",
    api_key=os.getenv("HF_TOKEN"),
)

completion = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[
        {"role": "user", "content": "Explain quantum computing."}
    ],
    max_tokens=1000,
    temperature=0.7,  # Controls randomness; lower values are more deterministic
    top_p=0.9,        # Nucleus sampling threshold
    seed=42,          # For reproducible outputs
)

print(completion.choices[0].message.content)

Error Handling

Implement proper error handling to manage API errors gracefully in production applications:
import os
from huggingface_hub import InferenceClient
from huggingface_hub.errors import HfHubHTTPError

client = InferenceClient(
    provider="cerebras",
    api_key=os.getenv("HF_TOKEN"),
)

try:
    completion = client.chat.completions.create(
        model="meta-llama/Llama-3.3-70B-Instruct",
        messages=[
            {"role": "user", "content": "Hello!"}
        ],
        max_tokens=500,
    )
    print(completion.choices[0].message.content)
except HfHubHTTPError as e:
    print(f"API Error: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")

Using Environment Variables

For better security and configuration management, load environment variables using python-dotenv:
import os
from dotenv import load_dotenv
from huggingface_hub import InferenceClient

# Load environment variables from .env file
load_dotenv()

# Verify token is loaded
token = os.getenv("HF_TOKEN")
if not token:
    raise ValueError("HF_TOKEN not found in environment variables")

client = InferenceClient(
    provider="cerebras",
    api_key=token,
)

completion = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[
        {"role": "user", "content": "Hello!"}
    ],
    max_tokens=500,
)

print(completion.choices[0].message.content)

FAQ

Does routing through Hugging Face add latency?

Since inference is routed through Hugging Face’s proxy, you may see slightly higher latency than calling Cerebras Cloud directly. The overhead is typically minimal (10-50 ms), but for applications that require the absolute lowest latency, consider using the Cerebras API directly. However, Hugging Face Inference Providers offers benefits like:
  • Unified API across multiple providers
  • Simplified authentication with Hugging Face tokens
  • Integration with Hugging Face’s ecosystem and tools
  • Easy provider switching without code changes

Does Cerebras support multimodal inputs through Hugging Face?

The official Hugging Face inference example uses a multimodal input call, which is not currently supported by Cerebras. Cerebras currently supports:
  • Text-based chat completions
  • Standard message formats with role and content
  • Streaming responses
  • Common parameters (temperature, top_p, max_tokens, etc.)
Multimodal inputs (images, audio, etc.) are not yet supported.

How do I specify the model name?

When using the Hugging Face Hub client, use the full model name from Hugging Face:
model="meta-llama/Llama-3.3-70B-Instruct"
When using the OpenAI SDK through Hugging Face router, append :cerebras to specify the provider:
model="meta-llama/Llama-3.3-70B-Instruct:cerebras"
You can find all available models at huggingface.co/models?inference_provider=cerebras.

Why am I getting authentication errors?

Make sure you’re using a valid Hugging Face token with the correct permissions. You can generate a new token at hf.co/settings/tokens. The token should have at least read access. If you’re using environment variables, ensure they’re properly loaded:
import os
from dotenv import load_dotenv

load_dotenv()  # Load .env file
token = os.getenv("HF_TOKEN")
if not token:
    raise ValueError("HF_TOKEN not found in environment variables")
Common issues:
  • Token not set in environment variables
  • Token has expired or been revoked
  • Token doesn’t have necessary permissions
  • Typo in token value

Can I use Hugging Face Inference Providers in production?

Yes! Hugging Face Inference Providers is production-ready and used by many applications. However, consider these factors:
  • Latency: The routing layer adds minimal overhead, but direct API calls to Cerebras will be slightly faster
  • Rate Limits: Check Hugging Face’s rate limits for your account tier
  • Monitoring: Implement proper logging and error handling for production use
  • Reliability: Both Hugging Face and Cerebras maintain high uptime SLAs
  • Costs: Review pricing for both Hugging Face and Cerebras services
For mission-critical applications requiring the absolute lowest latency, consider using the Cerebras API directly. For applications that benefit from a unified API across multiple providers, Hugging Face Inference Providers is an excellent choice.

Should I use the Hugging Face Hub client or the OpenAI SDK?

Both clients work well with Cerebras through Hugging Face, but there are some differences:

Hugging Face Hub Client:
  • Native integration with Hugging Face ecosystem
  • Set provider explicitly with provider="cerebras"
  • Use standard Hugging Face model names
  • Better integration with Hugging Face datasets and tools
OpenAI SDK:
  • Familiar interface if you’re already using OpenAI
  • Append :cerebras to model names
  • Easy migration from OpenAI to Cerebras
  • Compatible with OpenAI-style tooling
Choose based on your existing codebase and preferences. Both provide the same underlying functionality and performance.