
What is LiteLLM?

LiteLLM is a lightweight Python package that provides a unified interface for calling 100+ LLM providers (OpenAI, Azure, Anthropic, Cohere, Cerebras, Replicate, PaLM, and more) using the OpenAI format. With LiteLLM, you can easily switch between different LLM providers, including Cerebras, without changing your code structure.

Prerequisites

Before you begin, ensure you have:
  • Cerebras API Key - Get a free API key here.
  • Python 3.7 or higher - LiteLLM requires a modern Python environment.
  • pip package manager - For installing the LiteLLM library.

Configure LiteLLM

1. Install LiteLLM

Install the LiteLLM package using pip:
pip install litellm
This installs LiteLLM and all of its dependencies, including the OpenAI SDK, which LiteLLM uses under the hood.

2. Set up your environment variables

Create a .env file in your project directory to securely store your API key:
CEREBRAS_API_KEY=your-cerebras-api-key-here
Alternatively, you can export the environment variable in your terminal:
export CEREBRAS_API_KEY=your-cerebras-api-key-here
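If you use the .env approach, make sure the file is actually loaded into the process environment before your first LiteLLM call. A minimal sketch, assuming the python-dotenv package is available in your environment:

# Load CEREBRAS_API_KEY from a local .env file into os.environ
# Assumes: pip install python-dotenv
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the current working directory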

3. Make your first request with LiteLLM

LiteLLM provides a simple completion() function that works across all providers. Here’s how to call Cerebras models:
from litellm import completion

# CEREBRAS_API_KEY must already be set in your environment (see step 2)

# Make a completion request to Cerebras
response = completion(
    model="cerebras/llama-3.3-70b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ],
    api_base="https://api.cerebras.ai/v1",
    custom_llm_provider="cerebras",
    extra_headers={
        "X-Cerebras-3rd-Party-Integration": "litellm"
    }
)

print(response.choices[0].message.content)
The cerebras/ prefix tells LiteLLM to route the request to Cerebras, and the integration header ensures proper tracking and support.
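The explicit api_base and custom_llm_provider arguments above make the routing visible, but LiteLLM also ships built-in Cerebras support, so the provider prefix alone is often enough. A minimal sketch, assuming your installed LiteLLM version resolves the cerebras/ prefix to the default Cerebras endpoint:

from litellm import completion

# Minimal call: the cerebras/ prefix selects the provider and its default endpoint.
# CEREBRAS_API_KEY must still be set in your environment.
response = completion(
    model="cerebras/llama-3.3-70b",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)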

4. Use streaming responses

LiteLLM supports streaming responses, which is useful for real-time applications where you want to display tokens as they’re generated:
from litellm import completion

# CEREBRAS_API_KEY must already be set in your environment (see step 2)

# Enable streaming with stream=True
response = completion(
    model="cerebras/llama-3.3-70b",
    messages=[
        {"role": "user", "content": "Write a short poem about artificial intelligence."}
    ],
    api_base="https://api.cerebras.ai/v1",
    custom_llm_provider="cerebras",
    stream=True,
    extra_headers={
        "X-Cerebras-3rd-Party-Integration": "litellm"
    }
)

# Process the stream
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
Streaming is particularly powerful with Cerebras’s fast inference speeds, allowing you to deliver near-instantaneous responses to your users.

5. Try different Cerebras models

Cerebras offers several high-performance models optimized for different use cases. Here’s how to use them with LiteLLM:
from litellm import completion

# CEREBRAS_API_KEY must already be set in your environment (see step 2)

# Try Llama 3.1 8B for faster responses
response_8b = completion(
    model="cerebras/llama3.1-8b",
    messages=[{"role": "user", "content": "Hello!"}],
    api_base="https://api.cerebras.ai/v1",
    custom_llm_provider="cerebras",
    extra_headers={"X-Cerebras-3rd-Party-Integration": "litellm"}
)

# Try Qwen 3 32B for balanced performance
response_32b = completion(
    model="cerebras/qwen-3-32b",
    messages=[{"role": "user", "content": "Explain quantum computing."}],
    api_base="https://api.cerebras.ai/v1",
    custom_llm_provider="cerebras",
    extra_headers={"X-Cerebras-3rd-Party-Integration": "litellm"}
)

# Try Llama 3.3 70B for most capable responses
response_70b = completion(
    model="cerebras/llama-3.3-70b",
    messages=[{"role": "user", "content": "Write a detailed analysis."}],
    api_base="https://api.cerebras.ai/v1",
    custom_llm_provider="cerebras",
    extra_headers={"X-Cerebras-3rd-Party-Integration": "litellm"}
)
Choose the model that best fits your latency, cost, and capability requirements.

Advanced Features

Using LiteLLM’s Router for Load Balancing

LiteLLM’s Router allows you to load balance across multiple models or providers, including Cerebras. This is useful for distributing traffic and implementing fallback strategies:
from litellm import Router

# CEREBRAS_API_KEY must already be set in your environment

# Define model configurations
model_list = [
    {
        "model_name": "cerebras-llama",
        "litellm_params": {
            "model": "cerebras/llama-3.3-70b",
            "api_base": "https://api.cerebras.ai/v1",
            "custom_llm_provider": "cerebras",
            "extra_headers": {"X-Cerebras-3rd-Party-Integration": "litellm"}
        }
    },
    {
        "model_name": "cerebras-qwen",
        "litellm_params": {
            "model": "cerebras/qwen-3-32b",
            "api_base": "https://api.cerebras.ai/v1",
            "custom_llm_provider": "cerebras",
            "extra_headers": {"X-Cerebras-3rd-Party-Integration": "litellm"}
        }
    }
]

# Initialize router
router = Router(model_list=model_list)

# Make a request - router will automatically load balance
response = router.completion(
    model="cerebras-llama",
    messages=[{"role": "user", "content": "Hello!"}]
)

print(response.choices[0].message.content)
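The two deployments above have different model_name values, so the Router treats them as separate targets rather than interchangeable copies. To make it fail over between them automatically, you can pass a fallbacks mapping when constructing the Router. A sketch, reusing the model_list defined above:

# Send "cerebras-llama" traffic to "cerebras-qwen" if the primary deployment errors out
router = Router(
    model_list=model_list,
    fallbacks=[{"cerebras-llama": ["cerebras-qwen"]}],
    num_retries=2,  # retry a deployment before falling back
)

response = router.completion(
    model="cerebras-llama",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)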

Fallback and Retry Logic

LiteLLM supports automatic fallbacks between providers, which is useful for building resilient applications:
from litellm import completion

# CEREBRAS_API_KEY must already be set in your environment

# Define fallback models
response = completion(
    model="cerebras/llama-3.3-70b",
    messages=[{"role": "user", "content": "Hello!"}],
    api_base="https://api.cerebras.ai/v1",
    custom_llm_provider="cerebras",
    fallbacks=["cerebras/llama3.1-8b", "cerebras/qwen-3-32b"],
    extra_headers={"X-Cerebras-3rd-Party-Integration": "litellm"}
)
If the primary model fails or is unavailable, LiteLLM will automatically try the fallback models in order.

Cost Tracking and Budgets

LiteLLM includes built-in cost tracking to help you monitor your API usage:
from litellm import completion, completion_cost

# CEREBRAS_API_KEY must already be set in your environment

response = completion(
    model="cerebras/llama-3.3-70b",
    messages=[{"role": "user", "content": "Hello!"}],
    api_base="https://api.cerebras.ai/v1",
    custom_llm_provider="cerebras",
    extra_headers={"X-Cerebras-3rd-Party-Integration": "litellm"}
)

# Calculate the cost of this completion
cost = completion_cost(completion_response=response)
print(f"Cost: ${cost}")

Why Use LiteLLM with Cerebras?

  • Unified Interface: LiteLLM provides a consistent API across 100+ providers, making it easy to experiment with different models or migrate between providers without rewriting code.
  • Production-Ready Features: Built-in support for retries, fallbacks, load balancing, and cost tracking.
  • Observability: Integrate with logging and monitoring tools to track your LLM usage and performance.
  • Speed Meets Flexibility: Combine Cerebras’s industry-leading inference speed with LiteLLM’s flexible routing and management capabilities.

FAQ

Can I use LiteLLM Proxy to route Cerebras requests through a central gateway?
Yes! LiteLLM Proxy allows you to create a centralized gateway for all your LLM requests. You can configure Cerebras as one of your providers in the proxy configuration file:
model_list:
  - model_name: cerebras-llama
    litellm_params:
      model: cerebras/llama-3.3-70b
      api_base: https://api.cerebras.ai/v1
      custom_llm_provider: cerebras
      extra_headers:
        X-Cerebras-3rd-Party-Integration: litellm
Learn more in the LiteLLM Proxy documentation.
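Once the proxy is running, clients call it with the standard OpenAI SDK using the model_name from the config. A sketch, assuming the proxy is running locally on its default port 4000 with a hypothetical virtual key:

from openai import OpenAI

# Point the OpenAI client at the LiteLLM Proxy instead of api.openai.com
client = OpenAI(
    base_url="http://localhost:4000",   # assumed local proxy address
    api_key="sk-litellm-virtual-key",   # hypothetical proxy key, not your Cerebras key
)

response = client.chat.completions.create(
    model="cerebras-llama",  # matches model_name in the proxy config above
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)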

How does LiteLLM handle rate limits and transient errors?
LiteLLM provides built-in retry logic with exponential backoff. You can configure this behavior:
from litellm import completion

response = completion(
    model="cerebras/llama-3.3-70b",
    messages=[{"role": "user", "content": "Hello!"}],
    api_base="https://api.cerebras.ai/v1",
    custom_llm_provider="cerebras",
    num_retries=3,
    extra_headers={"X-Cerebras-3rd-Party-Integration": "litellm"}
)
You can also use the Router to distribute load across multiple API keys or models.
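For the multiple-API-key case, give several deployments the same model_name and a different api_key each; the Router then spreads requests across them. A sketch using hypothetical environment variable names for the extra keys:

import os
from litellm import Router

# Two deployments of the same model, each with its own (hypothetical) key
router = Router(model_list=[
    {
        "model_name": "cerebras-llama",
        "litellm_params": {
            "model": "cerebras/llama-3.3-70b",
            "api_key": os.getenv("CEREBRAS_API_KEY_TEAM_A"),  # assumed env var
        },
    },
    {
        "model_name": "cerebras-llama",
        "litellm_params": {
            "model": "cerebras/llama-3.3-70b",
            "api_key": os.getenv("CEREBRAS_API_KEY_TEAM_B"),  # assumed env var
        },
    },
])

response = router.completion(
    model="cerebras-llama",
    messages=[{"role": "user", "content": "Hello!"}]
)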

Does LiteLLM support asynchronous requests to Cerebras?
Yes! LiteLLM supports async operations using acompletion():
import asyncio
from litellm import acompletion

# CEREBRAS_API_KEY must already be set in your environment

async def main():
    response = await acompletion(
        model="cerebras/llama-3.3-70b",
        messages=[{"role": "user", "content": "Hello!"}],
        api_base="https://api.cerebras.ai/v1",
        custom_llm_provider="cerebras",
        extra_headers={"X-Cerebras-3rd-Party-Integration": "litellm"}
    )
    print(response.choices[0].message.content)

asyncio.run(main())
This is particularly useful for building high-performance applications that need to handle multiple concurrent requests.
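To actually run requests concurrently, schedule the coroutines together, for example with asyncio.gather. A short sketch building on the example above:

import asyncio
from litellm import acompletion

# CEREBRAS_API_KEY must be set in your environment

async def ask(prompt: str) -> str:
    response = await acompletion(
        model="cerebras/llama-3.3-70b",
        messages=[{"role": "user", "content": prompt}],
        api_base="https://api.cerebras.ai/v1",
        custom_llm_provider="cerebras",
        extra_headers={"X-Cerebras-3rd-Party-Integration": "litellm"}
    )
    return response.choices[0].message.content

async def main():
    # Fire three prompts at once and wait for all of them
    answers = await asyncio.gather(
        ask("What is the capital of France?"),
        ask("What is the capital of Japan?"),
        ask("What is the capital of Brazil?"),
    )
    for answer in answers:
        print(answer)

asyncio.run(main())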

Troubleshooting

Authentication Errors

If you encounter authentication errors, verify that:
  • Your CEREBRAS_API_KEY environment variable is set correctly (a quick check is sketched after this list)
  • The API key is valid and hasn’t expired
  • You’re using the correct api_base URL: https://api.cerebras.ai/v1
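A quick way to confirm the key is visible to your Python process, as a minimal sketch:

import os

# Fail fast if the key is missing from the current environment
key = os.getenv("CEREBRAS_API_KEY")
if not key:
    raise RuntimeError("CEREBRAS_API_KEY is not set in this environment")
print(f"Key found, ending in ...{key[-4:]}")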

Model Not Found Errors

Ensure you’re using the correct model name format:
  • Use cerebras/llama-3.3-70b (with the cerebras/ prefix)
  • Check the available models to confirm the model name
  • Note that model names are case-sensitive

Rate Limiting

If you hit rate limits:
  • Implement exponential backoff using LiteLLM’s built-in retry logic with num_retries
  • Consider using the Router for load balancing across multiple API keys
