Prerequisites
Before you begin, ensure you have:
- Cerebras API Key - Get a free API key here
- Python 3.10 or higher - LlamaIndex requires Python 3.10+
- Basic familiarity with LlamaIndex - Review the LlamaIndex documentation if you’re new to the framework
Configure LlamaIndex with Cerebras
1. Install required dependencies
Install the LlamaIndex core package and the Cerebras integration, which we’ll use to connect to Cerebras:
LlamaIndex provides a dedicated Cerebras integration that handles all the API configuration automatically.
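Assuming you’re installing from PyPI, the install step might look like this (llama-index-llms-cerebras is the integration package name at the time of writing; verify it against the LlamaIndex docs):

```shell
pip install llama-index llama-index-llms-cerebras
```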
2. Configure environment variables
Create a .env file in your project directory to store your API key securely. This keeps your credentials safe and makes it easy to switch between development and production environments.
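A minimal .env file for this setup might contain just the key (the value shown is a placeholder, not a real key):

```shell
CEREBRAS_API_KEY=your-key-here
```

If you load it with a helper such as python-dotenv, call load_dotenv() before reading the variable.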
3. Initialize the Cerebras LLM
Set up LlamaIndex to use Cerebras as your LLM provider by initializing the Cerebras LLM and registering it as the default. This configuration tells LlamaIndex to route all LLM calls through Cerebras, giving you access to ultra-fast inference speeds.
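A minimal sketch, assuming the dedicated integration from llama-index-llms-cerebras and the environment variable configured in the previous step:

```python
import os

from llama_index.core import Settings
from llama_index.llms.cerebras import Cerebras

# Read the key loaded from .env in the previous step
llm = Cerebras(
    model="llama-3.3-70b",
    api_key=os.getenv("CEREBRAS_API_KEY"),
)

# Register Cerebras as the default LLM for all LlamaIndex components
Settings.llm = llm
```

Setting Settings.llm once means indexes, query engines, and chat engines pick up Cerebras automatically without per-component configuration.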
4. Make your first query
Test the integration with a simple query to verify everything is working correctly. You should see a fast, coherent response explaining LlamaIndex. The speed difference compared to other providers will be immediately noticeable!
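A simple smoke test might look like this (it assumes the Cerebras LLM from the setup step and a valid CEREBRAS_API_KEY in your environment):

```python
import os

from llama_index.llms.cerebras import Cerebras

llm = Cerebras(model="llama-3.3-70b", api_key=os.getenv("CEREBRAS_API_KEY"))

# complete() sends a single prompt and returns a CompletionResponse
response = llm.complete("What is LlamaIndex?")
print(response.text)
```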
Streaming Responses
Cerebras’s ultra-fast inference makes streaming particularly impressive. Here’s how to stream responses in LlamaIndex:
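A streaming sketch using the LLM’s stream_complete() method (prompt text and model name are illustrative):

```python
import os

from llama_index.llms.cerebras import Cerebras

llm = Cerebras(model="llama-3.3-70b", api_key=os.getenv("CEREBRAS_API_KEY"))

# stream_complete() yields chunks as tokens arrive; each chunk's
# .delta holds only the newly generated text
for chunk in llm.stream_complete("Explain retrieval-augmented generation in two sentences."):
    print(chunk.delta, end="", flush=True)
```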
Advanced: Custom Query Pipeline
For more control over your RAG pipeline, you can build a custom query pipeline with Cerebras: create a retriever with index.as_retriever(), then assemble a QueryPipeline from your desired modules, such as retrievers and summarizers.
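One possible sketch of such a pipeline, assuming an existing VectorStoreIndex named index and the Cerebras llm from the setup step (module names and links follow LlamaIndex’s QueryPipeline API; the query string is illustrative):

```python
from llama_index.core.query_pipeline import InputComponent, QueryPipeline
from llama_index.core.response_synthesizers import TreeSummarize

# Retrieve the top-3 most similar nodes for each query
retriever = index.as_retriever(similarity_top_k=3)

# Summarize the retrieved nodes with the Cerebras LLM
summarizer = TreeSummarize(llm=llm)

pipeline = QueryPipeline()
pipeline.add_modules(
    {"input": InputComponent(), "retriever": retriever, "summarizer": summarizer}
)
# The query goes to both the retriever and the summarizer;
# the retrieved nodes feed the summarizer's `nodes` input
pipeline.add_link("input", "retriever")
pipeline.add_link("input", "summarizer", dest_key="query_str")
pipeline.add_link("retriever", "summarizer", dest_key="nodes")

response = pipeline.run(input="What does the document say about pricing?")
print(response)
```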
Model Selection Guide
Choose the right Cerebras model for your use case:
- llama-3.3-70b - Best for complex reasoning, long-form content, and tasks requiring deep understanding
- qwen-3-32b - Balanced performance for general-purpose applications
- llama3.1-8b - Fastest option for simple tasks and high-throughput scenarios
- gpt-oss-120b - Large open-weight model for highly demanding tasks
- zai-glm-4.6 - Advanced 357B parameter model with strong reasoning capabilities
Next Steps
Now that you have LlamaIndex working with Cerebras, explore these advanced features:
- Build a chatbot - Create conversational AI with memory using LlamaIndex’s chat engines
- Add structured outputs - Use Pydantic models for type-safe responses
- Implement agents - Build autonomous agents with LlamaIndex’s agent framework
- Optimize embeddings - Explore different embedding models for better retrieval
- Try different models - Experiment with Cerebras’s model lineup to find the best fit for your use case
- Try the latest model - GLM4.6 migration guide
Troubleshooting
Why am I getting 'Invalid API key' errors?
Make sure you’re using your Cerebras API key, not an OpenAI key. Double-check that:
- Your .env file contains CEREBRAS_API_KEY=your-key-here
- You’re loading the environment variable correctly with os.getenv("CEREBRAS_API_KEY")
- The API key is active in your Cerebras dashboard
Can I use LlamaIndex's async features with Cerebras?
Yes! Cerebras fully supports async operations. Use await llm.achat() and await llm.astream_chat() for async calls:
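A short async sketch, assuming the Cerebras llm from the setup step (prompts are illustrative):

```python
import asyncio
import os

from llama_index.core.llms import ChatMessage
from llama_index.llms.cerebras import Cerebras

llm = Cerebras(model="llama-3.3-70b", api_key=os.getenv("CEREBRAS_API_KEY"))


async def main() -> None:
    # Non-streaming async chat
    resp = await llm.achat([ChatMessage(role="user", content="Hello!")])
    print(resp.message.content)

    # Streaming async chat: awaiting astream_chat() returns an async generator
    stream = await llm.astream_chat([ChatMessage(role="user", content="Tell me a joke.")])
    async for chunk in stream:
        print(chunk.delta, end="", flush=True)


asyncio.run(main())
```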
How do I handle rate limits?
Cerebras has generous rate limits, but if you’re building a high-traffic application, consider:
- Implementing exponential backoff for retries
- Using LlamaIndex’s built-in retry logic
- Caching responses for common queries
- Contacting Cerebras support for enterprise rate limits
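The exponential-backoff strategy above can be sketched with the standard library alone; in a real application you would catch your client’s specific rate-limit exception rather than a bare Exception, and tune the delays to your traffic:

```python
import random
import time


def with_backoff(call, max_retries=5, base_delay=1.0):
    """Retry `call` with exponential backoff plus jitter.

    The delay doubles on each attempt (base, 2*base, 4*base, ...),
    with a small random jitter to avoid synchronized retries.
    """
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error to the caller
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)
```

You would wrap each LLM call, e.g. with_backoff(lambda: llm.complete(prompt)).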
Why are my embeddings not working?
Cerebras currently provides LLM inference, not embedding models. For embeddings in your RAG pipeline, use a separate embedding provider, such as HuggingFace embeddings, by setting Settings.embed_model to a HuggingFaceEmbedding instance. This allows you to use Cerebras for generation while using specialized embedding models for retrieval.
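A sketch of this split setup, assuming the llama-index-embeddings-huggingface package is installed (the embedding model name is a common default, not a Cerebras recommendation):

```python
import os

from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.cerebras import Cerebras

# Cerebras handles generation; a local HuggingFace model handles embeddings
Settings.llm = Cerebras(model="llama-3.3-70b", api_key=os.getenv("CEREBRAS_API_KEY"))
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
```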
