Prerequisites
Before you begin, ensure you have:
- Cerebras API Key - Get a free API key here
- Milvus Instance - Either install Milvus locally or use Zilliz Cloud (managed Milvus)
- Python 3.8 or higher
- Basic understanding of vector databases and RAG concepts
Configure Milvus with Cerebras
1. Create a virtual environment
First, create and activate a virtual environment to keep your dependencies isolated:
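The commands for this step were not included above; assuming a Unix-like shell and a standard CPython install, creating and activating a virtual environment typically looks like:

```shell
# Create an isolated environment in ./venv
python3 -m venv venv

# Activate it (macOS/Linux)
source venv/bin/activate

# On Windows (PowerShell) use instead:
# venv\Scripts\Activate.ps1
```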
2. Install required dependencies
Install the necessary packages for working with Milvus and Cerebras. These packages provide:
- openai - For connecting to the Cerebras API (OpenAI-compatible)
- pymilvus - The Python SDK for the Milvus vector database
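A minimal install command, assuming pip inside the virtual environment (python-dotenv is an optional extra for loading the .env file created in the next step):

```shell
pip install openai pymilvus python-dotenv
```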
3. Configure environment variables
Create a .env file in your project directory to store your API credentials. If you’re using Milvus locally with Docker, the default URI is http://localhost:19530. For Zilliz Cloud, you’ll receive a URI and token when you create your cluster.
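For example, a .env for a local Docker setup might look like the following (the variable names here are assumptions; match whatever names your code reads):

```bash
# Hypothetical variable names - adjust to match your code
CEREBRAS_API_KEY=your-cerebras-api-key
MILVUS_URI=http://localhost:19530
MILVUS_TOKEN=your-zilliz-token   # only needed for Zilliz Cloud
```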
4. Initialize the Cerebras client
Set up the Cerebras client using the OpenAI SDK. This client handles all interactions with Cerebras models in your RAG pipeline, generating chat completions from retrieved context:
5. Connect to Milvus
Establish a connection to your Milvus instance to begin storing and retrieving vectors:
6. Create a collection for your embeddings
Define and create a Milvus collection to store your document embeddings. This example uses a 1024-dimensional embedding space suitable for modern embedding models:
The embedding dimension (1024) should match the output dimension of your embedding model. Common dimensions are 768, 1024, or 1536 depending on your chosen embedding provider.
7. Generate and store embeddings
Create embeddings for your documents using your preferred embedding provider and store them in Milvus. This example shows the structure for inserting documents:
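A sketch of the insert structure; the embed function below is a placeholder that returns zero vectors, and must be replaced with a call to your actual embedding provider:

```python
documents = [
    "Milvus is an open-source vector database.",
    "Cerebras provides fast LLM inference over an OpenAI-compatible API.",
]

def embed(texts):
    """Placeholder: call your embedding provider here and return
    one 1024-dimensional vector per input text."""
    return [[0.0] * 1024 for _ in texts]

rows = [
    {"text": text, "embedding": vector}
    for text, vector in zip(documents, embed(documents))
]

# collection.insert(rows)  # requires the collection from the previous step
# collection.flush()       # make the inserts durable and visible
```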
8. Create an index for fast retrieval
Create an index on the embedding field to enable fast similarity search. The IVF_FLAT index provides a good balance of speed and accuracy:
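A sketch of IVF_FLAT index parameters, assuming the embedding field from step 6 and inner-product similarity:

```python
index_params = {
    "index_type": "IVF_FLAT",
    "metric_type": "IP",        # use the same metric when searching
    "params": {"nlist": 128},   # number of clusters; tune for your data size
}

# collection.create_index(field_name="embedding", index_params=index_params)
# collection.load()  # load into memory so the collection is searchable
```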
For production use, consider the HNSW index for better performance with larger datasets. Adjust nlist based on your data size: use higher values (1024-4096) for millions of vectors.
9. Build a RAG query pipeline
Now create a complete RAG pipeline that retrieves relevant documents from Milvus and generates responses using Cerebras. The pipeline retrieves the most relevant documents based on semantic similarity, then uses Cerebras’s fast inference to generate a contextually aware response:
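One possible shape for the pipeline, assuming the collection, client, and index from the previous steps; the model name is illustrative, so substitute any Cerebras model:

```python
def build_prompt(question, contexts):
    """Combine retrieved passages and the user question into one prompt."""
    context_block = "\n\n".join(contexts)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}"
    )

def rag_query(collection, client, question, query_embedding, top_k=3):
    # 1. Retrieve the most similar documents from Milvus.
    results = collection.search(
        data=[query_embedding],
        anns_field="embedding",
        param={"metric_type": "IP", "params": {"nprobe": 16}},
        limit=top_k,
        output_fields=["text"],
    )
    contexts = [hit.entity.get("text") for hit in results[0]]

    # 2. Generate an answer with Cerebras using the retrieved context.
    response = client.chat.completions.create(
        model="llama3.1-8b",  # illustrative; pick any Cerebras model
        messages=[{"role": "user", "content": build_prompt(question, contexts)}],
    )
    return response.choices[0].message.content
```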
10. Stream responses for better UX
For a better user experience, enable streaming to display responses as they’re generated:
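A streaming variant of the generation step, assuming the same client as above; setting stream=True yields chunks as tokens arrive:

```python
def stream_answer(client, messages, model="llama3.1-8b"):
    """Print tokens as they arrive instead of waiting for the full reply."""
    stream = client.chat.completions.create(
        model=model,
        messages=messages,
        stream=True,
    )
    answer = []
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)
            answer.append(delta)
    print()
    return "".join(answer)
```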
Next Steps
- Migrate to GLM4.6: Ready to upgrade? Follow our migration guide to start using our latest model
- Explore Milvus documentation for advanced features like hybrid search and filtering
- Try different Cerebras models to optimize for your use case
- Learn about Milvus indexing strategies for better performance
- Check out Zilliz Cloud for a fully managed Milvus experience
- Experiment with different embedding models and dimensions for your specific domain
- Implement hybrid search combining vector and scalar filtering
Troubleshooting
Connection refused when connecting to Milvus
If you’re running Milvus locally with Docker, ensure the container is running. If it’s not, start Milvus using Docker Compose. For Zilliz Cloud, verify that your URI and token are correct in your .env file; the URI should look like https://your-cluster.api.gcp-us-west1.zillizcloud.com
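Assuming a standard Docker Compose deployment, checking and starting the container might look like:

```shell
# Check whether the Milvus container is up
docker ps --filter "name=milvus"

# If it is not listed, start it from the directory
# containing your docker-compose.yml:
docker compose up -d
```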
Dimension mismatch error when inserting embeddings
This error occurs when the embedding dimension doesn’t match your collection schema. Ensure:
- The dim parameter in your collection schema matches your embedding model’s output dimension
- You’re using the correct embedding model consistently throughout your application
- If you need to change embedding models, create a new collection with the correct dimension

Common embedding dimensions:
- OpenAI text-embedding-3-small: 1536
- OpenAI text-embedding-3-large: 3072 (or 1024 with the dimension parameter)
- Cohere embed-english-v3.0: 1024
- Voyage AI voyage-2: 1024
Slow search performance
To improve search performance:
- Choose the right index: use HNSW for the best performance on large datasets
- Adjust search parameters: increase ef for HNSW or nprobe for IVF indexes
- Ensure the collection is loaded: always call collection.load() before searching
- Use an appropriate nlist: for IVF indexes, set nlist to sqrt(num_entities) as a starting point
- Consider GPU acceleration: Milvus supports GPU indexes for even faster search on large datasets
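A sketch of the HNSW variant, assuming the same embedding field and inner-product metric used earlier in this guide:

```python
# HNSW index: better recall/latency trade-off on large collections
hnsw_index = {
    "index_type": "HNSW",
    "metric_type": "IP",
    "params": {"M": 16, "efConstruction": 200},
}

# Matching search parameters; raise "ef" for higher recall at some cost
hnsw_search = {"metric_type": "IP", "params": {"ef": 64}}

# collection.create_index(field_name="embedding", index_params=hnsw_index)
# collection.load()
```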
Out of memory errors
If you encounter memory issues:
- Batch your insertions: insert documents in batches of 1000-10000 instead of all at once
- Use memory-efficient indexes: IVF_SQ8 uses less memory than IVF_FLAT
- Adjust Docker memory limits: if running locally, increase Docker’s memory allocation in Docker Desktop settings
- Consider Zilliz Cloud: a managed service with automatic scaling and memory management
- Release collections: release collections from memory (collection.release()) when not in use
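The batched-insertion advice can be sketched as a small helper (the collection argument is any pymilvus Collection; rows is the list of dicts built in step 7):

```python
def insert_in_batches(collection, rows, batch_size=1000):
    """Insert rows in fixed-size batches to bound peak memory usage."""
    for start in range(0, len(rows), batch_size):
        collection.insert(rows[start:start + batch_size])
    collection.flush()  # persist and make the data searchable
```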
Why am I getting empty search results?
Empty search results can occur due to:
- Collection not loaded: ensure you call collection.load() after creating the index
- Wrong metric type: if you used IP (inner product) for indexing but L2 for searching, results may be incorrect; keep them consistent
- Embedding mismatch: ensure you’re using the same embedding model for both indexing and querying
- Collection is empty: verify data was inserted successfully (for example, check collection.num_entities)
- Search threshold too strict: try increasing the limit parameter or adjusting distance thresholds

