What is Unstructured?
Unstructured is an open-source library that helps you extract, transform, and prepare unstructured data from documents (PDFs, Word files, images, and more) for use with LLMs and other AI applications. It provides powerful partitioning, chunking, and staging capabilities to convert raw documents into structured, AI-ready data. By combining Unstructured’s document processing with Cerebras’s ultra-fast inference, you can build intelligent document analysis pipelines that extract insights, answer questions, and generate summaries from your documents at unprecedented speeds. Learn more at https://unstructured.io/.Prerequisites
Before you begin, ensure you have:- Cerebras API Key - Get a free API key here.
- Python 3.11 or higher - Unstructured requires Python 3.11+.
- Sample Documents - Have some PDFs, Word docs, or other files ready to process.
Installation and Setup
Install required dependencies
Install the Unstructured library along with the OpenAI SDK for Cerebras integration:For processing specific file types, you may need additional dependencies. To process PDFs with OCR:To install all available extras for maximum file type support:
Configure environment variables
Create a This keeps your credentials safe and makes it easy to manage different environments.
.env file in your project directory to store your API key securely:Basic Document Processing
Process a document with Unstructured
Use Unstructured to extract and partition content from your document. The The
partition function automatically detects the file type and extracts structured elements:partition function intelligently identifies different document elements like titles, paragraphs, tables, and lists, preserving the document’s structure.Advanced: Chunking for RAG Applications
For Retrieval-Augmented Generation (RAG) applications, you’ll want to chunk your documents into smaller, semantically meaningful pieces. This improves retrieval accuracy and ensures your context fits within model token limits.Chunk documents intelligently
Use Unstructured’s chunking capabilities to split documents while preserving context:The
chunk_by_title function creates semantically coherent chunks by keeping related content together based on document structure.Integrate with vector databases
For production RAG systems, combine Unstructured with vector databases for semantic search:Learn more about staging functions in the Unstructured documentation.
Complete Example: Document Analysis Pipeline
Here’s a complete example that processes a document, extracts key information, and generates insights:Use Cases
Document Summarization
Process lengthy reports, research papers, or legal documents and generate concise summaries:Information Extraction
Extract structured data from unstructured documents:Multi-Document Analysis
Compare and analyze multiple documents simultaneously:Supported File Types
Unstructured supports a wide variety of file formats:| Category | Formats |
|---|---|
| Documents | PDF, DOCX, DOC, ODT, RTF, TXT |
| Presentations | PPTX, PPT, ODP |
| Spreadsheets | XLSX, XLS, CSV, TSV |
| Web | HTML, XML, Markdown, EPUB |
| Images | JPG, PNG, TIFF (with OCR) |
| EML, MSG | |
| Code | Python, JavaScript, Java, and more |
Best Practices
Chunking Strategy
Choose the right chunking strategy based on your use case:chunk_by_title: Best for documents with clear hierarchical structure (reports, articles)- Fixed-size chunking: Good for uniform processing and consistent token counts
- Semantic chunking: Ideal for maintaining context in conversational or narrative documents
Token Management
Monitor token usage to optimize costs and performance:Error Handling
Implement robust error handling for production systems:Frequently Asked Questions
What file types does Unstructured support?
What file types does Unstructured support?
Unstructured supports over 20 file types including PDF, DOCX, PPTX, HTML, images (with OCR), and more. For the complete list, see the supported file types documentation.
How do I handle large documents that exceed token limits?
How do I handle large documents that exceed token limits?
Use Unstructured’s chunking capabilities to split large documents into smaller pieces:Process each chunk separately or use a map-reduce pattern to summarize chunks and then combine summaries.
Can I use Unstructured with other Cerebras models?
Can I use Unstructured with other Cerebras models?
Yes! You can use any Cerebras model. For document analysis, we recommend:
- cerebras/gpt-oss-120b: Best for complex analysis and reasoning
- cerebras/qwen-3-32b: Great balance of speed and capability
- cerebras/llama3.1-8b: Fastest option for simple extraction tasks
model parameter in your API calls.How do I extract tables from documents?
How do I extract tables from documents?
Unstructured automatically identifies and extracts tables. You can access them as HTML:Learn more in the table extraction guide.
What's the difference between partitioning strategies?
What's the difference between partitioning strategies?
Unstructured offers three partitioning strategies:
- auto: Automatically selects the best strategy (recommended)
- fast: Faster processing with basic text extraction
- hi_res: High-resolution processing with better table and layout detection
How do I process documents with OCR?
How do I process documents with OCR?
Install the OCR dependencies and Unstructured will automatically use OCR for images and scanned PDFs:Then process as normal:Unstructured uses Tesseract OCR by default. You can also configure it to use other OCR engines.
Troubleshooting
Installation Issues
If you encounter errors during installation, try installing dependencies for specific file types:Memory Issues with Large Documents
For very large documents, use thefast strategy to reduce memory usage. Process documents in batches or use streaming approaches for very large files.
API Rate Limits
If processing many documents, implement rate limiting and error handling:Document Processing Errors
If a document fails to process, try different strategies:Next Steps
- Explore Unstructured’s staging functions for integration with vector databases
- Learn about chunking strategies for optimal RAG performance
- Try different Cerebras models for various document analysis tasks
- Build RAG applications using Unstructured and Cerebras
- Explore embedding options for semantic search
- Check out Unstructured’s integrations with popular ML frameworks

