Unstructured is an open-source library that helps you extract, transform, and prepare unstructured data from documents (PDFs, Word files, images, and more) for use with LLMs and other AI applications. It provides powerful partitioning, chunking, and staging capabilities to convert raw documents into structured, AI-ready data. By combining Unstructured’s document processing with Cerebras’s ultra-fast inference, you can build intelligent document analysis pipelines that extract insights, answer questions, and generate summaries from your documents at unprecedented speed. Learn more at https://unstructured.io/.
Install the Unstructured library along with the OpenAI SDK for Cerebras integration:
```shell
pip install unstructured openai python-dotenv
```
For processing specific file types, you may need additional dependencies. To process PDFs with OCR:
```shell
pip install "unstructured[pdf]"
```
To install all available extras for maximum file type support:
```shell
pip install "unstructured[all-docs]"
```
2. Configure environment variables
Create a .env file in your project directory to store your API key securely:
```
CEREBRAS_API_KEY=your-cerebras-api-key-here
```
This keeps your credentials safe and makes it easy to manage different environments.
3. Initialize the Cerebras client
Set up the OpenAI-compatible client to connect to Cerebras Inference. This client will be used to send processed document content to Cerebras models for analysis:
Use Unstructured to extract and partition content from your document. The generic `partition` function automatically detects the file type; here, the PDF-specific `partition_pdf` extracts structured elements:
```python
import os
import requests
from unstructured.partition.pdf import partition_pdf
from dotenv import load_dotenv

load_dotenv()

# Download PDF from URL
pdf_url = "https://www.visitissaquahwa.com/wp-content/uploads/2023/03/Issaquah-Trails-Map-202108041607087155.pdf"
response = requests.get(pdf_url)

# Save temporarily and process
with open("/tmp/temp_doc.pdf", "wb") as f:
    f.write(response.content)

elements = partition_pdf(filename="/tmp/temp_doc.pdf")

# Convert elements to text
document_text = "\n\n".join([str(el) for el in elements])

print(f"Extracted {len(elements)} elements from document")
print(f"Total text length: {len(document_text)} characters")
```
The partition function intelligently identifies different document elements like titles, paragraphs, tables, and lists, preserving the document’s structure.
2. Analyze the document with Cerebras
Send the processed document content to a Cerebras model for analysis, summarization, or question answering:
```python
import os
import requests
from openai import OpenAI
from dotenv import load_dotenv
from unstructured.partition.pdf import partition_pdf

load_dotenv()

# Initialize Cerebras client
client = OpenAI(
    api_key=os.getenv("CEREBRAS_API_KEY"),
    base_url="https://api.cerebras.ai/v1",
    default_headers={
        "X-Cerebras-3rd-Party-Integration": "Unstructured"
    }
)

# Download PDF from URL
pdf_url = "https://www.visitissaquahwa.com/wp-content/uploads/2023/03/Issaquah-Trails-Map-202108041607087155.pdf"
response = requests.get(pdf_url)
with open("/tmp/temp_doc.pdf", "wb") as f:
    f.write(response.content)

elements = partition_pdf(filename="/tmp/temp_doc.pdf")
document_text = "\n\n".join([str(el) for el in elements])

response = client.chat.completions.create(
    model="gpt-oss-120b",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant that analyzes documents and provides concise summaries."
        },
        {
            "role": "user",
            "content": f"Please provide a comprehensive summary of this document:\n\n{document_text[:4000]}"
        }
    ],
    max_tokens=1000,
    temperature=0.7
)

summary = response.choices[0].message.content
print("\nDocument Summary:")
print(summary)
```
Cerebras’s fast inference means you get results in seconds, even for long documents.
For Retrieval-Augmented Generation (RAG) applications, you’ll want to chunk your documents into smaller, semantically meaningful pieces. This improves retrieval accuracy and ensures your context fits within model token limits.
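To see why structure-aware chunking matters, here is a deliberately naive fixed-size splitter for comparison (plain standard-library Python; `naive_chunks` is a hypothetical helper, not part of Unstructured). It happily cuts sentences and sections in half, which is exactly what structure-aware chunking avoids:

```python
def naive_chunks(text: str, max_characters: int = 1000) -> list[str]:
    """Split text into fixed-size pieces with no regard for document structure."""
    return [text[i:i + max_characters] for i in range(0, len(text), max_characters)]

doc = "Trailhead A. " * 400  # 5,200 characters of stand-in text
chunks = naive_chunks(doc, max_characters=1000)
print(f"{len(chunks)} chunks, first chunk is {len(chunks[0])} characters")
```

Every chunk respects the size limit, but boundaries fall mid-sentence; the structure-aware chunkers below keep related content together instead.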
1. Chunk documents intelligently
Use Unstructured’s chunking capabilities to split documents while preserving context:
```python
import os
import requests
from unstructured.partition.pdf import partition_pdf
from unstructured.chunking.title import chunk_by_title
from dotenv import load_dotenv

load_dotenv()

# Download PDF from URL
pdf_url = "https://www.visitissaquahwa.com/wp-content/uploads/2023/03/Issaquah-Trails-Map-202108041607087155.pdf"
response = requests.get(pdf_url)
with open("/tmp/temp_doc.pdf", "wb") as f:
    f.write(response.content)

# Partition the document
elements = partition_pdf(filename="/tmp/temp_doc.pdf")

# Chunk by title with a maximum chunk size
chunks = chunk_by_title(
    elements,
    max_characters=1000,
    combine_text_under_n_chars=200
)

print(f"Created {len(chunks)} chunks from document")

# Display first chunk as example
if chunks:
    print(f"\nFirst chunk preview:\n{str(chunks[0])[:200]}...")
```
The `chunk_by_title` function creates semantically coherent chunks by keeping related content together based on document structure.
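A simplified illustration of the idea (this is NOT the library's implementation, just a sketch over hypothetical `(category, text)` pairs): start a new chunk at each title, and close the current chunk when it would exceed the size limit.

```python
def chunk_by_title_sketch(elements, max_characters=1000):
    """elements: list of (category, text) tuples, e.g. ("Title", "Trails")."""
    chunks, current = [], ""
    for category, text in elements:
        starts_new_section = category == "Title"
        too_big = len(current) + len(text) + 2 > max_characters
        if current and (starts_new_section or too_big):
            chunks.append(current)   # close the chunk in progress
            current = ""
        current = f"{current}\n\n{text}" if current else text
    if current:
        chunks.append(current)
    return chunks

elements = [
    ("Title", "Tiger Mountain"),
    ("NarrativeText", "A popular trail network south of Issaquah."),
    ("Title", "Cougar Mountain"),
    ("NarrativeText", "Wooded paths with frequent trailheads."),
]
sketch_chunks = chunk_by_title_sketch(elements)
print(sketch_chunks)
```

Each title and its following text land in the same chunk, which is what makes the chunks semantically coherent.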
2. Process chunks with Cerebras for Q&A
Use the chunked content to build a question-answering system:
```python
import os
import requests
from openai import OpenAI
from dotenv import load_dotenv
from unstructured.partition.pdf import partition_pdf
from unstructured.chunking.title import chunk_by_title

load_dotenv()

# Initialize Cerebras client
client = OpenAI(
    api_key=os.getenv("CEREBRAS_API_KEY"),
    base_url="https://api.cerebras.ai/v1",
    default_headers={
        "X-Cerebras-3rd-Party-Integration": "Unstructured"
    }
)

# Download PDF from URL
pdf_url = "https://www.visitissaquahwa.com/wp-content/uploads/2023/03/Issaquah-Trails-Map-202108041607087155.pdf"
response = requests.get(pdf_url)
with open("/tmp/temp_doc.pdf", "wb") as f:
    f.write(response.content)

# Process and chunk document
elements = partition_pdf(filename="/tmp/temp_doc.pdf")
chunks = chunk_by_title(elements, max_characters=1000)

def answer_question(question: str, chunks: list) -> str:
    """Answer a question using document chunks and Cerebras."""
    # Combine chunks into context
    context = "\n\n".join([str(chunk) for chunk in chunks[:5]])
    response = client.chat.completions.create(
        model="gpt-oss-120b",
        messages=[
            {
                "role": "system",
                "content": "You are a helpful assistant that answers questions based on the provided document context. Only use information from the context to answer."
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}"
            }
        ],
        max_tokens=500,
        temperature=0.3
    )
    return response.choices[0].message.content

# Example usage
question = "What trails are mentioned in this document?"
answer = answer_question(question, chunks)
print(f"\nQ: {question}")
print(f"A: {answer}")
```
3. Integrate with vector databases
For production RAG systems, combine Unstructured with vector databases for semantic search:
```python
import os
import requests
from unstructured.partition.pdf import partition_pdf
from unstructured.chunking.title import chunk_by_title
from unstructured.staging.base import elements_to_json
from dotenv import load_dotenv

load_dotenv()

# Download PDF from URL
pdf_url = "https://www.visitissaquahwa.com/wp-content/uploads/2023/03/Issaquah-Trails-Map-202108041607087155.pdf"
response = requests.get(pdf_url)
with open("/tmp/temp_doc.pdf", "wb") as f:
    f.write(response.content)

# Process and chunk document
elements = partition_pdf(filename="/tmp/temp_doc.pdf")
chunks = chunk_by_title(elements, max_characters=1000)

# Convert elements to JSON format for vector database ingestion
json_elements = elements_to_json(chunks)

# Each chunk can now be embedded and stored in your vector database
# Example: Weaviate, Pinecone, Qdrant, etc.
for i, chunk in enumerate(chunks):
    chunk_text = str(chunk)
    # Generate embeddings and store in vector DB
    # vector_db.add(text=chunk_text, metadata={"chunk_id": i})
    print(f"Chunk {i}: {len(chunk_text)} characters")
```
Learn more about staging functions in the Unstructured documentation.
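As a self-contained sketch of the retrieval step itself, here is the embed-index-query loop with toy bag-of-words "embeddings" and cosine similarity in plain Python. A real system would use a proper embedding model and a vector database; this only illustrates the shape of the pipeline:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' -- a stand-in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# "Index" each chunk, then retrieve the chunk most similar to the query
chunks = [
    "Tiger Mountain has a dense network of hiking trails.",
    "Parking permits are required at several trailheads.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

query = embed("which trails can I hike on tiger mountain")
best_chunk, _ = max(index, key=lambda item: cosine(query, item[1]))
print(best_chunk)
```

The retrieved chunk would then be passed to Cerebras as context, exactly as in the `answer_question` example above.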
Unstructured supports over 20 file types including PDF, DOCX, PPTX, HTML, images (with OCR), and more. For the complete list, see the supported file types documentation.
How do I handle large documents that exceed token limits?
Use Unstructured’s chunking capabilities to split large documents into smaller pieces:
Process each chunk separately or use a map-reduce pattern to summarize chunks and then combine summaries.
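The map-reduce pattern can be sketched as follows. Here `summarize` is a stub standing in for a Cerebras chat-completion call, so the control flow runs without an API key:

```python
def summarize(text: str) -> str:
    """Stub standing in for a Cerebras chat-completion call that returns a summary."""
    return f"[summary of {len(text)} chars]"

def map_reduce_summary(chunks: list[str]) -> str:
    # Map: summarize each chunk independently (these calls could run in parallel)
    partial_summaries = [summarize(chunk) for chunk in chunks]
    # Reduce: summarize the concatenated partial summaries into one final answer
    return summarize("\n\n".join(partial_summaries))

chunks = ["chunk one " * 50, "chunk two " * 50, "chunk three " * 50]
final = map_reduce_summary(chunks)
print(final)
```

Because each map step only sees one chunk, no single request exceeds the model's token limit regardless of total document size.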
Can I use Unstructured with other Cerebras models?
Yes! You can use any Cerebras model. For document analysis, we recommend:
gpt-oss-120b: Best for complex analysis and reasoning
llama3.1-8b: Fastest option for simple extraction tasks
Simply change the model parameter in your API calls.
How do I extract tables from documents?
Unstructured automatically identifies and extracts tables. You can access them as HTML:
```python
import os
import requests
from unstructured.partition.pdf import partition_pdf
from dotenv import load_dotenv

load_dotenv()

# Download and process document
pdf_url = "https://www.visitissaquahwa.com/wp-content/uploads/2023/03/Issaquah-Trails-Map-202108041607087155.pdf"
response = requests.get(pdf_url)
with open("/tmp/temp_doc.pdf", "wb") as f:
    f.write(response.content)

elements = partition_pdf(filename="/tmp/temp_doc.pdf")

# Filter for table elements
tables = [el for el in elements if el.category == "Table"]

for table in tables:
    # Get table as HTML
    table_html = table.metadata.text_as_html
    print(table_html)
```
If you encounter errors during installation, try installing dependencies for specific file types:
```shell
# For PDF support
pip install "unstructured[pdf]"

# For image processing with OCR
pip install "unstructured[local-inference]"

# For all features
pip install "unstructured[all-docs]"
```
On macOS, you may need to install system dependencies: