
What is Unstructured?

Unstructured is an open-source library that helps you extract, transform, and prepare unstructured data from documents (PDFs, Word files, images, and more) for use with LLMs and other AI applications. It provides powerful partitioning, chunking, and staging capabilities to convert raw documents into structured, AI-ready data. By combining Unstructured’s document processing with Cerebras’s ultra-fast inference, you can build intelligent document analysis pipelines that extract insights, answer questions, and generate summaries from your documents at unprecedented speeds. Learn more at https://unstructured.io/.

Prerequisites

Before you begin, ensure you have:
  • Cerebras API Key - Get a free API key here.
  • Python 3.11 or higher - Unstructured requires Python 3.11+.
  • Sample Documents - Have some PDFs, Word docs, or other files ready to process.

Installation and Setup

Step 1: Install required dependencies

Install the Unstructured library along with the OpenAI SDK for Cerebras integration:
pip install unstructured openai python-dotenv
For processing specific file types, you may need additional dependencies. To process PDFs with OCR:
pip install "unstructured[pdf]"
To install all available extras for maximum file type support:
pip install "unstructured[all-docs]"
Step 2: Configure environment variables

Create a .env file in your project directory to store your API key securely:
CEREBRAS_API_KEY=your-cerebras-api-key-here
This keeps your credentials safe and makes it easy to manage different environments.
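As a quick sanity check, you can verify the key is visible to your process before making any API calls. This is a minimal sketch; the variable name matches the .env entry above:
import os
from dotenv import load_dotenv

load_dotenv()

# Fail early with a clear message if the key is missing
if not os.getenv("CEREBRAS_API_KEY"):
    raise RuntimeError("CEREBRAS_API_KEY is not set; check your .env file")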
Step 3: Initialize the Cerebras client

Set up the OpenAI-compatible client to connect to Cerebras Inference. This client will be used to send processed document content to Cerebras models for analysis:
import os
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()

client = OpenAI(
    api_key=os.getenv("CEREBRAS_API_KEY"),
    base_url="https://api.cerebras.ai/v1",
    default_headers={
        "X-Cerebras-3rd-Party-Integration": "Unstructured IO"
    }
)

Basic Document Processing

Step 1: Process a document with Unstructured

Use Unstructured to extract and partition content from your document. The partition function automatically detects the file type and extracts structured elements:
import os
import requests
from unstructured.partition.pdf import partition_pdf
from dotenv import load_dotenv

load_dotenv()

# Download PDF from URL
pdf_url = "https://www.visitissaquahwa.com/wp-content/uploads/2023/03/Issaquah-Trails-Map-202108041607087155.pdf"
response = requests.get(pdf_url)

# Save temporarily and process
with open("/tmp/temp_doc.pdf", "wb") as f:
    f.write(response.content)

elements = partition_pdf(filename="/tmp/temp_doc.pdf")

# Convert elements to text
document_text = "\n\n".join([str(el) for el in elements])

print(f"Extracted {len(elements)} elements from document")
print(f"Total text length: {len(document_text)} characters")
The partition function intelligently identifies different document elements like titles, paragraphs, tables, and lists, preserving the document’s structure.
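To see what was detected, each element exposes a category attribute you can inspect. A short sketch that continues from the elements list produced above:
from collections import Counter

# Tally the element types Unstructured detected (Title, NarrativeText, ListItem, ...)
category_counts = Counter(el.category for el in elements)
for category, count in category_counts.most_common():
    print(f"{category}: {count}")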
Step 2: Analyze the document with Cerebras

Send the processed document content to a Cerebras model for analysis, summarization, or question answering:
import os
import requests
from openai import OpenAI
from dotenv import load_dotenv
from unstructured.partition.pdf import partition_pdf

load_dotenv()

# Initialize Cerebras client
client = OpenAI(
    api_key=os.getenv("CEREBRAS_API_KEY"),
    base_url="https://api.cerebras.ai/v1",
    default_headers={
        "X-Cerebras-3rd-Party-Integration": "Unstructured IO"
    }
)

# Download PDF from URL
pdf_url = "https://www.visitissaquahwa.com/wp-content/uploads/2023/03/Issaquah-Trails-Map-202108041607087155.pdf"
response = requests.get(pdf_url)

with open("/tmp/temp_doc.pdf", "wb") as f:
    f.write(response.content)

elements = partition_pdf(filename="/tmp/temp_doc.pdf")
document_text = "\n\n".join([str(el) for el in elements])

response = client.chat.completions.create(
    model="gpt-oss-120b",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant that analyzes documents and provides concise summaries."
        },
        {
            "role": "user",
            "content": f"Please provide a comprehensive summary of this document:\n\n{document_text[:4000]}"
        }
    ],
    max_tokens=1000,
    temperature=0.7
)

summary = response.choices[0].message.content
print("\nDocument Summary:")
print(summary)
Cerebras’s fast inference means you get results in seconds, even for long documents.

Advanced: Chunking for RAG Applications

For Retrieval-Augmented Generation (RAG) applications, you’ll want to chunk your documents into smaller, semantically meaningful pieces. This improves retrieval accuracy and ensures your context fits within model token limits.
Step 1: Chunk documents intelligently

Use Unstructured’s chunking capabilities to split documents while preserving context:
import os
import requests
from unstructured.partition.pdf import partition_pdf
from unstructured.chunking.title import chunk_by_title
from dotenv import load_dotenv

load_dotenv()

# Download PDF from URL
pdf_url = "https://www.visitissaquahwa.com/wp-content/uploads/2023/03/Issaquah-Trails-Map-202108041607087155.pdf"
response = requests.get(pdf_url)

with open("/tmp/temp_doc.pdf", "wb") as f:
    f.write(response.content)

# Partition the document
elements = partition_pdf(filename="/tmp/temp_doc.pdf")

# Chunk by title with a maximum chunk size
chunks = chunk_by_title(
    elements,
    max_characters=1000,
    combine_text_under_n_chars=200
)

print(f"Created {len(chunks)} chunks from document")

# Display first chunk as example
if chunks:
    print(f"\nFirst chunk preview:\n{str(chunks[0])[:200]}...")
The chunk_by_title function creates semantically coherent chunks by keeping related content together based on document structure.
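chunk_by_title also accepts options for soft break points and overlap. A sketch of the commonly used parameters, reusing the elements list from above (the values are just examples):
chunks = chunk_by_title(
    elements,
    max_characters=1000,             # hard maximum characters per chunk
    new_after_n_chars=800,           # soft limit: start a new chunk once this size is reached
    combine_text_under_n_chars=200,  # merge small sections into neighbors
    overlap=100,                     # character overlap applied when oversized elements are split
)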
Step 2: Process chunks with Cerebras for Q&A

Use the chunked content to build a question-answering system:
import os
import requests
from openai import OpenAI
from dotenv import load_dotenv
from unstructured.partition.pdf import partition_pdf
from unstructured.chunking.title import chunk_by_title

load_dotenv()

# Initialize Cerebras client
client = OpenAI(
    api_key=os.getenv("CEREBRAS_API_KEY"),
    base_url="https://api.cerebras.ai/v1",
    default_headers={
        "X-Cerebras-3rd-Party-Integration": "Unstructured IO"
    }
)

# Download PDF from URL
pdf_url = "https://www.visitissaquahwa.com/wp-content/uploads/2023/03/Issaquah-Trails-Map-202108041607087155.pdf"
response = requests.get(pdf_url)

with open("/tmp/temp_doc.pdf", "wb") as f:
    f.write(response.content)

# Process and chunk document
elements = partition_pdf(filename="/tmp/temp_doc.pdf")
chunks = chunk_by_title(elements, max_characters=1000)

def answer_question(question: str, chunks: list) -> str:
    """Answer a question using document chunks and Cerebras."""
    
    # Combine chunks into context
    context = "\n\n".join([str(chunk) for chunk in chunks[:5]])
    
    response = client.chat.completions.create(
        model="gpt-oss-120b",
        messages=[
            {
                "role": "system",
                "content": "You are a helpful assistant that answers questions based on the provided document context. Only use information from the context to answer."
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}"
            }
        ],
        max_tokens=500,
        temperature=0.3
    )
    
    return response.choices[0].message.content

# Example usage
question = "What trails are mentioned in this document?"
answer = answer_question(question, chunks)
print(f"\nQ: {question}")
print(f"A: {answer}")
Step 3: Integrate with vector databases

For production RAG systems, combine Unstructured with vector databases for semantic search:
import os
import requests
from unstructured.partition.pdf import partition_pdf
from unstructured.chunking.title import chunk_by_title
from unstructured.staging.base import elements_to_json
from dotenv import load_dotenv

load_dotenv()

# Download PDF from URL
pdf_url = "https://www.visitissaquahwa.com/wp-content/uploads/2023/03/Issaquah-Trails-Map-202108041607087155.pdf"
response = requests.get(pdf_url)

with open("/tmp/temp_doc.pdf", "wb") as f:
    f.write(response.content)

# Process and chunk document
elements = partition_pdf(filename="/tmp/temp_doc.pdf")
chunks = chunk_by_title(elements, max_characters=1000)

# Convert elements to JSON format for vector database ingestion
json_elements = elements_to_json(chunks)

# Each chunk can now be embedded and stored in your vector database
# Example: Weaviate, Pinecone, Qdrant, etc.
for i, chunk in enumerate(chunks):
    chunk_text = str(chunk)
    # Generate embeddings and store in vector DB
    # vector_db.add(text=chunk_text, metadata={"chunk_id": i})
    print(f"Chunk {i}: {len(chunk_text)} characters")
Learn more about staging functions in the Unstructured documentation.
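If you want to persist the chunks for a later ingestion job, elements_to_json can also write straight to disk; the filename below is just an example:
# Write the chunk elements (text plus metadata) to a JSON file
elements_to_json(chunks, filename="chunks.json", indent=2)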

Complete Example: Document Analysis Pipeline

Here’s a complete example that processes a document, extracts key information, and generates insights:
import os
import requests
from openai import OpenAI
from dotenv import load_dotenv
from unstructured.partition.pdf import partition_pdf
from unstructured.chunking.title import chunk_by_title

load_dotenv()

client = OpenAI(
    api_key=os.getenv("CEREBRAS_API_KEY"),
    base_url="https://api.cerebras.ai/v1",
    default_headers={
        "X-Cerebras-3rd-Party-Integration": "Unstructured IO"
    }
)

def analyze_document(pdf_url: str):
    """Complete document analysis pipeline."""
    
    # Step 1: Download and extract document
    print(f"Processing {pdf_url}...")
    response = requests.get(pdf_url)
    
    with open("/tmp/temp_doc.pdf", "wb") as f:
        f.write(response.content)
    
    elements = partition_pdf(filename="/tmp/temp_doc.pdf")
    print(f"Extracted {len(elements)} elements")
    
    # Step 2: Create chunks for better processing
    chunks = chunk_by_title(
        elements,
        max_characters=1500,
        combine_text_under_n_chars=300
    )
    print(f"Created {len(chunks)} chunks")
    
    # Step 3: Generate summary
    full_text = "\n\n".join([str(el) for el in elements])
    
    summary_response = client.chat.completions.create(
        model="gpt-oss-120b",
        messages=[
            {
                "role": "system",
                "content": "You are an expert document analyst. Provide clear, concise summaries."
            },
            {
                "role": "user",
                "content": f"Summarize this document in 3-5 bullet points:\n\n{full_text[:4000]}"
            }
        ],
        max_tokens=500,
        temperature=0.5
    )
    
    print("\n=== Document Summary ===")
    print(summary_response.choices[0].message.content)
    
    # Step 4: Extract key entities
    entities_response = client.chat.completions.create(
        model="gpt-oss-120b",
        messages=[
            {
                "role": "system",
                "content": "Extract key entities (people, organizations, dates, locations) from documents."
            },
            {
                "role": "user",
                "content": f"List the key entities mentioned in this document:\n\n{full_text[:4000]}"
            }
        ],
        max_tokens=300,
        temperature=0.3
    )
    
    print("\n=== Key Entities ===")
    print(entities_response.choices[0].message.content)
    
    return {
        "elements": elements,
        "chunks": chunks,
        "summary": summary_response.choices[0].message.content,
        "entities": entities_response.choices[0].message.content
    }

# Run the analysis
if __name__ == "__main__":
    pdf_url = "https://www.visitissaquahwa.com/wp-content/uploads/2023/03/Issaquah-Trails-Map-202108041607087155.pdf"
    results = analyze_document(pdf_url)
    print("\nAnalysis complete!")

Use Cases

Document Summarization

Process lengthy reports, research papers, or legal documents and generate concise summaries:
import os
from openai import OpenAI
from dotenv import load_dotenv
from unstructured.partition.auto import partition

load_dotenv()

# Initialize Cerebras client
client = OpenAI(
    api_key=os.getenv("CEREBRAS_API_KEY"),
    base_url="https://api.cerebras.ai/v1",
    default_headers={
        "X-Cerebras-3rd-Party-Integration": "Unstructured IO"
    }
)

def summarize_document(file_path: str, summary_length: str = "medium") -> str:
    """Generate a summary of any document."""
    elements = partition(filename=file_path)
    text = "\n\n".join([str(el) for el in elements])
    
    length_instructions = {
        "short": "in 2-3 sentences",
        "medium": "in 3-5 bullet points",
        "long": "in 2-3 paragraphs with key details"
    }
    
    response = client.chat.completions.create(
        model="gpt-oss-120b",
        messages=[
            {"role": "system", "content": "You are an expert at summarizing documents."},
            {"role": "user", "content": f"Summarize this document {length_instructions[summary_length]}:\n\n{text[:5000]}"}
        ],
        max_tokens=800,
        temperature=0.5
    )
    
    return response.choices[0].message.content

# Example usage
# summary = summarize_document("path/to/document.pdf", "medium")
# print(summary)

Information Extraction

Extract structured data from unstructured documents:
import os
from openai import OpenAI
from dotenv import load_dotenv
from unstructured.partition.auto import partition

load_dotenv()

# Initialize Cerebras client
client = OpenAI(
    api_key=os.getenv("CEREBRAS_API_KEY"),
    base_url="https://api.cerebras.ai/v1",
    default_headers={
        "X-Cerebras-3rd-Party-Integration": "Unstructured IO"
    }
)

def extract_structured_data(file_path: str) -> dict:
    """Extract structured information from a document."""
    elements = partition(filename=file_path)
    text = "\n\n".join([str(el) for el in elements])
    
    response = client.chat.completions.create(
        model="gpt-oss-120b",
        messages=[
            {
                "role": "system",
                "content": "Extract structured information from documents. Return JSON format."
            },
            {
                "role": "user",
                "content": f"Extract key information (dates, names, amounts, locations) from this document as JSON:\n\n{text[:4000]}"
            }
        ],
        max_tokens=1000,
        temperature=0.2
    )
    
    return response.choices[0].message.content

# Example usage
# data = extract_structured_data("path/to/document.pdf")
# print(data)

Multi-Document Analysis

Compare and analyze multiple documents simultaneously:
import os
from openai import OpenAI
from dotenv import load_dotenv
from unstructured.partition.auto import partition

load_dotenv()

# Initialize Cerebras client
client = OpenAI(
    api_key=os.getenv("CEREBRAS_API_KEY"),
    base_url="https://api.cerebras.ai/v1",
    default_headers={
        "X-Cerebras-3rd-Party-Integration": "Unstructured IO"
    }
)

def compare_documents(file_paths: list) -> str:
    """Compare multiple documents and identify key differences."""
    documents = []
    
    for path in file_paths:
        elements = partition(filename=path)
        text = "\n\n".join([str(el) for el in elements])
        documents.append({"path": path, "text": text[:2000]})
    
    comparison_text = "\n\n---\n\n".join(
        [f"Document {i+1} ({doc['path']}):\n{doc['text']}" 
         for i, doc in enumerate(documents)]
    )
    
    response = client.chat.completions.create(
        model="gpt-oss-120b",
        messages=[
            {
                "role": "system",
                "content": "You are an expert at comparing documents and identifying key differences."
            },
            {
                "role": "user",
                "content": f"Compare these documents and highlight key differences:\n\n{comparison_text}"
            }
        ],
        max_tokens=1000,
        temperature=0.5
    )
    
    return response.choices[0].message.content

# Example usage
# comparison = compare_documents(["doc1.pdf", "doc2.pdf"])
# print(comparison)

Supported File Types

Unstructured supports a wide variety of file formats:
  • Documents: PDF, DOCX, DOC, ODT, RTF, TXT
  • Presentations: PPTX, PPT, ODP
  • Spreadsheets: XLSX, XLS, CSV, TSV
  • Web: HTML, XML, Markdown, EPUB
  • Images: JPG, PNG, TIFF (with OCR)
  • Email: EML, MSG
  • Code: Python, JavaScript, Java, and more
See the Unstructured documentation for the complete list of supported formats.

Best Practices

Chunking Strategy

Choose the right chunking strategy based on your use case:
  • chunk_by_title: Best for documents with clear hierarchical structure (reports, articles)
  • Fixed-size chunking: Good for uniform processing and consistent token counts
  • Semantic chunking: Ideal for maintaining context in conversational or narrative documents
import os
import requests
from unstructured.partition.pdf import partition_pdf
from unstructured.chunking.title import chunk_by_title
from unstructured.chunking.basic import chunk_elements
from dotenv import load_dotenv

load_dotenv()

# Download and process document
pdf_url = "https://www.visitissaquahwa.com/wp-content/uploads/2023/03/Issaquah-Trails-Map-202108041607087155.pdf"
response = requests.get(pdf_url)

with open("/tmp/temp_doc.pdf", "wb") as f:
    f.write(response.content)

elements = partition_pdf(filename="/tmp/temp_doc.pdf")

# Hierarchical chunking (recommended for most documents)
chunks = chunk_by_title(elements, max_characters=1000)

# Fixed-size chunking
chunks_fixed = chunk_elements(elements, max_characters=500)

Token Management

Monitor token usage to optimize costs and performance:
import os
from dotenv import load_dotenv

load_dotenv()

def estimate_tokens(text: str) -> int:
    """Rough estimate of token count (1 token ≈ 4 characters)."""
    return len(text) // 4

def process_with_token_limit(text: str, max_tokens: int = 4000):
    """Process text while respecting token limits."""
    estimated_tokens = estimate_tokens(text)
    
    if estimated_tokens > max_tokens:
        # Truncate or chunk the text
        char_limit = max_tokens * 4
        text = text[:char_limit]
        print(f"Text truncated to fit {max_tokens} token limit")
    
    return text

# Example usage
# sample_text = "Your document text here..."
# processed = process_with_token_limit(sample_text, 4000)
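The character-based estimate above is only a rough guide; the chat completion response also reports exact counts in its usage field, which you can log to track consumption. A small sketch, assuming the Cerebras client from the setup step:
response = client.chat.completions.create(
    model="gpt-oss-120b",
    messages=[{"role": "user", "content": "Summarize: ..."}],
    max_tokens=200
)

# The usage object reports exact prompt and completion token counts
usage = response.usage
print(f"Prompt tokens: {usage.prompt_tokens}")
print(f"Completion tokens: {usage.completion_tokens}")
print(f"Total tokens: {usage.total_tokens}")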

Error Handling

Implement robust error handling for production systems:
import os
import time
from openai import OpenAI, RateLimitError, APIError
from dotenv import load_dotenv

load_dotenv()

# Initialize Cerebras client
client = OpenAI(
    api_key=os.getenv("CEREBRAS_API_KEY"),
    base_url="https://api.cerebras.ai/v1",
    default_headers={
        "X-Cerebras-3rd-Party-Integration": "Unstructured IO"
    }
)

def process_with_retry(text: str, max_retries: int = 3):
    """Process text with automatic retry logic."""
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="gpt-oss-120b",
                messages=[{"role": "user", "content": text}],
                max_tokens=1000
            )
            return response.choices[0].message.content
        except RateLimitError:
            if attempt < max_retries - 1:
                wait_time = 2 ** attempt
                print(f"Rate limit hit. Waiting {wait_time}s...")
                time.sleep(wait_time)
            else:
                raise
        except APIError as e:
            print(f"API error: {e}")
            if attempt < max_retries - 1:
                time.sleep(1)
            else:
                raise

Frequently Asked Questions

What file types does Unstructured support?

Unstructured supports over 20 file types including PDF, DOCX, PPTX, HTML, images (with OCR), and more. For the complete list, see the supported file types documentation.

How do I process documents that are too large for the model's context window?

Use Unstructured's chunking capabilities to split large documents into smaller pieces:
import os
import requests
from unstructured.partition.pdf import partition_pdf
from unstructured.chunking.title import chunk_by_title
from dotenv import load_dotenv

load_dotenv()

# Download and process document
pdf_url = "https://www.visitissaquahwa.com/wp-content/uploads/2023/03/Issaquah-Trails-Map-202108041607087155.pdf"
response = requests.get(pdf_url)

with open("/tmp/temp_doc.pdf", "wb") as f:
    f.write(response.content)

elements = partition_pdf(filename="/tmp/temp_doc.pdf")

chunks = chunk_by_title(
    elements,
    max_characters=1000,
    combine_text_under_n_chars=200
)
Process each chunk separately or use a map-reduce pattern to summarize chunks and then combine summaries.
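A map-reduce pass might look like the following sketch: summarize each chunk individually, then ask the model to merge the partial summaries. It assumes the client and chunks objects from the earlier examples:
def summarize_large_document(chunks: list) -> str:
    """Map-reduce summarization over document chunks."""
    # Map step: summarize each chunk on its own
    partial_summaries = []
    for chunk in chunks:
        resp = client.chat.completions.create(
            model="gpt-oss-120b",
            messages=[{"role": "user", "content": f"Summarize this passage in 2-3 sentences:\n\n{str(chunk)}"}],
            max_tokens=200
        )
        partial_summaries.append(resp.choices[0].message.content)

    # Reduce step: combine the partial summaries into one
    combined = "\n\n".join(partial_summaries)
    resp = client.chat.completions.create(
        model="gpt-oss-120b",
        messages=[{"role": "user", "content": f"Combine these partial summaries into a single coherent summary:\n\n{combined}"}],
        max_tokens=500
    )
    return resp.choices[0].message.content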
Can I use other Cerebras models?

Yes! You can use any Cerebras model. For document analysis, we recommend:
  • gpt-oss-120b: Best for complex analysis and reasoning
  • qwen-3-32b: Great balance of speed and capability
  • llama3.1-8b: Fastest option for simple extraction tasks
Simply change the model parameter in your API calls.
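For example, switching to the fastest model for a quick extraction pass is a one-line change (sketch; assumes the client from the setup step):
response = client.chat.completions.create(
    model="llama3.1-8b",  # swap the model ID here
    messages=[{"role": "user", "content": "Extract the dates from: ..."}]
)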
How do I extract tables from documents?

Unstructured automatically identifies and extracts tables. You can access them as HTML:
import os
import requests
from unstructured.partition.pdf import partition_pdf
from dotenv import load_dotenv

load_dotenv()

# Download and process document
pdf_url = "https://www.visitissaquahwa.com/wp-content/uploads/2023/03/Issaquah-Trails-Map-202108041607087155.pdf"
response = requests.get(pdf_url)

with open("/tmp/temp_doc.pdf", "wb") as f:
    f.write(response.content)

elements = partition_pdf(filename="/tmp/temp_doc.pdf")

# Filter for table elements
tables = [el for el in elements if el.category == "Table"]

for table in tables:
    # Get table as HTML
    table_html = table.metadata.text_as_html
    print(table_html)
Learn more in the table extraction guide.
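Once you have a table's HTML, you can hand it to Cerebras for interpretation, for example to turn it into a natural-language description. A sketch assuming the client defined earlier and at least one extracted table:
if tables:
    table_html = tables[0].metadata.text_as_html

    response = client.chat.completions.create(
        model="gpt-oss-120b",
        messages=[
            {"role": "system", "content": "You convert HTML tables into concise natural-language descriptions."},
            {"role": "user", "content": f"Describe the contents of this table:\n\n{table_html}"}
        ],
        max_tokens=400
    )
    print(response.choices[0].message.content)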
What partitioning strategies are available?

Unstructured offers three partitioning strategies:
  • auto: Automatically selects the best strategy (recommended)
  • fast: Faster processing with basic text extraction
  • hi_res: High-resolution processing with better table and layout detection
import os
import requests
from unstructured.partition.pdf import partition_pdf
from dotenv import load_dotenv

load_dotenv()

# Download and process document
pdf_url = "https://www.visitissaquahwa.com/wp-content/uploads/2023/03/Issaquah-Trails-Map-202108041607087155.pdf"
response = requests.get(pdf_url)

with open("/tmp/temp_doc.pdf", "wb") as f:
    f.write(response.content)

# This example uses strategy="fast", which works without extra system dependencies
# Use strategy="hi_res" for better table and layout detection (requires poppler)
elements = partition_pdf(
    filename="/tmp/temp_doc.pdf",
    strategy="fast"
)
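If poppler and the layout-detection dependencies are installed, the same call can be switched to the high-resolution pipeline; infer_table_structure additionally asks it to reconstruct table layout. A sketch, assuming the hi_res extras are available:
# Requires the hi_res dependencies (e.g. pip install "unstructured[pdf]" plus poppler)
elements_hi_res = partition_pdf(
    filename="/tmp/temp_doc.pdf",
    strategy="hi_res",
    infer_table_structure=True  # populate metadata.text_as_html for tables
)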
How do I process scanned documents and images with OCR?

Install the OCR dependencies and Unstructured will automatically use OCR for images and scanned PDFs:
pip install "unstructured[local-inference]"
Then process as normal:
import os
import requests
from unstructured.partition.pdf import partition_pdf
from dotenv import load_dotenv

load_dotenv()

# Download and process document with OCR
pdf_url = "https://www.visitissaquahwa.com/wp-content/uploads/2023/03/Issaquah-Trails-Map-202108041607087155.pdf"
response = requests.get(pdf_url)

with open("/tmp/temp_doc.pdf", "wb") as f:
    f.write(response.content)

elements = partition_pdf(filename="/tmp/temp_doc.pdf")
Unstructured uses Tesseract OCR by default. You can also configure it to use other OCR engines.
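If your documents are not in English, recent versions of the library let you pass Tesseract language codes to the partition call. A sketch; it requires the matching Tesseract language packs to be installed:
# OCR a document that mixes English and German text
elements = partition_pdf(
    filename="/tmp/temp_doc.pdf",
    languages=["eng", "deu"]  # Tesseract language codes
)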

Troubleshooting

Installation Issues

If you encounter errors during installation, try installing dependencies for specific file types:
# For PDF support
pip install "unstructured[pdf]"

# For image processing with OCR
pip install "unstructured[local-inference]"

# For all features
pip install "unstructured[all-docs]"
On macOS, you may need to install system dependencies:
brew install libmagic poppler tesseract

Memory Issues with Large Documents

For very large documents, use the fast strategy to reduce memory usage. Process documents in batches or use streaming approaches for very large files.
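One way to keep memory bounded is to process files a few at a time with the fast strategy and release the extracted text between documents. A minimal sketch; the file paths are placeholders:
from unstructured.partition.pdf import partition_pdf

def process_in_batches(pdf_paths: list, batch_size: int = 5):
    """Process PDFs a few at a time to keep memory usage bounded."""
    for start in range(0, len(pdf_paths), batch_size):
        batch = pdf_paths[start:start + batch_size]
        for path in batch:
            # "fast" avoids loading layout-detection models and keeps memory low
            elements = partition_pdf(filename=path, strategy="fast")
            text = "\n\n".join(str(el) for el in elements)
            yield path, text

# Example usage
# for path, text in process_in_batches(["doc1.pdf", "doc2.pdf"]):
#     print(path, len(text))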

API Rate Limits

If processing many documents, implement rate limiting and error handling:
import os
import time
from openai import OpenAI, RateLimitError
from dotenv import load_dotenv

load_dotenv()

# Initialize Cerebras client
client = OpenAI(
    api_key=os.getenv("CEREBRAS_API_KEY"),
    base_url="https://api.cerebras.ai/v1",
    default_headers={
        "X-Cerebras-3rd-Party-Integration": "Unstructured IO"
    }
)

def process_with_retry(text: str, max_retries: int = 3):
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="gpt-oss-120b",
                messages=[{"role": "user", "content": text}]
            )
            return response.choices[0].message.content
        except RateLimitError:
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff
            else:
                raise

# Example usage
# result = process_with_retry("Your text here")
# print(result)

Document Processing Errors

If a document fails to process, try different strategies:
import os
import requests
from unstructured.partition.pdf import partition_pdf
from dotenv import load_dotenv

load_dotenv()

# Download document
pdf_url = "https://www.visitissaquahwa.com/wp-content/uploads/2023/03/Issaquah-Trails-Map-202108041607087155.pdf"
response = requests.get(pdf_url)

with open("/tmp/temp_doc.pdf", "wb") as f:
    f.write(response.content)

try:
    elements = partition_pdf(filename="/tmp/temp_doc.pdf", strategy="fast")
    print(f"Successfully processed with fast strategy: {len(elements)} elements")
except Exception as e:
    print(f"Processing failed: {e}")
