LangChain is a framework for developing applications powered by large language models (LLMs). It provides a standard interface for chains, lots of integrations with other tools, and end-to-end chains for common applications. By combining Cerebras’s ultra-fast inference with LangChain’s powerful orchestration capabilities, you can build production-ready AI applications with unprecedented speed and flexibility.

Prerequisites

Before you begin, ensure you have:
  • Cerebras API Key - Get a free API key here
  • Python 3.9 or higher - required by LangChain
  • Basic familiarity with LangChain - Visit LangChain documentation to learn more

Configure LangChain with Cerebras

Step 1: Install required dependencies

Install the LangChain Cerebras integration package. This package provides native LangChain integration for Cerebras models, including chat models and embeddings.
Dependency resolution: If you encounter dependency conflicts during installation, try running the install command twice. The first run may update core dependencies, and the second run will resolve any remaining conflicts. This is a known behavior with some package managers when updating to newer versions of langchain-core.
pip install langchain-cerebras langchain
Step 2: Configure environment variables

Create a .env file in your project directory to securely store your API key. This keeps your credentials separate from your code.
CEREBRAS_API_KEY=your-cerebras-api-key-here
Alternatively, you can set the environment variable in your shell:
export CEREBRAS_API_KEY="your-cerebras-api-key-here"
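If you go the .env route, keep in mind that os.getenv only sees variables already present in the process environment. A minimal sketch for loading the file at startup, assuming the python-dotenv package is installed (pip install python-dotenv):

from dotenv import load_dotenv
import os

# Read key=value pairs from .env into the process environment
load_dotenv()

# Confirm the key is now visible to os.getenv
print(os.getenv("CEREBRAS_API_KEY") is not None)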
Step 3: Initialize the Cerebras chat model

Import and initialize the Cerebras chat model. The ChatCerebras class provides a LangChain-compatible interface that automatically handles connection to Cerebras Cloud and includes proper tracking headers.
from langchain_cerebras import ChatCerebras
import os

# Initialize the Cerebras chat model
llm = ChatCerebras(
    model="llama-3.3-70b",
    api_key=os.getenv("CEREBRAS_API_KEY"),
    temperature=0.7,
    max_tokens=1024,
)
Step 4: Make your first request

Now you can use the model just like any other LangChain chat model. This example demonstrates basic message handling with system and user messages.
from langchain_cerebras import ChatCerebras
from langchain_core.messages import HumanMessage, SystemMessage
import os

# Initialize the model
llm = ChatCerebras(
    model="llama-3.3-70b",
    api_key=os.getenv("CEREBRAS_API_KEY"),
)

# Create messages
messages = [
    SystemMessage(content="You are a helpful AI assistant."),
    HumanMessage(content="What are the key benefits of using Cerebras for AI inference?")
]

# Get response
response = llm.invoke(messages)
print(response.content)
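The returned value is a LangChain AIMessage, so you can inspect more than the text itself. A small sketch; exact metadata fields depend on your langchain-core version, so treat usage_metadata as optional:

# response is the AIMessage produced by llm.invoke(messages) above
print(type(response).__name__)   # AIMessage
print(response.content)          # the generated text

# Newer langchain-core versions attach token usage when the provider reports it
print(getattr(response, "usage_metadata", None))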
Step 5: Use with LangChain chains

LangChain’s real power comes from chaining operations together. This example uses LCEL (LangChain Expression Language) to create a composable translation chain.
from langchain_cerebras import ChatCerebras
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
import os

# Initialize components
llm = ChatCerebras(
    model="llama-3.3-70b",
    api_key=os.getenv("CEREBRAS_API_KEY"),
)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant that translates {input_language} to {output_language}."),
    ("human", "{text}")
])

output_parser = StrOutputParser()

# Create chain using LCEL
chain = prompt | llm | output_parser

# Use the chain
result = chain.invoke({
    "input_language": "English",
    "output_language": "French",
    "text": "Hello, how are you?"
})

print(result)
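Because the composed chain is a Runnable, it also supports batch for running several inputs in one call. A minimal sketch reusing the translation chain defined above:

# Translate multiple inputs; results come back in the same order as the inputs
inputs = [
    {"input_language": "English", "output_language": "French", "text": "Good morning"},
    {"input_language": "English", "output_language": "Spanish", "text": "See you tomorrow"},
]

results = chain.batch(inputs)
for translation in results:
    print(translation)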
Step 6: Enable streaming responses

Cerebras models support streaming, which is perfect for real-time applications. Streaming allows you to display responses as they’re generated, providing a better user experience.
from langchain_cerebras import ChatCerebras
from langchain_core.messages import HumanMessage
import os

# Initialize with streaming enabled
llm = ChatCerebras(
    model="llama-3.3-70b",
    api_key=os.getenv("CEREBRAS_API_KEY"),
    streaming=True,
)

# Stream the response
for chunk in llm.stream([HumanMessage(content="Write a short poem about AI")]):
    print(chunk.content, end="", flush=True)
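Streaming also composes with LCEL chains: calling stream on a chain yields incremental chunks, which arrive as plain strings once the chain ends in StrOutputParser. A short, self-contained sketch (the prompt text is just an example):

from langchain_cerebras import ChatCerebras
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
import os

llm = ChatCerebras(
    model="llama-3.3-70b",
    api_key=os.getenv("CEREBRAS_API_KEY"),
)

prompt = ChatPromptTemplate.from_template("Write a one-sentence summary of: {topic}")

# StrOutputParser makes each streamed chunk a plain string
chain = prompt | llm | StrOutputParser()

for chunk in chain.stream({"topic": "wafer-scale AI accelerators"}):
    print(chunk, end="", flush=True)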

Advanced Usage

Using Different Models

Cerebras supports multiple high-performance models. Choose the right model based on your use case:
from langchain_cerebras import ChatCerebras
import os

# Use Llama 3.3 70B for complex reasoning tasks
llama_70b = ChatCerebras(
    model="llama-3.3-70b",
    api_key=os.getenv("CEREBRAS_API_KEY"),
)

# Use Llama 3.1 8B for faster, lighter tasks
llama_8b = ChatCerebras(
    model="llama3.1-8b",
    api_key=os.getenv("CEREBRAS_API_KEY"),
)

# Use Qwen 3 32B for balanced performance
qwen_32b = ChatCerebras(
    model="qwen-3-32b",
    api_key=os.getenv("CEREBRAS_API_KEY"),
)

# Use GPT-OSS 120B for large-scale tasks
gpt_oss = ChatCerebras(
    model="gpt-oss-120b",
    api_key=os.getenv("CEREBRAS_API_KEY"),
)

Building a RAG Application

Here’s a complete example of building a Retrieval-Augmented Generation (RAG) application with Cerebras and LangChain:
from langchain_cerebras import ChatCerebras
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from operator import itemgetter
import os

# Initialize the model
llm = ChatCerebras(
    model="llama-3.3-70b",
    api_key=os.getenv("CEREBRAS_API_KEY"),
)

# Create a RAG prompt template
template = """Answer the question based only on the following context:
{context}

Question: {question}

Answer:"""

prompt = ChatPromptTemplate.from_template(template)

# Create the RAG chain (itemgetter selects each field from the input dict)
rag_chain = (
    {"context": itemgetter("context"), "question": itemgetter("question")}
    | prompt
    | llm
    | StrOutputParser()
)

# Use the chain
context = "Cerebras has developed the world's largest and fastest AI processor, the Wafer-Scale Engine-3 (WSE-3)."
question = "What has Cerebras developed?"

answer = rag_chain.invoke({"context": context, "question": question})
print(answer)
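In a real RAG pipeline the context usually comes from a retriever rather than being passed in by hand. The sketch below keeps everything self-contained by using a toy keyword-matching function in place of a real retriever; in practice you would substitute something like vector_store.as_retriever():

from langchain_cerebras import ChatCerebras
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
import os

llm = ChatCerebras(
    model="llama-3.3-70b",
    api_key=os.getenv("CEREBRAS_API_KEY"),
)

# Toy document store and retriever; a real application would query a vector store here
documents = [
    "Cerebras has developed the Wafer-Scale Engine-3 (WSE-3), the world's largest AI processor.",
    "LangChain provides composable building blocks for LLM applications.",
]

def retrieve(question: str) -> str:
    matches = [d for d in documents if any(w.lower() in d.lower() for w in question.split())]
    return "\n".join(matches or documents)

prompt = ChatPromptTemplate.from_template(
    "Answer the question based only on the following context:\n{context}\n\nQuestion: {question}\n\nAnswer:"
)

# The question passes straight through, while the retriever supplies the context
rag_chain = (
    {"context": RunnableLambda(retrieve), "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

print(rag_chain.invoke("What has Cerebras developed?"))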

Async Operations

For high-throughput applications, use async operations to handle multiple requests concurrently:
import asyncio
from langchain_cerebras import ChatCerebras
from langchain_core.messages import HumanMessage
import os

async def get_response():
    llm = ChatCerebras(
        model="llama-3.3-70b",
        api_key=os.getenv("CEREBRAS_API_KEY"),
    )
    
    response = await llm.ainvoke([HumanMessage(content="Hello!")])
    return response.content

# Run async function
result = asyncio.run(get_response())
print(result)
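To actually handle multiple requests concurrently, combine ainvoke with asyncio.gather. A minimal sketch:

import asyncio
from langchain_cerebras import ChatCerebras
from langchain_core.messages import HumanMessage
import os

async def ask(llm, question: str) -> str:
    response = await llm.ainvoke([HumanMessage(content=question)])
    return response.content

async def main():
    llm = ChatCerebras(
        model="llama-3.3-70b",
        api_key=os.getenv("CEREBRAS_API_KEY"),
    )
    questions = ["What is LangChain?", "What is LCEL?", "What is the WSE-3?"]

    # Fire all requests at once and wait for every answer
    answers = await asyncio.gather(*(ask(llm, q) for q in questions))
    for question, answer in zip(questions, answers):
        print(f"{question}\n{answer}\n")

asyncio.run(main())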

Using with LangChain Agents

Cerebras models work seamlessly with LangChain agents for building autonomous AI systems:
from langchain_cerebras import ChatCerebras
from langchain.agents import AgentExecutor, create_react_agent
from langchain_core.tools import Tool
from langchain_core.prompts import PromptTemplate
import os

# Initialize the model
llm = ChatCerebras(
    model="llama-3.3-70b",
    api_key=os.getenv("CEREBRAS_API_KEY"),
)

# Define tools
def get_word_length(word: str) -> int:
    """Returns the length of a word."""
    return len(word)

tools = [
    Tool(
        name="GetWordLength",
        func=get_word_length,
        description="Returns the length of a word"
    )
]

# Define ReAct prompt directly (no hub needed)
react_prompt = PromptTemplate.from_template("""Answer the following questions as best you can. You have access to the following tools:

{tools}

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Begin!

Question: {input}
Thought: {agent_scratchpad}""")

# Create agent
agent = create_react_agent(llm, tools, react_prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

# Run agent
result = agent_executor.invoke({"input": "How many letters are in the word 'Cerebras'?"})
print(result["output"])  # AgentExecutor returns a dict with "input" and "output" keys

Using OpenAI Client Directly

If you prefer to use the OpenAI client directly instead of the LangChain integration, you can configure it to work with Cerebras:
from openai import OpenAI
import os

client = OpenAI(
    api_key=os.getenv("CEREBRAS_API_KEY"),
    base_url="https://api.cerebras.ai/v1",
    default_headers={
        "X-Cerebras-3rd-Party-Integration": "langchain"
    }
)

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ],
    max_tokens=500,
)

print(response.choices[0].message.content)

Troubleshooting

Why isn't my API key being picked up?

Make sure your CEREBRAS_API_KEY environment variable is set correctly. You can verify it's loaded by running:
import os
print(os.getenv("CEREBRAS_API_KEY"))
If it returns None, your environment variable isn’t set. Try setting it directly in your code for testing:
llm = ChatCerebras(
    model="llama-3.3-70b",
    api_key="your-api-key-here",
)
What should I do about rate limits?

Cerebras Cloud has generous rate limits, but if you're making many concurrent requests, consider:
  1. Using async operations with controlled concurrency
  2. Implementing retry logic with exponential backoff
  3. Batching requests when possible
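For the first suggestion, an asyncio.Semaphore is a simple way to cap how many requests are in flight at once. A minimal sketch; the limit of 5 is an arbitrary example, not a Cerebras requirement:

import asyncio
from langchain_cerebras import ChatCerebras
from langchain_core.messages import HumanMessage
import os

async def limited_ask(llm, semaphore, question: str) -> str:
    # The semaphore blocks here once the concurrency limit is reached
    async with semaphore:
        response = await llm.ainvoke([HumanMessage(content=question)])
        return response.content

async def main():
    llm = ChatCerebras(
        model="llama-3.3-70b",
        api_key=os.getenv("CEREBRAS_API_KEY"),
    )
    semaphore = asyncio.Semaphore(5)  # at most 5 concurrent requests
    questions = [f"Summarize benefit number {i} of fast inference." for i in range(20)]

    answers = await asyncio.gather(*(limited_ask(llm, semaphore, q) for q in questions))
    print(f"Received {len(answers)} answers")

asyncio.run(main())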
Example with retry logic:
from langchain_cerebras import ChatCerebras
from tenacity import retry, stop_after_attempt, wait_exponential
import os

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
def get_completion(prompt):
    llm = ChatCerebras(
        model="llama-3.3-70b",
        api_key=os.getenv("CEREBRAS_API_KEY"),
    )
    return llm.invoke(prompt)
Should I use ChatCerebras or the OpenAI client directly?

ChatCerebras is a native LangChain integration that:
  1. Provides a consistent interface with other LangChain chat models
  2. Automatically handles message formatting and parsing
  3. Supports all LangChain features like callbacks, streaming, and async
  4. Includes proper integration tracking headers
  5. Works seamlessly with LangChain chains and agents
If you’re building with LangChain, use ChatCerebras. If you need direct API access, use the OpenAI client with Cerebras base URL.
Can I use LangSmith with Cerebras models?

Yes! LangSmith provides powerful debugging and monitoring capabilities for LangChain applications. To enable LangSmith tracing:
import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-langsmith-api-key"

from langchain_cerebras import ChatCerebras

llm = ChatCerebras(
    model="llama-3.3-70b",
    api_key=os.getenv("CEREBRAS_API_KEY"),
)

# All calls will now be traced in LangSmith
response = llm.invoke("Hello!")
Visit LangSmith to view your traces and debug your applications.
Which Cerebras model should I choose?

Choose based on your use case:
  • llama-3.3-70b: Best for complex reasoning, long-form content, and tasks requiring deep understanding
  • qwen-3-32b: Balanced performance for general-purpose applications
  • llama3.1-8b: Fastest option for simple tasks and high-throughput scenarios
  • gpt-oss-120b: Largest model for the most demanding tasks
All models run at blazing-fast speeds on Cerebras hardware. Learn more about available models.

Next Steps

  • Explore LangChain Documentation - Visit the official LangChain docs to learn about chains, agents, and more
  • Try Different Cerebras Models - Experiment with our available models to find the best fit for your use case
  • Build Complex Chains - Combine multiple LangChain components to create sophisticated AI workflows
  • Explore LangSmith - Use LangSmith for debugging and monitoring your LangChain applications
  • Join the Community - Connect with other developers in the LangChain Discord
  • Read the API Reference - Check out our Chat Completions API documentation for detailed API information

Additional Resources