LangChain is a framework for developing applications powered by large language models (LLMs). It provides a standard interface for chains, lots of integrations with other tools, and end-to-end chains for common applications. By combining Cerebras’s ultra-fast inference with LangChain’s powerful orchestration capabilities, you can build production-ready AI applications with unprecedented speed and flexibility.

Prerequisites

Before you begin, ensure you have:
  • Cerebras API Key - Get a free API key here
  • Python 3.9 or higher - required by LangChain
  • Basic familiarity with LangChain - Visit LangChain documentation to learn more

Configure LangChain with Cerebras

Step 1: Install required dependencies

Install the LangChain Cerebras integration package. This package provides native LangChain integration for Cerebras models, including chat models and embeddings.
Dependency resolution: If you encounter dependency conflicts during installation, try running the install command twice. The first run may update core dependencies, and the second run will resolve any remaining conflicts. This is a known behavior with some package managers when updating to newer versions of langchain-core.
pip install langchain-cerebras langchain
Step 2: Configure environment variables

Create a .env file in your project directory to securely store your API key. This keeps your credentials separate from your code.
CEREBRAS_API_KEY=your-cerebras-api-key-here
Alternatively, you can set the environment variable in your shell:
export CEREBRAS_API_KEY="your-cerebras-api-key-here"
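If you go the .env route, keep in mind that os.getenv only sees variables already present in the process environment. A minimal sketch for loading the file at startup, assuming the python-dotenv package is installed (pip install python-dotenv):

from dotenv import load_dotenv
import os

# Read key=value pairs from .env into the process environment
load_dotenv()

# Confirm the key is now visible to os.getenv
print(os.getenv("CEREBRAS_API_KEY") is not None)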
Step 3: Initialize the Cerebras chat model

Import and initialize the Cerebras chat model. The ChatCerebras class provides a LangChain-compatible interface that automatically handles connection to Cerebras Cloud and includes proper tracking headers.
from langchain_cerebras import ChatCerebras
import os

# Initialize the Cerebras chat model
llm = ChatCerebras(
    model="llama-3.3-70b",
    api_key=os.getenv("CEREBRAS_API_KEY"),
    temperature=0.7,
    max_tokens=1024,
)
Step 4: Make your first request

Now you can use the model just like any other LangChain chat model. This example demonstrates basic message handling with system and user messages.
from langchain_cerebras import ChatCerebras
from langchain_core.messages import HumanMessage, SystemMessage
import os

# Initialize the model
llm = ChatCerebras(
    model="llama-3.3-70b",
    api_key=os.getenv("CEREBRAS_API_KEY"),
)

# Create messages
messages = [
    SystemMessage(content="You are a helpful AI assistant."),
    HumanMessage(content="What are the key benefits of using Cerebras for AI inference?")
]

# Get response
response = llm.invoke(messages)
print(response.content)
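The returned value is a LangChain AIMessage, so you can inspect more than the text itself. A small sketch; exact metadata fields depend on your langchain-core version, so treat usage_metadata as optional:

# response is the AIMessage produced by llm.invoke(messages) above
print(type(response).__name__)   # AIMessage
print(response.content)          # the generated text

# Newer langchain-core versions attach token usage when the provider reports it
print(getattr(response, "usage_metadata", None))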
Step 5: Use with LangChain chains

LangChain’s real power comes from chaining operations together. This example uses LCEL (LangChain Expression Language) to create a composable translation chain.
from langchain_cerebras import ChatCerebras
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
import os

# Initialize components
llm = ChatCerebras(
    model="llama-3.3-70b",
    api_key=os.getenv("CEREBRAS_API_KEY"),
)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant that translates {input_language} to {output_language}."),
    ("human", "{text}")
])

output_parser = StrOutputParser()

# Create chain using LCEL
chain = prompt | llm | output_parser

# Use the chain
result = chain.invoke({
    "input_language": "English",
    "output_language": "French",
    "text": "Hello, how are you?"
})

print(result)
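Because the composed chain is a Runnable, it also supports batch for running several inputs in one call. A minimal sketch reusing the translation chain defined above:

# Translate multiple inputs; results come back in the same order as the inputs
inputs = [
    {"input_language": "English", "output_language": "French", "text": "Good morning"},
    {"input_language": "English", "output_language": "Spanish", "text": "See you tomorrow"},
]

results = chain.batch(inputs)
for translation in results:
    print(translation)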
Step 6: Enable streaming responses

Cerebras models support streaming, which is perfect for real-time applications. Streaming allows you to display responses as they’re generated, providing a better user experience.
from langchain_cerebras import ChatCerebras
from langchain_core.messages import HumanMessage
import os

# Initialize with streaming enabled
llm = ChatCerebras(
    model="llama-3.3-70b",
    api_key=os.getenv("CEREBRAS_API_KEY"),
    streaming=True,
)

# Stream the response
for chunk in llm.stream([HumanMessage(content="Write a short poem about AI")]):
    print(chunk.content, end="", flush=True)
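Streaming also composes with LCEL chains: calling stream on a chain yields incremental chunks, which arrive as plain strings once the chain ends in StrOutputParser. A short, self-contained sketch (the prompt text is just an example):

from langchain_cerebras import ChatCerebras
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
import os

llm = ChatCerebras(
    model="llama-3.3-70b",
    api_key=os.getenv("CEREBRAS_API_KEY"),
)

prompt = ChatPromptTemplate.from_template("Write a one-sentence summary of: {topic}")

# StrOutputParser makes each streamed chunk a plain string
chain = prompt | llm | StrOutputParser()

for chunk in chain.stream({"topic": "wafer-scale AI accelerators"}):
    print(chunk, end="", flush=True)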

Advanced Usage

Using Different Models

Cerebras supports multiple high-performance models. Choose the right model based on your use case:
from langchain_cerebras import ChatCerebras
import os

# Use Llama 3.3 70B for complex reasoning tasks
llama_70b = ChatCerebras(
    model="llama-3.3-70b",
    api_key=os.getenv("CEREBRAS_API_KEY"),
)

# Use Llama 3.1 8B for faster, lighter tasks
llama_8b = ChatCerebras(
    model="llama3.1-8b",
    api_key=os.getenv("CEREBRAS_API_KEY"),
)

# Use Qwen 3 32B for balanced performance
qwen_32b = ChatCerebras(
    model="qwen-3-32b",
    api_key=os.getenv("CEREBRAS_API_KEY"),
)

# Use GPT-OSS 120B for large-scale tasks
gpt_oss = ChatCerebras(
    model="gpt-oss-120b",
    api_key=os.getenv("CEREBRAS_API_KEY"),
)

Building a RAG Application

Here’s a complete example of building a Retrieval-Augmented Generation (RAG) application with Cerebras and LangChain:
from langchain_cerebras import ChatCerebras
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from operator import itemgetter
import os

# Initialize the model
llm = ChatCerebras(
    model="llama-3.3-70b",
    api_key=os.getenv("CEREBRAS_API_KEY"),
)

# Create a RAG prompt template
template = """Answer the question based only on the following context:
{context}

Question: {question}

Answer:"""

prompt = ChatPromptTemplate.from_template(template)

# Create the RAG chain (itemgetter selects each field from the input dict)
rag_chain = (
    {"context": itemgetter("context"), "question": itemgetter("question")}
    | prompt
    | llm
    | StrOutputParser()
)

# Use the chain
context = "Cerebras has developed the world's largest and fastest AI processor, the Wafer-Scale Engine-3 (WSE-3)."
question = "What has Cerebras developed?"

answer = rag_chain.invoke({"context": context, "question": question})
print(answer)
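In a real RAG pipeline the context usually comes from a retriever rather than being passed in by hand. The sketch below keeps everything self-contained by using a toy keyword-matching function in place of a real retriever; in practice you would substitute something like vector_store.as_retriever():

from langchain_cerebras import ChatCerebras
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
import os

llm = ChatCerebras(
    model="llama-3.3-70b",
    api_key=os.getenv("CEREBRAS_API_KEY"),
)

# Toy document store and retriever; a real application would query a vector store here
documents = [
    "Cerebras has developed the Wafer-Scale Engine-3 (WSE-3), the world's largest AI processor.",
    "LangChain provides composable building blocks for LLM applications.",
]

def retrieve(question: str) -> str:
    matches = [d for d in documents if any(w.lower() in d.lower() for w in question.split())]
    return "\n".join(matches or documents)

prompt = ChatPromptTemplate.from_template(
    "Answer the question based only on the following context:\n{context}\n\nQuestion: {question}\n\nAnswer:"
)

# The question passes straight through, while the retriever supplies the context
rag_chain = (
    {"context": RunnableLambda(retrieve), "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

print(rag_chain.invoke("What has Cerebras developed?"))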

Async Operations

For high-throughput applications, use async operations to handle multiple requests concurrently:
import asyncio
from langchain_cerebras import ChatCerebras
from langchain_core.messages import HumanMessage
import os

async def get_response():
    llm = ChatCerebras(
        model="llama-3.3-70b",
        api_key=os.getenv("CEREBRAS_API_KEY"),
    )
    
    response = await llm.ainvoke([HumanMessage(content="Hello!")])
    return response.content

# Run async function
result = asyncio.run(get_response())
print(result)
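To actually handle multiple requests concurrently, combine ainvoke with asyncio.gather. A minimal sketch:

import asyncio
from langchain_cerebras import ChatCerebras
from langchain_core.messages import HumanMessage
import os

async def ask(llm, question: str) -> str:
    response = await llm.ainvoke([HumanMessage(content=question)])
    return response.content

async def main():
    llm = ChatCerebras(
        model="llama-3.3-70b",
        api_key=os.getenv("CEREBRAS_API_KEY"),
    )
    questions = ["What is LangChain?", "What is LCEL?", "What is the WSE-3?"]

    # Fire all requests at once and wait for every answer
    answers = await asyncio.gather(*(ask(llm, q) for q in questions))
    for question, answer in zip(questions, answers):
        print(f"{question}\n{answer}\n")

asyncio.run(main())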

Using with LangChain Agents

Cerebras models work seamlessly with LangChain agents for building autonomous AI systems:
from langchain_cerebras import ChatCerebras
from langchain.agents import AgentExecutor, create_react_agent
from langchain_core.tools import Tool
from langchain_core.prompts import PromptTemplate
import os

# Initialize the model
llm = ChatCerebras(
    model="llama-3.3-70b",
    api_key=os.getenv("CEREBRAS_API_KEY"),
)

# Define tools
def get_word_length(word: str) -> int:
    """Returns the length of a word."""
    return len(word)

tools = [
    Tool(
        name="GetWordLength",
        func=get_word_length,
        description="Returns the length of a word"
    )
]

# Define ReAct prompt directly (no hub needed)
react_prompt = PromptTemplate.from_template("""Answer the following questions as best you can. You have access to the following tools:

{tools}

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Begin!

Question: {input}
Thought: {agent_scratchpad}""")

# Create agent
agent = create_react_agent(llm, tools, react_prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

# Run agent
result = agent_executor.invoke({"input": "How many letters are in the word 'Cerebras'?"})
print(result["output"])  # AgentExecutor returns a dict with "input" and "output" keys

Using OpenAI Client Directly

If you prefer to use the OpenAI client directly instead of the LangChain integration, you can configure it to work with Cerebras:
from openai import OpenAI
import os

client = OpenAI(
    api_key=os.getenv("CEREBRAS_API_KEY"),
    base_url="https://api.cerebras.ai/v1",
    default_headers={
        "X-Cerebras-3rd-Party-Integration": "langchain"
    }
)

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ],
    max_tokens=500,
)

print(response.choices[0].message.content)

Troubleshooting

Why isn't my API key being picked up?

Make sure your CEREBRAS_API_KEY environment variable is set correctly. You can verify it's loaded by running:
import os
print(os.getenv("CEREBRAS_API_KEY"))
If it returns None, your environment variable isn’t set. Try setting it directly in your code for testing:
llm = ChatCerebras(
    model="llama-3.3-70b",
    api_key="your-api-key-here",
)
What should I do about rate limits?

Cerebras Cloud has generous rate limits, but if you're making many concurrent requests, consider:
  1. Using async operations with controlled concurrency
  2. Implementing retry logic with exponential backoff
  3. Batching requests when possible
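For the first suggestion, an asyncio.Semaphore is a simple way to cap how many requests are in flight at once. A minimal sketch; the limit of 5 is an arbitrary example, not a Cerebras requirement:

import asyncio
from langchain_cerebras import ChatCerebras
from langchain_core.messages import HumanMessage
import os

async def limited_ask(llm, semaphore, question: str) -> str:
    # The semaphore blocks here once the concurrency limit is reached
    async with semaphore:
        response = await llm.ainvoke([HumanMessage(content=question)])
        return response.content

async def main():
    llm = ChatCerebras(
        model="llama-3.3-70b",
        api_key=os.getenv("CEREBRAS_API_KEY"),
    )
    semaphore = asyncio.Semaphore(5)  # at most 5 concurrent requests
    questions = [f"Summarize benefit number {i} of fast inference." for i in range(20)]

    answers = await asyncio.gather(*(limited_ask(llm, semaphore, q) for q in questions))
    print(f"Received {len(answers)} answers")

asyncio.run(main())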
Example with retry logic:
from langchain_cerebras import ChatCerebras
from tenacity import retry, stop_after_attempt, wait_exponential
import os

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
def get_completion(prompt):
    llm = ChatCerebras(
        model="llama-3.3-70b",
        api_key=os.getenv("CEREBRAS_API_KEY"),
    )
    return llm.invoke(prompt)
Should I use ChatCerebras or the OpenAI client directly?

ChatCerebras is a native LangChain integration that:
  1. Provides a consistent interface with other LangChain chat models
  2. Automatically handles message formatting and parsing
  3. Supports all LangChain features like callbacks, streaming, and async
  4. Includes proper integration tracking headers
  5. Works seamlessly with LangChain chains and agents
If you’re building with LangChain, use ChatCerebras. If you need direct API access, use the OpenAI client with Cerebras base URL.
Can I use LangSmith with Cerebras models?

Yes! LangSmith provides powerful debugging and monitoring capabilities for LangChain applications. To enable LangSmith tracing:
import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-langsmith-api-key"

from langchain_cerebras import ChatCerebras

llm = ChatCerebras(
    model="llama-3.3-70b",
    api_key=os.getenv("CEREBRAS_API_KEY"),
)

# All calls will now be traced in LangSmith
response = llm.invoke("Hello!")
Visit LangSmith to view your traces and debug your applications.
Which Cerebras model should I choose?

Choose based on your use case:
  • llama-3.3-70b: Best for complex reasoning, long-form content, and tasks requiring deep understanding
  • qwen-3-32b: Balanced performance for general-purpose applications
  • llama3.1-8b: Fastest option for simple tasks and high-throughput scenarios
  • gpt-oss-120b: Largest model for the most demanding tasks
All models run at blazing-fast speeds on Cerebras hardware. Learn more about available models.

Next Steps

  • Explore LangChain Documentation - Visit the official LangChain docs to learn about chains, agents, and more
  • Try Different Cerebras Models - Experiment with our available models to find the best fit for your use case
  • Build Complex Chains - Combine multiple LangChain components to create sophisticated AI workflows
  • Explore LangSmith - Use LangSmith for debugging and monitoring your LangChain applications
  • Join the Community - Connect with other developers in the LangChain Discord
  • Read the API Reference - Check out our Chat Completions API documentation for detailed API information

Additional Resources