The speed of your inference provider determines what’s feasible to put in the critical path of a user interaction or application flow. At typical GPU inference speeds (1–5+ seconds per response), it’s common to work around latency with async job queues, streaming UIs, and client code that assumes one token per SSE event. When you switch to Cerebras, those workarounds can become the source of the latency you were trying to avoid. For any architectural decision in your stack, ask: was this made because LLM calls are slow? If so, it’s worth reconsidering.

This guide walks through the patterns most likely to need rethinking when you move to Cerebras. They span the full stack, from frontend rendering and streaming behavior to backend request handling, agent orchestration, and voice pipelines. Not every pattern will apply to your use case, but all of them reflect real issues we’ve seen in production migrations.

Running Example

Each pattern is illustrated through a coding assistant app with a Python backend (assistant.py) and JavaScript frontend. The backend setup is shared across all patterns:
# assistant.py
from cerebras.cloud.sdk import Cerebras, AsyncCerebras

# Synchronous client — used for request/response patterns
client = Cerebras()

# Async client — used for concurrent and streaming patterns
async_client = AsyncCerebras()

SYSTEM_PROMPT = "You are an expert coding assistant. Be concise and precise."

Pattern 1: Don’t let your UI become the bottleneck

The GPU approach: When inference is slow, elaborate loading states are a feature. You might incorporate animated progress trees, character-level streaming effects, and detailed visualizations to give users something to look at while the model thinks.

The Cerebras approach: At Cerebras speeds, if the frontend is doing significant work to render each streaming chunk, the interface may spend more time rendering than the model spends generating; the model finishes before the UI catches up. For the coding assistant, this surfaces in the editor’s streaming display: if the UI updates every time new tokens arrive, it can’t keep up with Cerebras token rates and the display visibly lags. A simple fix is to batch incoming tokens and update the screen on a short timer (e.g. every 50ms) instead of on every event:
// editor.js — buffered streaming renderer
// Re-renders on a fixed interval rather than on every chunk.
// At Cerebras token rates, per-chunk re-renders create unnecessary overhead.

function streamIntoEditor(responseStream) {
  let buffer = "";

  const flushInterval = setInterval(() => {
    if (buffer) {
      appendToEditor(buffer);
      buffer = "";
    }
  }, 50);

  responseStream.on("chunk", (text) => {
    buffer += text;
  });

  responseStream.on("done", () => {
    clearInterval(flushInterval);
    if (buffer) appendToEditor(buffer);
  });
}
Other UI and infrastructure patterns to audit when migrating to Cerebras:
  • Agent progress trees — detailed per-step visualization of tool calls and reasoning. At GPU speeds, this gives users something to follow while waiting for a response. At Cerebras speeds, steps complete so quickly that the UI flicker can be more disorienting than helpful. A simpler “working…” state that resolves to a final result is often better.
  • Per-event streaming animations — character-level fade-ins or cursor effects that fire on every SSE event. Beyond rendering overhead, these will behave unexpectedly on Cerebras because each event may carry several tokens, not one. See Pattern 3 for details.
  • Infrastructure overhead — code that runs between your app and the Cerebras API (authentication, logging, data formatting) adds up in ways that don’t matter at GPU speeds, but can become noticeable when inference is fast. If you’re not seeing the performance you expect, measure the full request cycle, not just the model call.
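Measuring the full request cycle can be as simple as a timer around each stage. Here is a minimal sketch using only the standard library; the stage names and the sleep-based stand-ins are illustrative, not part of any real pipeline:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label: str, timings: dict):
    # Record wall-clock time for one stage of the request cycle.
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start

def handle_request(timings: dict) -> str:
    # Stand-ins for real stages; replace with your auth, model call, logging.
    with timed("auth", timings):
        time.sleep(0.001)
    with timed("model", timings):
        time.sleep(0.002)
    with timed("logging", timings):
        time.sleep(0.001)
    return "ok"

timings: dict = {}
handle_request(timings)
overhead = sum(v for k, v in timings.items() if k != "model")
# If overhead rivals the model time, the pipeline, not inference, is the bottleneck.
```

Comparing the non-model total against the model stage tells you whether the surrounding infrastructure, rather than inference, now dominates latency.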
The general principle: design your UI and infrastructure to keep up with fast inference, not the other way around.

Pattern 2: Synchronous AI in the request path

The GPU approach: When an LLM call takes several seconds, you can’t afford to make the user (and the server) wait that long in a normal request/response cycle. So you put it on a job queue, hand the user a task ID, and poll until it’s done. That means standing up a worker, a queue, state tracking, and a polling or webhook layer — all to handle what’s fundamentally one function call.

The Cerebras approach: Make the LLM call directly in the request handler and return the result synchronously, the same way you’d call a database or external API:
# assistant.py — ask()
# Answers short, focused coding questions synchronously.

def ask(question: str, code_context: str = "") -> str:
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]

    if code_context:
        messages.append({
            "role": "user",
            "content": f"Given this code:\n\n```\n{code_context}\n```\n\n{question}"
        })
    else:
        messages.append({"role": "user", "content": question})

    response = client.chat.completions.create(
        model="zai-glm-4.7",
        messages=messages,
        max_completion_tokens=300
    )

    return response.choices[0].message.content
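As a concrete sketch of the synchronous path, here is a handler that calls the model inline using only the standard library. `answer_question` is a stand-in for `ask()` so the sketch runs without an API key; a real app would use whatever web framework it already has:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

def answer_question(question: str) -> str:
    # Stand-in for ask(); a real handler would call the Cerebras SDK here.
    return f"(answer for: {question})"

class AskHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        # The LLM call happens inline: no queue, no task ID, no polling loop.
        answer = answer_question(payload["question"])
        body = json.dumps({"answer": answer}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging

def ask_over_http(question: str) -> str:
    # Spin up a throwaway server on an ephemeral port and issue one request.
    server = HTTPServer(("127.0.0.1", 0), AskHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        req = urllib.request.Request(
            f"http://127.0.0.1:{server.server_port}/ask",
            data=json.dumps({"question": question}).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())["answer"]
    finally:
        server.shutdown()

print(ask_over_http("What does this regex do?"))
```

The handler blocks for the duration of one model call, which is acceptable when that call returns in well under a second.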
There are exceptions, such as very long outputs, batch workloads, or work you want to offload from a latency-sensitive service; for those, a queue still makes sense. For everything else, try the synchronous path first.

Pattern 3: Use streaming selectively

The GPU approach: Streaming is often treated as a best practice for AI applications, but it exists to reduce time to first token (TTFT) — the delay between sending a request and seeing the first word of a response. When inference is slow, even a partial response appearing quickly feels better than a blank screen, so streaming becomes the default for all AI calls.

The Cerebras approach: Stream selectively based on response length:
  • Short responses (under ~200 tokens): The complete response arrives so quickly that streaming offers little perceptual benefit. A synchronous call that returns in under a second often feels faster than watching a brief response stream in word by word.
  • Long responses (200+ tokens): Streaming still meaningfully improves the experience. The user sees content immediately rather than waiting for a full response to generate.
For the coding assistant, a one-line explanation of a variable doesn’t need to be streamed. A 100-line refactored function should. The generate() function handles both:
# assistant.py — generate()
# Routes between synchronous and streaming delivery based on expected output length.
# Streaming adds complexity — only use it when the output is long enough to benefit.

from typing import Generator, Union

def generate(prompt: str, stream: bool = False) -> Union[str, Generator[str, None, None]]:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": prompt}
    ]

    if not stream:
        # Short outputs: return synchronously
        response = client.chat.completions.create(
            model="zai-glm-4.7",
            messages=messages,
            max_completion_tokens=200
        )
        return response.choices[0].message.content

    else:
        # Long outputs: stream so the user can start reading immediately
        def streamed():
            response_stream = client.chat.completions.create(
                model="zai-glm-4.7",
                messages=messages,
                stream=True
            )
            for chunk in response_stream:
                text = chunk.choices[0].delta.content
                if text:
                    yield text

        return streamed()
Cerebras streams multiple tokens per SSE event because tokens are generated faster than most providers deliver them. Cerebras delivers roughly 200 evenly spaced events per second regardless of token count, so each event may carry several tokens or none at all. Clients built for slow inference commonly assume one token per event; if your streaming handler processes tokens individually or measures progress by event count, it will behave incorrectly. Always process the full content of each chunk rather than using event frequency as a signal.
  // Broken: assumes one token per event
  source.onmessage = (e) => {
    const token = JSON.parse(e.data).choices[0].delta.content;
    appendToken(token); // may contain many tokens, not one
    tokenCount += 1;    // this count will be wrong
  };

  // Correct: handle the full chunk content
  source.onmessage = (e) => {
    const chunk = JSON.parse(e.data).choices[0].delta.content ?? "";
    appendText(chunk); // append whatever arrived, however many tokens
  };

Pattern 4: Multi-step agent loops in real time

The GPU approach: A coding task like “find every place where this function is called and add error handling” might need 5–10 LLM steps: reading files, planning changes, editing, and verifying. At GPU speeds that loop takes 30–60 seconds, so it has to run as a background job with a progress page.

The Cerebras approach: That same loop finishes in 2–3 seconds. It can run synchronously in the request path and return results in real time:
# run_coding_agent()
# Runs a multi-step coding agent synchronously.
# Each tool call + LLM step completes fast enough to stay in the request path.

import json

def run_coding_agent(task, tool_handler):
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": task}
    ]

    while True:
        response = client.chat.completions.create(
            model="zai-glm-4.7",
            messages=messages,
            tools=TOOLS,
            tool_choice="auto"
        )

        choice = response.choices[0]
        messages.append(choice.message)

        if not choice.message.tool_calls:
            return choice.message.content

        for tool_call in choice.message.tool_calls:
            result = tool_handler(
                tool_call.function.name,
                json.loads(tool_call.function.arguments)
            )
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": json.dumps(result)
            })
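The loop above assumes a TOOLS schema and a tool_handler. One possible shape, in the OpenAI-compatible function-calling format; the tool names (read_file, write_file) and their fields are illustrative, not part of any Cerebras API:

```python
import pathlib

# Illustrative tool schema for a coding agent.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read a file from the workspace and return its contents.",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {"type": "string", "description": "Workspace-relative path"}
                },
                "required": ["path"],
                "additionalProperties": False,
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "write_file",
            "description": "Overwrite a file in the workspace with new contents.",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {"type": "string"},
                    "contents": {"type": "string"},
                },
                "required": ["path", "contents"],
                "additionalProperties": False,
            },
        },
    },
]

def tool_handler(name: str, args: dict):
    # Dispatch a tool call from the agent loop to a local implementation.
    if name == "read_file":
        return {"contents": pathlib.Path(args["path"]).read_text()}
    if name == "write_file":
        pathlib.Path(args["path"]).write_text(args["contents"])
        return {"ok": True}
    return {"error": f"unknown tool: {name}"}
```

Each handler returns a JSON-serializable dict, which the loop dumps into the tool message for the next model step.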
Because loops complete quickly on Cerebras, you can afford more steps without adding seconds to the response. For example, an agent that reads code, plans changes, and edits files now has room to also run tests and fix failures within a single request.

Pattern 5: Voice input

The GPU approach: Voice agents require elaborate latency compensation like filler words, “thinking” sounds, and pre-generated response fragments, because the gap between a user finishing speaking and the agent responding is several seconds.

The Cerebras approach: The LLM step shrinks to a small fraction of total pipeline latency, which is now dominated by speech-to-text and text-to-speech processing. The assistant can respond before the latency becomes perceptible. Here’s a single voice turn that handles a tool call and responds:
# assistant.py — voice_turn()
# Handles a single voice agent turn: transcribed input in, spoken response out.
# The LLM step (including any tool calls) completes fast enough that no
# latency compensation is needed before TTS begins.

async def voice_turn(transcript, conversation_history, tool_handler):
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        *conversation_history,
        {"role": "user", "content": transcript}
    ]

    tools = [
        {
            "type": "function",
            "function": {
                "name": "find_symbol",
                "description": "Find a function, class, or variable in the codebase and return its definition.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "name": {"type": "string", "description": "Symbol name to look up"}
                    },
                    "required": ["name"],
                    "additionalProperties": False
                }
            }
        }
    ]

    response = await async_client.chat.completions.create(
        model="zai-glm-4.7",
        messages=messages,
        tools=tools,
        tool_choice="auto",
        max_completion_tokens=150
    )

    choice = response.choices[0]

    if choice.message.tool_calls:
        tool_call = choice.message.tool_calls[0]
        result = await tool_handler(
            tool_call.function.name,
            json.loads(tool_call.function.arguments)
        )

        messages.append(choice.message)
        messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": json.dumps(result)
        })

        final = await async_client.chat.completions.create(
            model="zai-glm-4.7",
            messages=messages,
            max_completion_tokens=150
        )
        return final.choices[0].message.content

    return choice.message.content
For production voice agents, stream the response and pipe the first sentence to TTS as soon as it completes, rather than waiting for the full response. This further reduces the perceived gap between the user speaking and the agent responding.
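A minimal sketch of that first-sentence handoff. The sentence-boundary regex and the fake chunk stream are illustrative; in production the chunks would come from a streaming chat completion and the returned sentence would go straight to your TTS engine:

```python
import asyncio
import re

SENTENCE_END = re.compile(r"[.!?](?:\s|$)")

async def first_sentence(chunks) -> str:
    # Accumulate streamed text and return as soon as a full sentence lands,
    # so TTS can start speaking while the rest of the response generates.
    buffer = ""
    async for text in chunks:
        buffer += text
        match = SENTENCE_END.search(buffer)
        if match:
            return buffer[:match.end()].strip()
    return buffer.strip()

async def fake_stream():
    # Stand-in for a streaming completion; yields multi-token chunks.
    for piece in ["Sure, use", " a dict here.", " It gives O(1) lookups."]:
        yield piece

print(asyncio.run(first_sentence(fake_stream())))
```

The remainder of the stream can keep buffering in the background and be spoken sentence by sentence after the first one starts playing.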
See the LiveKit Voice Agent and Real-Time Voice Translation cookbooks for complete working examples.

Conclusion

The patterns in this guide aren’t new techniques, but what changes with Cerebras is the tradeoff — approaches that were too slow to justify become practical, and optimizations built for slow inference can become liabilities. Use this guide to audit what you’ve already built and simplify where it makes sense. Not every pattern will apply, and some of the old approaches still have their place. But the default assumption that inference is the bottleneck has changed, and the simplest architecture is now often the fastest one.