Skip to main content
This feature is in Private Preview. For access or more information, contact us or reach out to your account representative.
Vision-capable models can understand visual content alongside text — including objects, diagrams, screenshots, and any text that appears within an image (see Limitations for exceptions). Images are sent through the Chat Completions API as base64-encoded data URIs in the messages array.
Currently, image support is only available with gemma-4-31b.

Usage

To send an image, add an image_url object to the content array in a user message. The image must be base64-encoded and passed as a data URI.
Use the encoder in the Token Usage section to convert your image to a base64 data URI. It also shows the estimated token count and encoded payload size.
from cerebras.cloud.sdk import Cerebras
import os
import base64

client = Cerebras(api_key=os.environ.get("CEREBRAS_API_KEY"))

def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

base64_image = encode_image("screenshot.png")

response = client.chat.completions.create(
    model="gemma-4-31b",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one concise sentence."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{base64_image}"
                    },
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)

Input Requirements

RequirementDetails
Supported formatsPNG (.png), JPEG (.jpeg, .jpg)
EncodingBase64 data URI (e.g., data:image/png;base64,...)
External image URLsNot supported during Public Preview
Max payload size10 MB total image payload per request 1
Max images per request5 1
1 These limits apply to the shared tier during Public Preview. Higher limits may be available for Dedicated Endpoints.

Token Usage

gemma-4-31b uses the default preprocessing setting of up to 280 image tokens per image. The model preserves image aspect ratio during preprocessing. Depending on the input dimensions, the image may be downscaled or upscaled before tokenization. The processed height and width are then rounded down to the nearest multiple of 48. As a result, token usage depends on the processed image dimensions, not the uploaded file size or original resolution.

Estimate Token Count

Upload an image below to copy its base64 data URI, check the encoded size, and view a token estimate. You can also estimate the token count manually with the following steps:
  1. Start with the input width and height.
  2. Compute the scale factor:
    scale = sqrt(645120 / (width × height))
    
  3. Multiply the width and height by the scale factor.
    scaled_width = width × scale
    scaled_height = height × scale
    
  4. Round each processed dimension down to the nearest multiple of 48.
  5. Compute the token count:
    image_tokens = (processed_width / 48) × (processed_height / 48)
    
  6. Cap the result at 280.
This means smaller images do not always use fewer image tokens. For example, a 336 × 226 image is upscaled during preprocessing to 960 × 624, which uses 260 image tokens.
Input resolutionProcessed resolutionImage tokens used
336 × 226960 × 624260
512 × 512768 × 768256
672 × 672768 × 768256
1024 × 1024768 × 768256
1280 × 7201056 × 576264
1920 × 10801056 × 576264
2560 × 14401056 × 576264
3840 × 21601056 × 576264
336 × 480672 × 960280
480 × 336960 × 672280
To validate token usage for a specific request, send two otherwise identical requests — one with the image and one without — and compare the usage.prompt_tokens values. Cerebras does not currently return a separate image_tokens field. Keep the following in mind:
  • 280 is the maximum image tokens per image for gemma-4-31b on Cerebras.
  • Compressed file size does not directly determine token count. Processed image dimensions matter more than PNG or JPEG byte size.
  • Image tokens are added to text prompt tokens and reported together in prompt_tokens in the API response.
  • Image tokens occupy part of the model context window, just like text prompt tokens.

Limitations

  • Medical images — not suitable for interpreting specialized medical images such as CT scans or MRIs. Do not use for medical diagnosis or advice.
  • Small text — may have difficulty reading small or low-resolution text. Enlarging text within the image before sending can improve results.
  • Rotated content — may misinterpret text or images that are rotated or upside-down.
  • Graphs and charts — may struggle to distinguish visual elements that differ only in color or line style, such as solid versus dashed lines.
  • Spatial reasoning — not reliable for tasks requiring precise spatial localization, such as identifying positions on a map or board game.
  • Object counting — the model may give approximate counts for objects in images.
  • Image shape — may perform less accurately on panoramic or fisheye images.
  • Preprocessing — the model cannot access original filenames or metadata. Images may be resized before analysis — see Token Usage for details.
  • Accuracy — the model may generate inaccurate descriptions or captions in some scenarios. Verify outputs for high-stakes use cases.
  • CAPTCHAs — CAPTCHA images are not supported.
  • Indirect prompt injection — text embedded in an image is included in the model’s prompt context alongside the user’s text. If an image contains adversarial instructions (for example, text that says “ignore all previous instructions”) and the user prompt asks the model to answer based on the image, the model may follow those embedded instructions. Treat image content from untrusted sources as untrusted input, and use a system prompt to constrain the model’s behavior when processing images you don’t control.
  • Untrusted output — the model may transcribe or describe text from an image verbatim, including HTML, script tags, URLs, or control characters. The API returns this content unmodified. Treat it the same as any other untrusted input before rendering, logging, or executing it in your application.

FAQs

Yes. Cerebras Chat Completions is stateless. If a follow-up request depends on an earlier image, include that image-bearing turn in the conversation history you send with the new request. Continue to include that turn for as long as the model needs the visual context.
No, only image input is supported. The model returns text only and does not generate images.
Yes. Prompt caching can help with repeated images and repeated multimodal context within your organization. Prompt caches are never shared between organizations and remain ephemeral. See Prompt Caching.
No. Image support uses the same rate limit framework as text. The same request and token limits still apply based on your organization and tier. For current details, see Rate Limits.
Image inputs are processed as soon as they are received, and the original image payloads are not persisted. After preprocessing, image tokens and image embeddings may be cached ephemerally within your organization to support prompt caching.