This feature is in Private Preview. For access or more information, contact us or reach out to your account representative.
messages array.
Currently, image support is only available with
gemma-4-31b.Usage
To send an image, add animage_url object to the content array in a user message. The image must be base64-encoded and passed as a data URI.
- Single image
- Multiple images
Input Requirements
| Requirement | Details |
|---|---|
| Supported formats | PNG (.png), JPEG (.jpeg, .jpg) |
| Encoding | Base64 data URI (e.g., data:image/png;base64,...) |
| External image URLs | Not supported during Public Preview |
| Max payload size | 10 MB total image payload per request 1 |
| Max images per request | 5 1 |
1 These limits apply to the shared tier during Public Preview. Higher limits may be available for Dedicated Endpoints.
Token Usage
gemma-4-31b uses the default preprocessing setting of up to 280 image tokens per image. The model preserves image aspect ratio during preprocessing. Depending on the input dimensions, the image may be downscaled or upscaled before tokenization. The processed height and width are then rounded down to the nearest multiple of 48.
As a result, token usage depends on the processed image dimensions, not the uploaded file size or original resolution.
Estimate Token Count
Upload an image below to copy its base64 data URI, check the encoded size, and view a token estimate. You can also estimate the token count manually with the following steps:- Start with the input width and height.
-
Compute the scale factor:
-
Multiply the width and height by the scale factor.
- Round each processed dimension down to the nearest multiple of 48.
-
Compute the token count:
- Cap the result at 280.
336 × 226 image is upscaled during preprocessing to 960 × 624, which uses 260 image tokens.
| Input resolution | Processed resolution | Image tokens used |
|---|---|---|
| 336 × 226 | 960 × 624 | 260 |
| 512 × 512 | 768 × 768 | 256 |
| 672 × 672 | 768 × 768 | 256 |
| 1024 × 1024 | 768 × 768 | 256 |
| 1280 × 720 | 1056 × 576 | 264 |
| 1920 × 1080 | 1056 × 576 | 264 |
| 2560 × 1440 | 1056 × 576 | 264 |
| 3840 × 2160 | 1056 × 576 | 264 |
| 336 × 480 | 672 × 960 | 280 |
| 480 × 336 | 960 × 672 | 280 |
usage.prompt_tokens values. Cerebras does not currently return a separate image_tokens field.
Keep the following in mind:
- 280 is the maximum image tokens per image for
gemma-4-31bon Cerebras. - Compressed file size does not directly determine token count. Processed image dimensions matter more than PNG or JPEG byte size.
- Image tokens are added to text prompt tokens and reported together in
prompt_tokensin the API response. - Image tokens occupy part of the model context window, just like text prompt tokens.
Limitations
- Medical images — not suitable for interpreting specialized medical images such as CT scans or MRIs. Do not use for medical diagnosis or advice.
- Small text — may have difficulty reading small or low-resolution text. Enlarging text within the image before sending can improve results.
- Rotated content — may misinterpret text or images that are rotated or upside-down.
- Graphs and charts — may struggle to distinguish visual elements that differ only in color or line style, such as solid versus dashed lines.
- Spatial reasoning — not reliable for tasks requiring precise spatial localization, such as identifying positions on a map or board game.
- Object counting — the model may give approximate counts for objects in images.
- Image shape — may perform less accurately on panoramic or fisheye images.
- Preprocessing — the model cannot access original filenames or metadata. Images may be resized before analysis — see Token Usage for details.
- Accuracy — the model may generate inaccurate descriptions or captions in some scenarios. Verify outputs for high-stakes use cases.
- CAPTCHAs — CAPTCHA images are not supported.
- Indirect prompt injection — text embedded in an image is included in the model’s prompt context alongside the user’s text. If an image contains adversarial instructions (for example, text that says “ignore all previous instructions”) and the user prompt asks the model to answer based on the image, the model may follow those embedded instructions. Treat image content from untrusted sources as untrusted input, and use a system prompt to constrain the model’s behavior when processing images you don’t control.
- Untrusted output — the model may transcribe or describe text from an image verbatim, including HTML, script tags, URLs, or control characters. The API returns this content unmodified. Treat it the same as any other untrusted input before rendering, logging, or executing it in your application.
FAQs
Do I need to resend the image on later turns?
Do I need to resend the image on later turns?
Yes. Cerebras Chat Completions is stateless. If a follow-up request depends on an earlier image, include that image-bearing turn in the conversation history you send with the new request. Continue to include that turn for as long as the model needs the visual context.
Can I generate images?
Can I generate images?
No, only image input is supported. The model returns text only and does not generate images.
Is prompt caching supported with image inputs?
Is prompt caching supported with image inputs?
Yes. Prompt caching can help with repeated images and repeated multimodal context within your organization. Prompt caches are never shared between organizations and remain ephemeral. See Prompt Caching.
Do rate limits change with image support?
Do rate limits change with image support?
No. Image support uses the same rate limit framework as text. The same request and token limits still apply based on your organization and tier. For current details, see Rate Limits.
Do you store image data?
Do you store image data?
Image inputs are processed as soon as they are received, and the original image payloads are not persisted. After preprocessing, image tokens and image embeddings may be cached ephemerally within your organization to support prompt caching.

