# Payload Optimization

> Reduce latency by compressing request payloads with msgpack encoding and gzip.

For requests with large payloads, you can reduce the size of the request body sent to the Cerebras API by using **[msgpack](https://msgpack.org/) encoding**, **[gzip](https://docs.python.org/3/library/gzip.html) compression**, or both. This can meaningfully improve time-to-first-token (TTFT) for requests with long prompts.

<Note>Payload optimization is supported on [/v1/chat/completions](/api-reference/chat-completions) and [/v1/completions](/api-reference/completions). Support on [Dedicated Endpoints](/dedicated/overview) may vary by model.</Note>

## Encoding Options

The Cerebras API accepts the following request body encodings. The table shows the payload size reduction measured for each option, relative to the default JSON encoding, on two benchmark payloads:

| Content-Type                                         | Description                | Chat Completions<sup>1</sup> | Completions<sup>2</sup> |
| ---------------------------------------------------- | -------------------------- | ---------------------------- | ----------------------- |
| `application/json`                                   | Default JSON encoding      | Baseline                     | Baseline                |
| `application/vnd.msgpack`                            | msgpack binary encoding    | up to \~5%                   | up to \~56%             |
| `application/json` + `Content-Encoding: gzip`        | JSON with gzip compression | up to \~98%                  | up to \~68%             |
| `application/vnd.msgpack` + `Content-Encoding: gzip` | msgpack + gzip             | up to \~98%                  | up to \~69%             |

<span style={{fontSize: "0.85em", color: "gray"}}><sup>1</sup> Measured against a 50k-token chat completions payload (206 KB JSON baseline).</span>

<br />

<span style={{fontSize: "0.85em", color: "gray"}}><sup>2</sup> Measured against a 50k token-ID completions payload (331 KB JSON baseline).</span>

You can use msgpack encoding or gzip compression independently, or combine them for maximum compression.

Smaller request payloads reduce network transfer time, a contributing factor to TTFT. Actual TTFT improvement will vary, as network transfer is one of several factors that contribute to overall latency.
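To see how these figures translate to your own traffic, you can measure a payload locally before changing any client code. The following is a minimal sketch that uses the Python standard library; `encoded_sizes` and the sample payload are hypothetical names chosen for illustration, and the msgpack rows are reported only if the third-party `msgpack` package is installed.

```python
import gzip
import json


def encoded_sizes(payload: dict, compresslevel: int = 5) -> dict:
    """Return the request-body size in bytes for each available encoding."""
    json_bytes = json.dumps(payload).encode("utf-8")
    sizes = {
        "json": len(json_bytes),
        "json+gzip": len(gzip.compress(json_bytes, compresslevel=compresslevel)),
    }
    try:
        import msgpack  # optional third-party dependency
    except ImportError:
        return sizes
    packed = msgpack.packb(payload)
    sizes["msgpack"] = len(packed)
    sizes["msgpack+gzip"] = len(gzip.compress(packed, compresslevel=compresslevel))
    return sizes


# Hypothetical long prompt; repeated text compresses unusually well,
# so real prompts will usually show smaller reductions.
sample = {
    "model": "gpt-oss-120b",
    "messages": [{"role": "user", "content": "Summarize this log line. " * 2000}],
}

sizes = encoded_sizes(sample)
baseline = sizes["json"]
for name, size in sizes.items():
    print(f"{name:13s} {size:8d} bytes  ({1 - size / baseline:.1%} smaller)")
```

Running this against a representative production payload, rather than the synthetic sample, gives a more realistic picture of the savings to expect.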

## When to Use Payload Optimization

Payload optimization is most beneficial for:

* **Long prompts** – requests with long system prompts, extensive conversation history, or large code blocks. Gzip compression is the most effective option for these payloads.
* **Token-ID completions** – `/v1/completions` payloads using integer token arrays benefit from both msgpack encoding and gzip compression.
* **Tool call-heavy payloads** – requests with many tool definitions or deeply nested JSON structures. Both msgpack encoding and gzip compression provide savings.

For small requests (under a few KB), the overhead of compression may outweigh the savings. Standard JSON encoding is fine for typical chat interactions.
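One way to apply this guidance is to compress only when the serialized body exceeds a size cutoff. The sketch below assumes a hypothetical 4 KB threshold (`MIN_GZIP_BYTES`) and a hypothetical helper named `prepare_json_body`; neither comes from the Cerebras API, so tune the cutoff against your own measurements.

```python
import gzip
import json

# Hypothetical cutoff; the guidance above only says "under a few KB".
MIN_GZIP_BYTES = 4096


def prepare_json_body(payload: dict, min_gzip_bytes: int = MIN_GZIP_BYTES):
    """Serialize to JSON and gzip only when the body is large enough to benefit."""
    body = json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if len(body) >= min_gzip_bytes:
        body = gzip.compress(body, compresslevel=5)
        headers["Content-Encoding"] = "gzip"
    return body, headers


small = {"model": "gpt-oss-120b", "messages": [{"role": "user", "content": "Hi"}]}
large = {"model": "gpt-oss-120b", "messages": [{"role": "user", "content": "Hi " * 5000}]}

small_body, small_headers = prepare_json_body(small)  # sent as plain JSON
large_body, large_headers = prepare_json_body(large)  # sent gzip-compressed
```

The returned body and headers can be passed to your HTTP client together with the `Authorization` header.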

<Info>
  Size reductions depend on payload content. msgpack's integer encoding provides the greatest benefit for token-ID completions payloads, as the Completions benchmark above shows. String-heavy chat payloads see smaller msgpack reductions, while payloads with deeply nested structures (e.g., tool calls) may see greater savings. Gzip benefits are more consistent across payload types.
</Info>
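The integer-encoding advantage can be estimated without installing anything. The sketch below applies the integer and array sizes defined in the msgpack format specification to a list of token IDs and compares the result with compact JSON. The function names and the token IDs are hypothetical; real IDs depend on the model's tokenizer, and the ratio changes with their magnitude.

```python
import json


def msgpack_uint_size(n: int) -> int:
    """Bytes msgpack uses for a non-negative integer (smallest format that fits)."""
    if n < 0:
        raise ValueError("this sketch covers non-negative integers only")
    if n <= 0x7F:
        return 1  # positive fixint
    if n <= 0xFF:
        return 2  # uint 8: type byte + 1 byte
    if n <= 0xFFFF:
        return 3  # uint 16: type byte + 2 bytes
    if n <= 0xFFFFFFFF:
        return 5  # uint 32: type byte + 4 bytes
    return 9      # uint 64: type byte + 8 bytes


def msgpack_array_size(ids: list) -> int:
    """Bytes for a msgpack array of non-negative integers (header + elements)."""
    count = len(ids)
    header = 1 if count <= 15 else 3 if count <= 0xFFFF else 5
    return header + sum(msgpack_uint_size(n) for n in ids)


# Hypothetical token IDs, six decimal digits each.
token_ids = [100_000 + i for i in range(50_000)]

json_size = len(json.dumps(token_ids, separators=(",", ":")))
packed_size = msgpack_array_size(token_ids)
print(f"JSON: {json_size} bytes, msgpack: {packed_size} bytes "
      f"({1 - packed_size / json_size:.1%} smaller)")
```

Each ID here costs six digits plus a separator in JSON but a fixed five bytes in msgpack. IDs below 65,536 need only three bytes, which is why smaller vocabularies benefit more.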

## msgpack Encoding

[msgpack](https://msgpack.org/) is a binary serialization format that produces smaller payloads than equivalent JSON. To use it, serialize your request body with msgpack and set the `Content-Type` header to `application/vnd.msgpack`.

<CodeGroup>
  ```python Python theme={null}
  import msgpack
  import requests
  import os

  url = "https://api.cerebras.ai/v1/chat/completions"
  api_key = os.environ.get("CEREBRAS_API_KEY")

  payload = {
      "model": "gpt-oss-120b",
      "messages": [
          {"role": "user", "content": "Explain quantum computing in one paragraph."}
      ]
  }

  response = requests.post(
      url,
      data=msgpack.packb(payload),
      headers={
          "Content-Type": "application/vnd.msgpack",
          "Authorization": f"Bearer {api_key}"
      }
  )

  # Surface HTTP errors before reading the response body
  response.raise_for_status()
  print(response.json()["choices"][0]["message"]["content"])
  ```

  ```javascript Node.js theme={null}
  // ES module syntax: the top-level await below requires a .mjs file or "type": "module"
  import { encode } from "@msgpack/msgpack";

  const apiKey = process.env.CEREBRAS_API_KEY;

  const payload = {
    model: "gpt-oss-120b",
    messages: [
      { role: "user", content: "Explain quantum computing in one paragraph." }
    ]
  };

  const response = await fetch("https://api.cerebras.ai/v1/chat/completions", {
    method: "POST",
    body: encode(payload),
    headers: {
      "Content-Type": "application/vnd.msgpack",
      "Authorization": `Bearer ${apiKey}`
    }
  });

  const data = await response.json();
  console.log(data.choices[0].message.content);
  ```

  ```bash cURL theme={null}
  # Create a msgpack-encoded payload
  python3 -c "
  import msgpack, sys
  payload = {
      'model': 'gpt-oss-120b',
      'messages': [{'role': 'user', 'content': 'Explain quantum computing in one paragraph.'}]
  }
  sys.stdout.buffer.write(msgpack.packb(payload))
  " > /tmp/request.msgpack

  curl https://api.cerebras.ai/v1/chat/completions \
    -H "Authorization: Bearer $CEREBRAS_API_KEY" \
    -H "Content-Type: application/vnd.msgpack" \
    --data-binary @/tmp/request.msgpack
  ```
</CodeGroup>

## Gzip Compression

To use gzip, compress the serialized request body and set the `Content-Encoding: gzip` header. Keep `Content-Type` set to the format of the uncompressed body. This works with both JSON and msgpack payloads.

<CodeGroup>
  ```python Python theme={null}
  import gzip
  import json
  import requests
  import os

  url = "https://api.cerebras.ai/v1/chat/completions"
  api_key = os.environ.get("CEREBRAS_API_KEY")

  payload = {
      "model": "gpt-oss-120b",
      "messages": [
          {"role": "user", "content": "Explain quantum computing in one paragraph."}
      ]
  }

  json_bytes = json.dumps(payload).encode("utf-8")
  compressed = gzip.compress(json_bytes, compresslevel=5)

  response = requests.post(
      url,
      data=compressed,
      headers={
          "Content-Type": "application/json",
          "Content-Encoding": "gzip",
          "Authorization": f"Bearer {api_key}"
      }
  )

  # Surface HTTP errors before reading the response body
  response.raise_for_status()
  print(response.json()["choices"][0]["message"]["content"])
  ```

  ```javascript Node.js theme={null}
  // ES module syntax: the top-level await below requires a .mjs file or "type": "module"
  import { gzipSync } from "node:zlib";

  const apiKey = process.env.CEREBRAS_API_KEY;

  const payload = {
    model: "gpt-oss-120b",
    messages: [
      { role: "user", content: "Explain quantum computing in one paragraph." }
    ]
  };

  const jsonBytes = Buffer.from(JSON.stringify(payload));
  const compressed = gzipSync(jsonBytes, { level: 5 });

  const response = await fetch("https://api.cerebras.ai/v1/chat/completions", {
    method: "POST",
    body: compressed,
    headers: {
      "Content-Type": "application/json",
      "Content-Encoding": "gzip",
      "Authorization": `Bearer ${apiKey}`
    }
  });

  const data = await response.json();
  console.log(data.choices[0].message.content);
  ```

  ```bash cURL theme={null}
  # Compress the JSON payload with gzip
  echo '{"model": "gpt-oss-120b", "messages": [{"role": "user", "content": "Explain quantum computing in one paragraph."}]}' \
    | gzip > /tmp/request.json.gz

  curl https://api.cerebras.ai/v1/chat/completions \
    -H "Authorization: Bearer $CEREBRAS_API_KEY" \
    -H "Content-Type: application/json" \
    -H "Content-Encoding: gzip" \
    --data-binary @/tmp/request.json.gz
  ```
</CodeGroup>
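The examples above use compression level 5. To check whether a different level suits your payloads, you can compare output size and compression time locally. This is a minimal sketch using only the standard library; `compare_gzip_levels` and the sample body are hypothetical, and timings will vary by machine.

```python
import gzip
import json
import time


def compare_gzip_levels(body: bytes, levels=(1, 5, 9)) -> dict:
    """Return {level: (compressed_size_bytes, seconds)} for each gzip level."""
    results = {}
    for level in levels:
        start = time.perf_counter()
        compressed = gzip.compress(body, compresslevel=level)
        elapsed = time.perf_counter() - start
        # Round-trip check: decompression must recover the exact bytes.
        assert gzip.decompress(compressed) == body
        results[level] = (len(compressed), elapsed)
    return results


# Hypothetical request body with a long, repetitive prompt.
body = json.dumps({
    "model": "gpt-oss-120b",
    "messages": [{"role": "user", "content": "def handler(event): return event\n" * 3000}],
}).encode("utf-8")

results = compare_gzip_levels(body)
for level, (size, seconds) in results.items():
    print(f"level {level}: {size} bytes in {seconds * 1000:.2f} ms")
```

Higher levels generally trade more CPU time for a smaller body, so the best choice depends on whether the client is bound by network or by CPU.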

## Combining Both

For maximum compression, use msgpack encoding with gzip compression:

<CodeGroup>
  ```python Python theme={null}
  import gzip
  import msgpack
  import requests
  import os

  url = "https://api.cerebras.ai/v1/chat/completions"
  api_key = os.environ.get("CEREBRAS_API_KEY")

  payload = {
      "model": "gpt-oss-120b",
      "messages": [
          {"role": "user", "content": "Explain quantum computing in one paragraph."}
      ]
  }

  data = msgpack.packb(payload)
  compressed = gzip.compress(data, compresslevel=5)

  response = requests.post(
      url,
      data=compressed,
      headers={
          "Content-Type": "application/vnd.msgpack",
          "Content-Encoding": "gzip",
          "Authorization": f"Bearer {api_key}"
      }
  )

  # Surface HTTP errors before reading the response body
  response.raise_for_status()
  print(response.json()["choices"][0]["message"]["content"])
  ```

  ```javascript Node.js theme={null}
  // ES module syntax: the top-level await below requires a .mjs file or "type": "module"
  import { encode } from "@msgpack/msgpack";
  import { gzipSync } from "node:zlib";

  const apiKey = process.env.CEREBRAS_API_KEY;

  const payload = {
    model: "gpt-oss-120b",
    messages: [
      { role: "user", content: "Explain quantum computing in one paragraph." }
    ]
  };

  const msgpackData = Buffer.from(encode(payload));
  const compressed = gzipSync(msgpackData, { level: 5 });

  const response = await fetch("https://api.cerebras.ai/v1/chat/completions", {
    method: "POST",
    body: compressed,
    headers: {
      "Content-Type": "application/vnd.msgpack",
      "Content-Encoding": "gzip",
      "Authorization": `Bearer ${apiKey}`
    }
  });

  const data = await response.json();
  console.log(data.choices[0].message.content);
  ```

  ```bash cURL theme={null}
  # Create a msgpack+gzip payload (requires Python)
  python3 -c "
  import msgpack, gzip, sys
  payload = {
      'model': 'gpt-oss-120b',
      'messages': [{'role': 'user', 'content': 'Explain quantum computing in one paragraph.'}]
  }
  sys.stdout.buffer.write(gzip.compress(msgpack.packb(payload), compresslevel=5))
  " > /tmp/request.msgpack.gz

  curl https://api.cerebras.ai/v1/chat/completions \
    -H "Authorization: Bearer $CEREBRAS_API_KEY" \
    -H "Content-Type: application/vnd.msgpack" \
    -H "Content-Encoding: gzip" \
    --data-binary @/tmp/request.msgpack.gz
  ```
</CodeGroup>
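When combining the two, order matters: the body is serialized first and compressed second, so `Content-Type` describes the inner format and `Content-Encoding` describes the outer wrapper. The sketch below reverses those steps locally so you can sanity-check a body before sending it. `decode_body` is a hypothetical helper, not part of any Cerebras SDK, and it makes no claim about how the server is implemented; the msgpack branch assumes the third-party `msgpack` package is installed.

```python
import gzip
import json


def decode_body(body: bytes, headers: dict) -> dict:
    """Undo Content-Encoding first, then parse according to Content-Type."""
    if headers.get("Content-Encoding") == "gzip":
        body = gzip.decompress(body)
    content_type = headers.get("Content-Type", "application/json")
    if content_type == "application/vnd.msgpack":
        import msgpack  # third-party; only needed for msgpack bodies
        return msgpack.unpackb(body)
    if content_type == "application/json":
        return json.loads(body)
    raise ValueError(f"unsupported Content-Type: {content_type}")


payload = {
    "model": "gpt-oss-120b",
    "messages": [{"role": "user", "content": "Explain quantum computing in one paragraph."}],
}

# JSON + gzip round trip using only the standard library.
headers = {"Content-Type": "application/json", "Content-Encoding": "gzip"}
body = gzip.compress(json.dumps(payload).encode("utf-8"), compresslevel=5)
roundtrip = decode_body(body, headers)
print(roundtrip == payload)
```

If the round trip does not return the original payload, the headers and the body disagree, for example a gzip-compressed body sent without `Content-Encoding: gzip`.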