# Get Started

## Overview
The Cerebras Inference API offers developers a low-latency solution for AI model inference, powered by Cerebras Wafer-Scale Engines and CS-3 systems. We invite you to explore the new possibilities that our high-speed inference solution unlocks.
The Cerebras Inference API currently provides access to the following models:
| Model Name | Model ID | Parameters | Speed (tokens/s) |
|---|---|---|---|
| Llama 4 Scout | `llama-4-scout-17b-16e-instruct` | 109 billion | ~2600 |
| Llama 3.1 8B | `llama3.1-8b` | 8 billion | ~2200 |
| Llama 3.3 70B | `llama-3.3-70b` | 70 billion | ~2100 |
| Qwen 3 32B\* | `qwen-3-32b` | 32 billion | ~2100 |
| DeepSeek R1 Distill Llama 70B\* | `deepseek-r1-distill-llama-70b` | 70 billion | ~1700 |
\* Qwen 3 32B is a hybrid reasoning model that can operate in two modes: with or without thinking tokens. Currently, we only support the default reasoning mode. However, on queries where you do not want reasoning, you can suggest that the model skip it by passing `/no_think` in the prompt. For example: "Write a python script to calculate the area of a circle /no_think" (see the second code sketch below).
\* DeepSeek R1 Distill Llama 70B is available in private preview. Please contact us to request access.
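To make a first request against one of the models above, here is a minimal sketch of a chat completion call. It assumes the official Cerebras Python SDK (`pip install cerebras_cloud_sdk`) and an API key stored in a `CEREBRAS_API_KEY` environment variable; swap in any model ID from the table.

```python
import os

from cerebras.cloud.sdk import Cerebras  # assumes the official Cerebras Python SDK

# The SDK can read CEREBRAS_API_KEY from the environment on its own;
# passing it explicitly here just makes the assumption visible.
client = Cerebras(api_key=os.environ.get("CEREBRAS_API_KEY"))

# Request a chat completion from one of the models in the table above.
response = client.chat.completions.create(
    model="llama3.1-8b",
    messages=[
        {"role": "user", "content": "Why is fast inference important?"},
    ],
)

print(response.choices[0].message.content)
```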
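And a companion sketch of the `/no_think` hint for Qwen 3 32B described in the footnote above, under the same SDK and API-key assumptions:

```python
import os

from cerebras.cloud.sdk import Cerebras  # same SDK assumption as above

client = Cerebras(api_key=os.environ.get("CEREBRAS_API_KEY"))

# Appending /no_think suggests (but does not guarantee) that qwen-3-32b
# skips its reasoning tokens for this query.
response = client.chat.completions.create(
    model="qwen-3-32b",
    messages=[
        {
            "role": "user",
            "content": "Write a python script to calculate the area of a circle /no_think",
        },
    ],
)

print(response.choices[0].message.content)
```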
- Play with our live chatbot demo.
- For information on pricing and context length, visit our pricing page.
- Experiment with our inference solution in the playground before making an API call.
- Explore our API reference documentation.