Get Started
Build with Cerebras Inference
The Cerebras Inference API offers developers a low-latency solution for AI model inference, powered by Cerebras Wafer-Scale Engines and CS-3 systems. We invite developers to explore the new possibilities that our high-speed inference solution unlocks.
To get started with a free API key, click here.
The Cerebras Inference API currently provides access to the following models:
Model Name | Model ID | Parameters | Speed (tokens/s) |
---|---|---|---|
Llama 4 Scout | llama-4-scout-17b-16e-instruct | 109 billion | ~2600 |
Llama 3.1 8B | llama3.1-8b | 8 billion | ~2200 |
Llama 3.3 70B | llama-3.3-70b | 70 billion | ~2100 |
Qwen 3 32B | qwen-3-32b | 32 billion | ~2100 |
DeepSeek R1 Distill Llama 70B* | deepseek-r1-distill-llama-70b | 70 billion | ~1700 |
* DeepSeek R1 Distill Llama 70B is available in private preview. Please contact us to request access.
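As a quick orientation, the model IDs in the table above are the values you pass as the `model` parameter in a chat completion request. The sketch below builds and sends such a request with only the Python standard library; the endpoint URL, environment variable name, and OpenAI-compatible request shape are assumptions here — confirm the exact details in the API reference documentation.

```python
import json
import os
import urllib.request

# Assumed endpoint; verify the exact URL in the API reference docs.
API_URL = "https://api.cerebras.ai/v1/chat/completions"


def build_chat_request(model, user_message):
    """Build the JSON payload for a chat completion call.

    `model` is one of the Model IDs from the table, e.g. "llama3.1-8b".
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }


def chat(model, user_message, api_key):
    """Send a chat completion request and return the parsed JSON response."""
    payload = build_chat_request(model, user_message)
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # CEREBRAS_API_KEY is a hypothetical variable name for this sketch.
    key = os.environ.get("CEREBRAS_API_KEY")
    if key:
        reply = chat("llama3.1-8b", "Say hello in one sentence.", key)
        print(reply["choices"][0]["message"]["content"])
```

To try a different model, swap the Model ID string, e.g. `"llama-3.3-70b"`; the rest of the request is unchanged.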
- Play with our live chatbot demo.
- For information on pricing and context length, visit our pricing page.
- Experiment with our inference solution in the playground before making an API call.
- Explore our API reference documentation.