Change Log
- Deprecation notice: The `llama3.1-70b` model will be automatically upgraded to `llama-3.3-70b`. Any existing references to `llama3.1-70b` in your code will continue to work during a short-term aliasing period. However, we strongly encourage you to update your references to the `llama-3.3-70b` model as soon as possible, since the aliasing will not be maintained indefinitely.
Support for Llama 3.3 70B
- We now support `llama-3.3-70b`, Meta’s newly released model that delivers enhanced performance across popular benchmarks for use cases including chat, coding, instruction following, mathematics, and reasoning. We serve this model at a speed of 2,100+ tokens per second.
- Support for the `completions` endpoint.
- Performance Upgrade: This release introduces speculative decoding, a technique that uses both a small model and a large model together to generate responses more quickly. Llama 3.1 70B now achieves an average output speed of 2,100 tokens per second. Please note that with speculative decoding, output speeds may fluctuate by up to 20% compared to the average.
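The draft-and-verify idea behind speculative decoding can be sketched in a few lines. In the toy sketch below, `draft_next` and `target_next` are hypothetical stand-ins for the small draft model and the large target model (real systems sample from logits and accept probabilistically); it illustrates the control flow, not Cerebras’ implementation:

```python
# Toy sketch of speculative decoding with greedy verification.
# `draft_next`/`target_next` are hypothetical stand-ins for the small
# draft model and the large target model.

def speculative_decode(prompt, draft_next, target_next, n_tokens, k=4):
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        # 1) The small draft model cheaply proposes k tokens.
        draft, ctx = [], list(out)
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) The large model verifies the proposals: accept the longest
        #    agreeing prefix, then emit one corrected token on mismatch.
        ctx = list(out)
        for t in draft:
            expected = target_next(ctx)
            if t == expected:
                ctx.append(t)          # proposal accepted
            else:
                ctx.append(expected)   # disagreement: take target's token
                break
        out = ctx
    return out[len(prompt):][:n_tokens]

target = lambda ctx: "ab"[len(ctx) % 2]   # "large" model alternates a, b
draft = lambda ctx: "a"                   # "small" model always guesses a
print("".join(speculative_decode([], draft, target, 4)))  # abab
```

Because the large model checks several drafted tokens per verification pass, average throughput rises when the draft model is usually right, which is why observed speeds fluctuate around the average.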
Continued Performance Improvements
We currently serve Llama-3.1-8B at ~2,000 tokens/sec and Llama-3.1-70B at ~560 tokens/sec.
Integration with AutoGen
Developers can now use the Cerebras Inference API with Microsoft AutoGen, an open-source framework for building AI agents. AutoGen streamlines the creation of advanced LLM applications by managing multi-agent conversations and optimizing workflows. With this integration, users can leverage features like tool use and parallel tool calling, while benefiting from Cerebras’ fast inference with Llama 3.1 8B and 70B models. For an example, see the AutoGen integration page in our documentation.
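AutoGen accepts OpenAI-compatible endpoints through a `config_list`. A minimal sketch of such a configuration pointing at the Cerebras API is shown below; the base URL and environment-variable name are assumptions, so check the integration docs for the exact values:

```python
import os

# Sketch of an AutoGen-style LLM config pointing at the Cerebras API.
# The base URL and env-var name are assumptions -- verify them against
# the Cerebras/AutoGen integration documentation.
config_list = [
    {
        "model": "llama3.1-8b",  # or "llama3.1-70b"
        "api_key": os.environ.get("CEREBRAS_API_KEY", "<your-key>"),
        "base_url": "https://api.cerebras.ai/v1",  # assumed endpoint
    }
]

# This list would then be passed to an AutoGen agent, e.g.:
#   assistant = autogen.AssistantAgent(
#       "assistant", llm_config={"config_list": config_list})
```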
Other Updates:
- Users can now sign in to the developer playground using a magic link, without needing to set up and remember a password.
- The `max_tokens` parameter has been renamed to `max_completion_tokens` to maintain consistency with OpenAI’s syntax.
- We have updated our documentation to include a list of our available integrations for the Cerebras Inference SDK.
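The `max_tokens` rename is a drop-in change on the request side. A minimal sketch, assuming an OpenAI-style request payload (the helper below is illustrative, not part of the SDK):

```python
# Illustrative helper: move a legacy `max_tokens` field to the new
# `max_completion_tokens` name. Not part of the Cerebras SDK.

def rename_max_tokens(payload: dict) -> dict:
    """Return a copy of the payload using the new parameter name."""
    updated = dict(payload)
    if "max_tokens" in updated and "max_completion_tokens" not in updated:
        updated["max_completion_tokens"] = updated.pop("max_tokens")
    return updated

old = {"model": "llama3.1-8b", "max_tokens": 256}
new = rename_max_tokens(old)
print(new)  # {'model': 'llama3.1-8b', 'max_completion_tokens': 256}
```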