Release Notes
- Support for the `completions` endpoint (a brief usage sketch follows this list).
- Various bug fixes.
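As a quick illustration, here is a minimal sketch of calling the completions endpoint through an OpenAI-compatible client. The base URL and model name shown are assumptions to verify against the API reference.

```python
# Minimal sketch: calling the completions endpoint via an OpenAI-compatible
# client. Base URL and model name are assumptions; check the Cerebras docs.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.cerebras.ai/v1",  # assumed Cerebras endpoint
    api_key=os.environ["CEREBRAS_API_KEY"],
)

response = client.completions.create(
    model="llama3.1-8b",       # assumed model identifier
    prompt="Once upon a time,",
    max_tokens=64,
)
print(response.choices[0].text)
```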
- Performance Upgrade: This release introduces speculative decoding, a technique that pairs a small draft model with a large target model to generate responses more quickly (a simplified sketch follows this list). Llama 3.1 70B now achieves an average output speed of 2,100 tokens/sec. Note that with speculative decoding, output speeds may fluctuate by up to 20% around this average.
- Various bug fixes.
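To make the technique concrete, here is a simplified sketch of the draft-and-verify loop under greedy decoding. This is an illustration, not Cerebras' implementation: `draft_model` and `target_model` are hypothetical callables that each return one greedy next-token prediction, and a real system would verify all draft positions in a single batched forward pass rather than one call per position.

```python
def speculative_decode(prompt, draft_model, target_model, k=4, max_new=64):
    """Illustrative speculative decoding loop (greedy acceptance rule)."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        # 1. The cheap draft model proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_model(tokens + draft))
        # 2. The large target model predicts the token at each draft position
        #    (done here as k separate calls; real systems batch this step).
        verified = [target_model(tokens + draft[:i]) for i in range(k)]
        # 3. Accept the longest prefix on which draft and target agree.
        n_accepted = 0
        for d, v in zip(draft, verified):
            if d != v:
                break
            n_accepted += 1
        tokens += draft[:n_accepted]
        # 4. On a mismatch, keep the target model's corrected token, so every
        #    iteration makes progress even if no draft tokens were accepted.
        if n_accepted < k:
            tokens.append(verified[n_accepted])
    return tokens[:len(prompt) + max_new]
```

Because the accepted output is always what the target model would have produced on its own, the speedup depends on how often the draft model's guesses are accepted, which is why throughput varies from request to request.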
Continued Performance Improvements
We currently serve Llama 3.1 8B at ~2,000 tokens/sec and Llama 3.1 70B at ~560 tokens/sec.
Integration with AutoGen
Developers can now use the Cerebras Inference API with Microsoft AutoGen, an open-source framework for building AI agents. AutoGen streamlines the creation of advanced LLM applications by managing multi-agent conversations and optimizing workflows. With this integration, users can leverage features like tool use and parallel tool calling while benefiting from Cerebras' fast inference with the Llama 3.1 8B and 70B models. For an example, see the documentation.
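For illustration, below is a minimal sketch of pointing AutoGen at the Cerebras API through its OpenAI-compatible configuration. The config keys follow pyautogen conventions, and the model name and base URL are assumptions to check against the documentation.

```python
# Sketch: wiring an AutoGen agent to the Cerebras Inference API using an
# OpenAI-compatible configuration. Model name and base URL are assumptions.
import os

from autogen import ConversableAgent

llm_config = {
    "config_list": [
        {
            "model": "llama3.1-70b",                   # assumed model identifier
            "base_url": "https://api.cerebras.ai/v1",  # assumed endpoint
            "api_key": os.environ["CEREBRAS_API_KEY"],
        }
    ]
}

agent = ConversableAgent(name="assistant", llm_config=llm_config)
reply = agent.generate_reply(
    messages=[{"role": "user", "content": "Briefly explain speculative decoding."}]
)
print(reply)
```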
Other Updates:
- Users can now sign in to the developer playground using a magic link, without needing to set up and remember a password.
- The `max_tokens` parameter has been renamed to `max_completion_tokens` to maintain consistency with OpenAI's syntax (an example follows this list).
- We have updated our documentation to include a list of the available integrations for the Cerebras Inference SDK.
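For reference, a request using the renamed parameter might look like the sketch below; the SDK import path and model name are assumptions to verify against the SDK documentation.

```python
# Sketch: the renamed parameter in a chat completion request. The model name
# is an assumption; this field was previously called `max_tokens`.
from cerebras.cloud.sdk import Cerebras

client = Cerebras()  # assumed to read CEREBRAS_API_KEY from the environment

chat = client.chat.completions.create(
    model="llama3.1-8b",        # assumed model identifier
    messages=[{"role": "user", "content": "Hello!"}],
    max_completion_tokens=100,  # formerly max_tokens
)
print(chat.choices[0].message.content)
```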