Google recently unveiled DiffusionGemma, an experimental open model that speeds up text generation by using diffusion instead of the token-by-token decoding found in most large language models. Rather than predicting one word at a time, it generates entire blocks of text at once, with output speeds up to four times faster on dedicated GPUs.
DiffusionGemma can also self-correct and format complex Markdown in real time. The model is a Mixture of Experts system with 26 billion total parameters, of which only 3.8 billion are active during inference. When optimized, it can run on high-end consumer GPUs with as little as 18GB of VRAM.
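To put the parameter counts in perspective, the following is a rough back-of-the-envelope sketch rather than an official sizing guide. It uses only the two parameter figures quoted above; the bit widths are illustrative assumptions, since the kind of optimization behind the 18GB figure is not specified.

```python
# Back-of-the-envelope arithmetic for the figures quoted above.
TOTAL_PARAMS = 26e9    # total parameters (from the article)
ACTIVE_PARAMS = 3.8e9  # parameters active during inference (from the article)


def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"Active fraction: {active_fraction:.1%}")

# The bit widths below are illustrative assumptions, not published settings.
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(TOTAL_PARAMS, bits):.0f} GB")
```

In a typical Mixture of Experts setup, all experts have to stay in memory even though only a fraction is used per step, so the total parameter count drives VRAM use while the active count drives compute. The estimate covers weights only and ignores activations and other runtime overhead.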
DiffusionGemma can draft up to 256 tokens in parallel and then refine them over multiple passes, evaluating each block as a whole. This gives local AI workflows a significant speedup: the model reportedly generates more than 1,000 tokens per second on a single NVIDIA H100 and more than 700 tokens per second on an NVIDIA GeForce RTX 5090.
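The idea of drafting a block and filling it in over several passes can be illustrated with a toy loop. This is a minimal sketch of confidence-based iterative unmasking, a decoding scheme used by some masked diffusion language models. It is not DiffusionGemma's actual algorithm, which is not described here. The `toy_denoiser` function is a hypothetical stand-in for the model, and unlike the real model this toy never revises a token once it has been committed.

```python
import random

MASK = None  # placeholder for a position that has not been committed yet


def toy_denoiser(block, target, rng):
    """Hypothetical stand-in for the model: propose a token and a confidence
    score for every position. A real model would predict from context."""
    return [(target[i], rng.random()) for i in range(len(block))]


def generate_block(target, num_passes=4, seed=0):
    """Fill a fully masked block over several passes, committing the most
    confident proposals first. Returns the block and the passes used."""
    rng = random.Random(seed)
    block = [MASK] * len(target)
    per_pass = -(-len(target) // num_passes)  # ceiling division
    passes = 0
    while MASK in block:
        proposals = toy_denoiser(block, target, rng)
        masked = [i for i, tok in enumerate(block) if tok is MASK]
        # Commit the highest-confidence masked positions in this pass.
        masked.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in masked[:per_pass]:
            block[i] = proposals[i][0]
        passes += 1
    return block, passes


tokens = "the quick brown fox jumps over the lazy dog".split()
block, passes = generate_block(tokens, num_passes=3)
print(" ".join(block), f"({passes} passes for {len(tokens)} tokens)")
```

The point of the sketch is the cost model: the number of model calls depends on the number of passes rather than on the number of tokens in the block, which is where the speedup over one-token-at-a-time decoding comes from.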
Google is targeting researchers and developers, and expects DiffusionGemma to be most useful for tools that depend on low latency, such as inline editing, code infilling, rapid iteration, and non-linear text generation.
The model's bidirectional attention also lets every token in a block attend to every other token in that block, which could prove advantageous in specialized tasks such as creating mathematical graphs, analyzing amino acid sequences, and structured editing.
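The difference between bidirectional attention and the causal attention of an autoregressive model can be shown with two small masks. This is a generic illustration with a hypothetical block size of four positions. It says nothing about how DiffusionGemma handles attention across blocks, which is not covered here.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j.
    In a causal mask, a position sees only itself and earlier positions."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every position in the block may attend to every other position."""
    return [[True] * n for _ in range(n)]


def show(mask):
    """Render a mask as rows of 'x' (allowed) and '.' (blocked)."""
    return "\n".join(" ".join("x" if ok else "." for ok in row) for row in mask)


n = 4  # illustrative block size, not a model setting
print("causal:\n" + show(causal_mask(n)))
print("bidirectional:\n" + show(bidirectional_mask(n)))
```

With the causal mask, the first position cannot see anything that follows it. With the bidirectional mask, a token at the start of a block can be chosen in light of tokens at the end, which is what makes infilling and structured editing a natural fit.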
Despite these advantages, Google emphasizes that DiffusionGemma remains an experimental model. For use cases that require the highest output quality, the traditional Gemma 4 models may be more suitable. DiffusionGemma is aimed at developers building interactive local AI systems where speed is the priority.
DiffusionGemma's speed advantage is greatest in local, low-concurrency scenarios. Traditional autoregressive models may be more efficient in high-volume cloud deployments, where requests can be managed at scale. DiffusionGemma is available via Hugging Face, with support for a variety of tools and frameworks that gives developers several integration options. Official llama.cpp support is expected soon.