Google Unveils TurboQuant, a New AI Memory Compression Algorithm
Researchers described the technology as a novel way to shrink AI’s working memory without impacting performance.
Google LLC has introduced TurboQuant, a method to improve the efficiency of AI models by reducing their memory usage while maintaining performance. TurboQuant remains a research-stage result and has not yet been widely deployed.
Google researchers Amir Zandieh and Vahab Mirrokni described the technology as a novel way to shrink AI’s working memory without degrading performance. Reducing that data footprint can significantly improve speed and lower operational costs, but existing compression techniques often introduce errors that degrade output quality.
The algorithm attempts to address both limitations simultaneously—achieving higher compression rates with fewer inaccuracies. It does so by modifying how data is represented mathematically within AI systems.
Modern AI models encode information as high-dimensional vectors: long lists of numbers, which can be treated as geometric objects, capable of representing complex data such as text or equations. These vectors are how AI models represent and process information.
These vectors can be “rotated” in abstract mathematical space. TurboQuant leverages this property through a process known as random preconditioning, which reorients vectors to make them easier to compress without losing critical information.
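The idea behind that rotation step can be illustrated in a few lines. The following is a minimal sketch of random preconditioning in general, not Google's exact method: multiplying a vector by a random orthogonal matrix "rotates" it so that its energy is spread evenly across coordinates, shrinking the dynamic range a quantizer has to cover. The dimensions and vectors here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64

# A "spiky" vector: one coordinate dominates, which is hard to
# quantize well with a small number of bits.
x = np.zeros(d)
x[0] = 10.0

# Random orthogonal matrix via QR decomposition of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))

x_rot = Q @ x  # the randomly rotated vector

# The rotation preserves the vector's length exactly...
print(np.allclose(np.linalg.norm(x), np.linalg.norm(x_rot)))  # True

# ...but spreads its energy out, so the largest coordinate (and hence
# the range the quantizer must cover) shrinks dramatically.
print(np.abs(x).max(), np.abs(x_rot).max())
```

Because the rotation is length-preserving, it loses no information; it only changes the coordinates into a form that quantizes more gracefully.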
Following this transformation, the system applies a quantization step to reduce the data size. To correct residual errors introduced during compression, TurboQuant uses an additional method called QJL, which draws on the Johnson-Lindenstrauss Transform. This technique preserves relationships between data points while reducing dimensional complexity, effectively maintaining accuracy with minimal memory overhead.
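A toy version of a Johnson-Lindenstrauss-style sign sketch shows the flavor of this step. This is inspired by, but not identical to, the QJL technique the researchers describe: project a vector with a shared random Gaussian matrix and keep only the sign of each projection, one bit per sketch dimension. The fraction of disagreeing sign bits between two sketches estimates the angle between the original vectors, so similarity is preserved at a fraction of the memory. All dimensions and vectors below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 128, 4096              # original dimension, sketch size in bits
S = rng.normal(size=(m, d))   # shared random Gaussian projection

def sketch(v):
    """1-bit sketch: the sign of each random projection of v."""
    return (S @ v) >= 0

a = rng.normal(size=d)
b = a + 0.3 * rng.normal(size=d)   # a vector similar to a

# Classic sign-random-projection identity: the probability that a
# single sign bit disagrees equals angle(a, b) / pi.
disagree = np.mean(sketch(a) != sketch(b))
angle_est = np.pi * disagree

true_angle = np.arccos(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(round(angle_est, 3), round(true_angle, 3))  # the two agree closely
```

Each vector's sketch costs `m` bits rather than `d` floating-point values, yet angles (and therefore inner products, given the norms) can still be recovered to good accuracy.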
In internal tests, Google applied TurboQuant to several open-source large language models. The results suggest that models can operate with as little as one-sixth of their typical memory requirements while also improving performance on certain long-context tasks.
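Back-of-the-envelope arithmetic shows why a roughly sixfold reduction matters for long contexts. The layer, head, and dimension counts below are generic illustrative assumptions, not the models Google tested; the point is that KV-cache memory grows linearly with context length, so ~6x compression lets the same RAM hold ~6x more context.

```python
# fp16 stores 16 bits per value; one-sixth of that is ~2.67 bits.
bits_fp16 = 16
compressed_bits = bits_fp16 / 6

def kv_cache_gb(tokens, layers=32, heads=32, head_dim=128, bits=16.0):
    """Approximate KV-cache size: 2 tensors (K and V) per token per layer."""
    values = 2 * tokens * layers * heads * head_dim
    return values * bits / 8 / 1e9  # bits -> bytes -> GB

baseline = kv_cache_gb(128_000)                        # fp16 cache
compressed = kv_cache_gb(128_000, bits=compressed_bits)
print(f"{baseline:.1f} GB -> {compressed:.1f} GB")     # 67.1 GB -> 11.2 GB
```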
TurboQuant could yield meaningful efficiency gains and systems that need far less memory during inference. It does not, however, promise to ease the AI-driven RAM shortages associated with training these models.


