TurboQuant: How Google Cut LLM Memory Usage 6x Without Losing Accuracy
Google's TurboQuant compresses LLM key-value caches to 3 bits with zero accuracy loss. No retraining, no fine-tuning. Accepted at ICLR 2026.

TL;DR: Google Research published TurboQuant -- an algorithm that shrinks LLM memory usage 6x with zero accuracy loss. No retraining, no fine-tuning. On benchmarks, it scores identically to the uncompressed version. The models aren't getting smaller. The infrastructure to run them just got dramatically smarter.
Why is your AI model burning 6x more memory than it needs to?
Here's something most people don't realize about running a large language model. The model itself -- its weights, its parameters, the thing you downloaded -- isn't the memory problem. The problem is what happens when you actually use it.
Every time an LLM generates a word, it keeps a running tab of every word that came before. That tab is called the key-value cache. For a short conversation, it's fine. For a long one -- the kind where you paste in a document and ask questions about it -- that cache can grow larger than the model itself.
Think of it like a restaurant kitchen. The recipe book (the model) fits on a shelf. But the prep area (the cache) keeps growing with every order. Eventually the kitchen runs out of counter space, and it doesn't matter how good the recipes are.
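To put a rough number on that counter space, here is a minimal back-of-envelope sketch of KV-cache sizing. The configuration below (80 layers, 8 key-value heads under grouped-query attention, head dimension 128) is an illustrative Llama-3.1-70B-style setup of my own, not a figure from the paper:

```python
def kv_cache_bytes(tokens, layers=80, kv_heads=8, head_dim=128,
                   bytes_per_value=2):
    """Rough KV-cache size: two tensors (key + value) per layer, per token.

    Defaults sketch a Llama-3.1-70B-style config with grouped-query
    attention; bytes_per_value=2 corresponds to fp16.
    """
    return tokens * layers * kv_heads * head_dim * 2 * bytes_per_value

# A 128K-token context in fp16 already needs tens of GiB of cache,
# on top of the model weights themselves.
cache_gib = kv_cache_bytes(128_000) / 2**30   # roughly 39 GiB
```

Cutting the bytes per value is the only knob in that formula that shrinks the cache without shortening the context -- which is exactly the knob quantization turns.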

This is the bottleneck that determines how many people can use an AI model at once, how long a conversation it can handle, and ultimately how much it costs to run. And until now, every attempt to shrink the cache meant the model got a little dumber.
TurboQuant, published by Google Research in March 2026, figured out how to clear that counter space without throwing away a single ingredient. [1]
What did Google actually find?
The core insight is surprisingly intuitive once you strip away the math.
Imagine you're giving someone directions. "Go 3 blocks east, then 4 blocks north" takes two numbers. But "go 5 blocks at a 37-degree angle" also takes two numbers -- and that second form is much easier to compress. The distance (5) and the angle (37) are independent of each other, while "3 east, 4 north" are tangled together.

That's essentially what TurboQuant does. It rotates the data into a form where the components become independent and predictable. Once they're predictable, you can represent them with far fewer bits. [2]
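A toy NumPy sketch of that idea (not the paper's actual transform, which uses a randomized rotation): start with correlated 2-D data, rotate it onto its covariance eigenbasis, and the components decorrelate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated 2-D data: like "3 east, 4 north", the coordinates are tangled.
mix = np.array([[1.0, 0.0],
                [0.9, 0.5]])
x = rng.standard_normal((10_000, 2)) @ mix

# Rotate onto the covariance eigenbasis (a PCA-style rotation).
cov = np.cov(x, rowvar=False)
_, eigvecs = np.linalg.eigh(cov)
y = x @ eigvecs

# After the rotation, the cross-correlation is ~0: each component can be
# quantized on its own, with no bookkeeping about the other.
off_diag = np.cov(y, rowvar=False)[0, 1]
```

Same information, same number of coordinates -- but the rotated version quantizes cleanly because no component carries hidden dependence on another.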
Previous methods tried to compress the tangled version directly, which meant they had to store extra bookkeeping data -- correction factors for every chunk of memory. That bookkeeping alone ate 1-2 bits per number. TurboQuant's rotation eliminates the need for bookkeeping entirely. [1]
Then a second stage catches whatever small errors remain, storing just the direction of each error -- not the magnitude, just "was this slightly too high or too low?" One bit. That's enough to keep the final results unbiased. [2]
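The detail that makes one bit enough is unbiasedness. The paper's residual coding is its own construction; as a minimal stand-in, here is the classic one-bit version of the same idea -- stochastic rounding, where each value makes a single up-or-down choice with probabilities tuned so the reconstruction is exactly right on average:

```python
import numpy as np

rng = np.random.default_rng(1)

def stochastic_round(x, step):
    """Round each value to one of its two neighboring grid points.

    The up/down choice is one bit per value, with probabilities chosen
    so that the expected quantized value equals x exactly.
    """
    lo = np.floor(x / step) * step
    p_up = (x - lo) / step                      # probability of rounding up
    return lo + step * (rng.random(x.shape) < p_up)

x = rng.standard_normal(200_000)
q = stochastic_round(x, step=0.5)
bias = (q - x).mean()   # hovers around 0: no systematic drift
```

Individual values are still off by up to half a step, but the errors cancel instead of accumulating -- which is what keeps long chains of attention computations from drifting.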
The researchers -- Amir Zandieh (Google Research), Majid Daliri (NYU), Majid Hadian (Google DeepMind), and Vahab Mirrokni (Google Fellow) -- proved this approach lands within 2.7x of the theoretical limit. [2] You can't do much better without breaking the laws of information theory.
And the whole thing is model-agnostic. No retraining, no fine-tuning, no calibration. You plug it into any LLM and it works. Accepted at ICLR 2026, which tells you the theory is solid. [3]
How much difference does 6x actually make?
Numbers are easy to gloss over, so let me make these concrete.
Llama 3.1-70B -- a serious, production-grade model -- could handle about 109,000 tokens of context on a single GPU. With TurboQuant, that jumps to 536,000 tokens. Same GPU. Same model. Nearly five times the context for conversations and documents. [1]
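The arithmetic behind that jump is just the compression ratio, assuming the cache dominates GPU memory. The effective rate of ~3.25 bits per value below is my own back-solved assumption, not a figure from the paper:

```python
bits_fp16 = 16
bits_turbo = 3.25   # illustrative effective rate; the exact budget varies
ratio = bits_fp16 / bits_turbo                  # ~4.9x smaller cache

tokens_before = 109_000
tokens_after = round(tokens_before * ratio)     # ~536K tokens
```

Capacity scales almost linearly with the compression ratio because, at long contexts, the cache -- not the weights -- is the binding constraint.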
On benchmarks, 3.5-bit TurboQuant scored 50.06 on Llama 3.1. The full uncompressed 16-bit version? Also 50.06. Not "close." Not "comparable." The same number. [2][4]
The previous best method, KIVI, scored 48.50 at 3 bits. That might not sound like a big gap, but in a field where every fraction of a point gets argued over in papers, it's a generation apart. [2]
On H100 GPUs, TurboQuant delivers up to an 8x speedup in the attention computation. Tested across Gemma, Mistral, Llama 3.1, and Qwen 3.5 -- zero degradation on every benchmark they threw at it. [1]
Compression research almost always involves a trade. You give up some accuracy for some space. The whole field is built on negotiating that trade gracefully. TurboQuant found a way to stop negotiating.
Why did the open-source community move so fast?
Google hasn't released official code yet -- expected Q2 2026. The community didn't wait.
Within days, a llama.cpp fork had TurboQuant running on Apple Silicon. [5] PyTorch reimplementations appeared from scratch. Triton kernels went up on GitHub. A vLLM feature request is tracking native integration. [6]
Some developers pushed all the way to 2-bit precision within hours. At that extreme, quality starts to degrade -- but at 3-3.5 bits, everything stays clean. [5]
When people race to implement a paper before the authors release code, it tells you the problem being solved is real. Everyone running inference at scale has been fighting this exact memory bottleneck. TurboQuant isn't solving a theoretical problem. It's solving Tuesday.
What does this change about who can run AI?
We spent the last few years in a scaling-up era -- bigger models, more GPUs, more memory, more power. TurboQuant is part of a quieter shift: scaling down the overhead.
A serving cluster that handled N users now handles 5N. A model that required a cloud GPU might run locally. Applications that were too expensive become viable -- always-on agents, real-time processing, long-document analysis on a laptop.
This matters most for the teams that aren't Google or OpenAI. The ones trying to run capable models on constrained hardware. The ones deciding between "we need a bigger GPU" and "we need to find a cheaper way." TurboQuant just made the cheaper way real.
The paper is public. The implementations are appearing. The question now is how fast this becomes the default.
Update: Two days after this post, TurboQuant tanked memory chip stocks -- SanDisk fell 11%, Micron 7%, SK Hynix 6%. The market's reaction, and why history suggests it's wrong, is worth reading alongside this.
Key takeaways
- TurboQuant compresses LLM memory usage 6x with zero accuracy loss -- no retraining, no fine-tuning
- The trick: rotate data into a form that's naturally easier to compress, then catch residual errors with a single bit
- Llama 3.1-70B context jumps from 109K to 536K tokens on the same GPU
- Community implementations already running before Google released official code
- This shifts AI economics: same models, same hardware, dramatically more capacity
Frequently asked questions
What is TurboQuant?
TurboQuant is a compression algorithm from Google Research that reduces LLM memory usage by 6x during inference, with zero accuracy loss. It works by rotating data into a more compressible form and correcting residual errors with a single bit. It plugs into any transformer model without retraining or calibration. [1]
How does TurboQuant achieve zero accuracy loss?
It rotates data vectors so their components become statistically predictable, then compresses them using precomputed lookup tables -- no lossy per-block calibration needed. A second stage stores just the direction of each remaining error (one bit), which keeps the final computations mathematically unbiased. Together, the two stages land near the theoretical limit of lossless compression. [2]
What models does TurboQuant work with?
TurboQuant is model-agnostic. It has been tested on Llama 3.1, Gemma, Mistral, and Qwen 3.5 with identical results across all benchmarks. No retraining, no calibration, no model-specific tuning required. [1][2]
Is TurboQuant open-source?
Google hasn't released official code yet (expected Q2 2026). Community implementations are already running on Apple Silicon (llama.cpp), in PyTorch, and as Triton kernels, with active work on native vLLM integration and other mainstream inference frameworks. [5][6]
What does TurboQuant mean for AI costs?
A GPU that served N users now serves 5N. A model limited to 109K tokens now handles 536K. Same hardware, same accuracy, dramatically lower cost per interaction. This changes who can afford to run AI at scale -- not just frontier labs, but smaller teams running on constrained hardware. [1]
I break down things like this on LinkedIn, X, and Instagram -- usually shorter, sometimes as carousels. If this resonated, you'd probably like those too.