When the bottleneck isn’t the GPU: DeepSeek’s DualPath paper
Here’s a claim that sounds wrong until you sit with it. You can buy more GPUs to serve an AI model, and inference can still fail to get faster. The expensive accelerators just sit there, waiting.
The DualPath paper, published in February 2026, digs into exactly this. The part I found most interesting was not the fix. It was the diagnosis.
The thing nobody optimizes for
Almost all public attention in LLM performance goes to compute. Bigger GPUs, better kernels, smarter attention. DualPath makes the case that for one increasingly common workload, the agentic kind, the limiting factor has quietly moved somewhere else: storage I/O.
The reasoning comes from a production trace, and the numbers are worth stating plainly because they do the arguing for you. In their agentic workloads, a session runs around 157 turns on average and the context grows to roughly 32,700 tokens, but each new turn adds only about 429 tokens. So more than 98 percent of the context is the same material from previous turns. They measure a KV-cache hit rate of 98.7 percent.
That sounds like great news. Almost everything is cacheable, so the system should not have to recompute it. And it does not. But the cache has to live somewhere, and in this architecture it lives on remote storage.
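The trace averages above are easy to sanity-check. This is only back-of-envelope arithmetic on the figures quoted in this post, not how the paper measures its hit rate; the variable names are mine.

```python
# Back-of-envelope check of the trace averages quoted above.
context_tokens = 32_700      # approximate context length per session
new_tokens_per_turn = 429    # approximate tokens appended by each turn

reused_tokens = context_tokens - new_tokens_per_turn
reuse_fraction = reused_tokens / context_tokens

print(f"reused from earlier turns: {reuse_fraction:.1%}")      # 98.7%
print(f"new this turn:             {1 - reuse_fraction:.1%}")  # 1.3%
```

The reused share lands right next to the measured hit rate, which is what you would expect if nearly all prior context is served from cache.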
Before the model can do anything with a new turn, it has to pull that giant cache from storage into GPU memory. For DeepSeek-V3.2 660B, the paper puts the pressure at around 22 GB of cache movement per PFLOP of compute.
Read that again.
The GPU is starved for data, not math. It spends its time reading, not thinking.
This is the “huge brain, tiny straw” picture some explainers use, and for once the metaphor is accurate. The brain is fine. The straw is the problem.
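To see how wide the straw has to be, the 22 GB per PFLOP ratio can be turned into a bandwidth requirement. The sketch below assumes a sustained compute rate per GPU and a GPU count per node that I made up for illustration; neither number comes from the paper.

```python
CACHE_GB_PER_PFLOP = 22.0  # ratio quoted from the paper above

def required_storage_gb_per_s(sustained_pflop_per_s: float) -> float:
    """Storage read bandwidth needed to keep compute fed at the given rate."""
    return CACHE_GB_PER_PFLOP * sustained_pflop_per_s

# Hypothetical numbers, chosen only for illustration.
per_gpu_pflop_per_s = 0.5
gpus_per_node = 8

per_gpu = required_storage_gb_per_s(per_gpu_pflop_per_s)
per_node = per_gpu * gpus_per_node
print(f"{per_gpu:.0f} GB/s per GPU, {per_node:.0f} GB/s per node")
# 11 GB/s per GPU, 88 GB/s per node
```

Whatever the real compute rate is, the relationship is linear: every bit of extra sustained compute demands proportionally more storage bandwidth.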
The asymmetry that makes it fixable
Modern serving often splits the work across two pools of machines. Prefill engines process the prompt. Decode engines generate tokens one at a time. Each pool has its own storage network cards.
Here is the catch DualPath exploits. In the standard setup, only the prefill side reads the cache from storage. So the prefill engines’ storage cards are pinned at the limit while the decode engines’ storage cards sit nearly idle.
You paid for both. You are using half.
The fix, in one sentence
Let the idle decode engines help with the reading.
DualPath adds a second route: load the cache through the decode engine’s storage card, then hand it over to the prefill engine across the fast compute network using RDMA. Now both sets of cards are working instead of one.
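A toy model makes the appeal concrete. It assumes the read can be split across storage cards in proportion to their bandwidth and that the hand-off over the compute network is free, which is the idealized best case rather than anything the paper claims. The bandwidth and cache sizes are hypothetical.

```python
def load_time_s(cache_gb: float, nic_gb_per_s: list[float]) -> float:
    """Ideal time to read cache_gb when the read is split across the given
    storage NICs in proportion to their bandwidth."""
    return cache_gb / sum(nic_gb_per_s)

# Hypothetical figures for illustration only.
cache_gb = 100.0
prefill_nic = 50.0   # GB/s
decode_nic = 50.0    # GB/s

single_path = load_time_s(cache_gb, [prefill_nic])
dual_path = load_time_s(cache_gb, [prefill_nic, decode_nic])
print(single_path, dual_path, single_path / dual_path)  # 2.0 1.0 2.0
```

With equal cards on both sides, the ceiling is a 2x faster load. Any cost on the hand-off eats into that.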
The obvious objection is that the compute network is also where the model does its own time-critical communication, including all-to-all traffic that token generation depends on. Dumping a flood of cache traffic onto that fabric would just trade one jam for another.
Their answer is traffic control. Cache transfers run on low-priority lanes that only use bandwidth the model is not using, plus a scheduler that decides per request which path makes sense based on current load. Cache loading fills the gaps without shoving aside the work that has a deadline.
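Here is a minimal sketch of what "decide per request based on current load" could look like. It is a greedy least-estimated-wait heuristic of my own, not the paper's scheduler, and it does not model the low-priority lanes at all. The class, function, and bandwidth values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class PathLoad:
    name: str
    bandwidth_gb_s: float    # hypothetical usable storage-NIC bandwidth
    queued_gb: float = 0.0   # cache data already waiting on this path

    def eta_s(self, request_gb: float) -> float:
        """Estimated seconds until a new request of this size would finish."""
        return (self.queued_gb + request_gb) / self.bandwidth_gb_s

def choose_path(paths: list[PathLoad], request_gb: float) -> str:
    """Send the request down whichever path would finish it soonest."""
    best = min(paths, key=lambda p: p.eta_s(request_gb))
    best.queued_gb += request_gb
    return best.name

paths = [PathLoad("prefill", 50.0), PathLoad("decode", 50.0)]
choices = [choose_path(paths, 10.0) for _ in range(4)]
print(choices)  # ['prefill', 'decode', 'prefill', 'decode']
```

Even this crude version spreads identical requests evenly across both paths. A real scheduler would also have to account for how busy the compute network is at that moment.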
What it actually buys
The reported gains are up to 1.87x offline throughput and, on average, 1.96x online serving throughput while still meeting latency targets.
No new hardware. No model changes. No cache compression. The implementation is roughly 5,000 lines of changes on top of their existing system, and it scales to deployments running large numbers of agents at once.
I want to be precise here because it is easy to flatten this into a simple percentage-improvement story. The honest version is the throughput multipliers above: close to double in the good cases, achieved by making fuller use of machines you already bought.
Why I’m writing about a datacenter paper
Let me be clear about what this is not. This is not something you apply to your Ollama box. Disaggregated prefill and decode pools, RDMA fabrics, and remote cache storage are not part of a single-machine local setup. If you came here hoping for a config flag, there is not one.
What does carry over is the mental model, and that is the reason I bothered.
As soon as you start running agents instead of one-shot prompts, the cost structure changes shape. Each step reuses a big pile of accumulated context and adds almost nothing new. The work stops being only “generate tokens” and becomes “move the history around fast enough.”
You feel a tiny, local version of this the first time a long-context session on your own hardware slows to a crawl, not because the model got dumber, but because something has to shuttle all that state through memory on every turn.
The takeaway I keep coming back to is this: we spent years assuming compute is the thing you scale. For agentic workloads, the realized value of your compute increasingly depends on whether you can feed it.
DualPath is one answer to that. The more useful part is the question it makes obvious:
Before you buy more GPUs, are you sure the GPUs are the part that is waiting?
The paper is open if you want the details. Whether the technique ends up in the open-source serving stacks the rest of us actually use is the part worth watching.
Paper Link: https://arxiv.org/abs/2602.21548
