With over a decade of experience in top tech companies like Sun Microsystems and Dell, James leads the technology roadmap and analytics for AI and High-Performance Computing (HPC) solutions at DDN. Holding a PhD in Theoretical Physics, he brings deep scientific insight into advancing full-solution performance across a wide range of industries - from Life Sciences to Finance. Since joining DDN in 2017, James has played a key role in shaping cutting-edge storage solutions designed to meet the demanding needs of AI-driven environments.
As LLM context windows scale to millions of tokens, the KV cache working set outgrows GPU HBM and spills to storage. Most current deployments treat this as a memory hierarchy problem and reach for DRAM or NVMe as a drop-in buffer, but this approach leaves most of the performance available in modern flash devices on the table. Fundamentally, this is a tiering problem: once the working set exceeds HBM capacity, the system must place data across HBM, DRAM, and flash, and the efficiency of that tiering determines end-to-end inference latency at scale.
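To make the capacity argument concrete, here is a rough back-of-the-envelope sketch of KV cache size per sequence. The model shape (80 layers, 8 KV heads, head dimension 128, FP16) and the 80 GB HBM figure are illustrative assumptions, not numbers from the talk.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_elem: int) -> int:
    """Bytes of KV cache one token occupies: a key and a value vector
    per KV head, per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


# Hypothetical model shape and GPU capacity, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM, FP16_BYTES = 80, 8, 128, 2
HBM_BYTES = 80 * 10**9

per_token = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, FP16_BYTES)
context_tokens = 1_000_000
total = per_token * context_tokens

print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"KV cache for {context_tokens:,} tokens: {total / 10**9:.1f} GB")
print(f"Tokens that fit in HBM (ignoring weights): {HBM_BYTES // per_token:,}")
```

Under these assumptions a single million-token sequence needs roughly 328 GB of KV cache, several times the assumed HBM capacity even before model weights are counted.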
This talk examines the mismatch between how inference engines access KV cache (fine-grained, latency-sensitive, highly random at the page level but with exploitable locality at the sequence level) and how commodity NVMe is typically driven. We explore a set of optimisations that close this gap: reshaping access patterns to align with flash, optimising the network data path, parallelising across NVMe devices, and reducing I/O amplification through smarter KV cache eviction policies.
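One way to picture access pattern reshaping is to coalesce page-level requests into fewer, larger contiguous reads. The sketch below is a minimal illustration, not the method presented in the talk; it assumes KV pages are stored on flash at `page_id * page_size`, and the function name `coalesce_reads` is hypothetical.

```python
from typing import Iterable, List, Tuple


def coalesce_reads(page_ids: Iterable[int], page_size: int) -> List[Tuple[int, int]]:
    """Merge requests for adjacent pages into (offset, length) extents.

    Assumes page i lives at byte offset i * page_size on the device.
    """
    extents: List[Tuple[int, int]] = []
    for pid in sorted(set(page_ids)):
        offset = pid * page_size
        if extents and extents[-1][0] + extents[-1][1] == offset:
            # This page directly follows the previous extent: grow it.
            start, length = extents[-1]
            extents[-1] = (start, length + page_size)
        else:
            # Gap in the layout: start a new extent.
            extents.append((offset, page_size))
    return extents


# Six scattered page requests collapse into two sequential reads.
requests = [7, 3, 4, 5, 9, 8]
extents = coalesce_reads(requests, page_size=4096)
print(extents)
```

Here six 4 KiB requests become two 12 KiB reads. Whether larger sequential reads actually help depends on the device and on how pages are laid out, which is why the on-flash layout is stated as an assumption.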
We present results using common AI inference frameworks and growing use cases, and outline what a flash-native KV cache fabric should look like in practice.