James Coomer | SVP, Products
DDN

With over a decade of experience at top tech companies like Sun Microsystems and Dell, James leads the technology roadmap and analytics for AI and High-Performance Computing (HPC) solutions at DDN. Holding a PhD in Theoretical Physics, he brings deep scientific insight into advancing full-solution performance across a wide range of industries, from Life Sciences to Finance. Since joining DDN in 2017, James has played a key role in shaping cutting-edge storage solutions designed to meet the demanding needs of AI-driven environments.

Appearances:

Future of Memory and Storage - Day 2 @ 10:05

Flash-Native Inference: Redesigning the I/O Path for Large Context LLM Serving

As LLM context windows scale to millions of tokens, the KV cache working set outgrows GPU HBM and spills to storage. Most current deployments treat this as a memory hierarchy problem and reach for DRAM or NVMe as a drop-in buffer — but this approach leaves most of the performance available in modern flash devices on the table. Fundamentally, this is a tiering problem: as the KV cache working set exceeds HBM capacity, the system must tier data across HBM, DRAM, and flash — and the efficiency of that tiering determines end-to-end inference latency at scale.

This talk examines the mismatch between how inference engines access KV cache (fine-grained, latency-sensitive, highly random at the page level but with exploitable locality at the sequence level) and how commodity NVMe is typically driven. We explore a set of optimisations that close this gap: reshaping access patterns to align with flash, optimising the network data path, parallelising across NVMe, and reducing I/O amplification through smarter KV cache eviction policies.

We present results using common AI inference frameworks and growing use cases, and outline what a flash-native KV cache fabric should look like in practice.

James Coomer, SVP, Products, DDN

last published: 19/May/26 18:25 GMT
