Thomas Prohofsky serves as a technologist in Seagate’s Office of the CTO, focusing on cloud-native storage architectures and AI data pipelines. In this role, he researches innovative solutions to the increasing demand for data from machine learning deployments on private and edge clouds.
As LLMs scale to billions of parameters and handle complex, multi-turn workloads, inference efficiency is no longer determined by compute power alone; it also depends on how intelligently the KV cache is managed across memory and storage tiers. This talk explores a novel architecture that places KV caching at the critical junction between GPU memory and hybrid storage. Using Linux volume groups and SPDK for NVMe over Fabrics, we treat SSD and HDD tiers as active memory extensions rather than passive backends. Frequently accessed KV states remain in fast layers, while less active data moves to cost-efficient storage, which eliminates redundant attention recomputation. Integrated with the Dynamo KV Block Manager and dynamic logical volumes, this approach reduces time-to-first-token and power consumption while easing pressure on GPU memory (HBM). The result is higher concurrency: more simultaneous users without sacrificing responsiveness. The system adapts to real-time workload patterns, improving throughput and lowering operational cost, and offers a practical, scalable solution for production LLM deployment.
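The tiering idea in the abstract can be illustrated with a minimal sketch. It is a toy model only, not the architecture presented in the talk: the class `TieredKVCache`, its `fast_capacity` parameter, and the `compute_fn` callback are hypothetical names, a plain dictionary stands in for the SSD/HDD-backed volume, and least-recently-used order is assumed as the demotion policy. It shows only the core behaviour: cold entries are demoted to the slow tier instead of being discarded, so a later request reloads them rather than recomputing.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy two-tier cache: a small fast tier backed by a larger slow tier."""

    def __init__(self, fast_capacity):
        self.fast_capacity = fast_capacity
        self.fast = OrderedDict()  # stands in for GPU/host memory, kept in LRU order
        self.slow = {}             # stands in for an SSD/HDD-backed logical volume
        self.recomputes = 0        # how often KV state had to be rebuilt from scratch

    def get(self, prefix_id, compute_fn):
        """Return KV state for prefix_id, recomputing only if no tier holds it."""
        if prefix_id in self.fast:
            self.fast.move_to_end(prefix_id)  # mark as most recently used
            return self.fast[prefix_id]
        if prefix_id in self.slow:
            value = self.slow.pop(prefix_id)  # promote from the slow tier
        else:
            self.recomputes += 1
            value = compute_fn(prefix_id)     # full recomputation (the costly path)
        self._put_fast(prefix_id, value)
        return value

    def _put_fast(self, prefix_id, value):
        self.fast[prefix_id] = value
        self.fast.move_to_end(prefix_id)
        while len(self.fast) > self.fast_capacity:
            cold_id, cold_value = self.fast.popitem(last=False)  # least recently used
            self.slow[cold_id] = cold_value   # demote instead of discarding


if __name__ == "__main__":
    cache = TieredKVCache(fast_capacity=2)
    for prefix in ["a", "b", "c", "a"]:
        cache.get(prefix, lambda p: f"kv({p})")
    print("recomputes:", cache.recomputes)
    print("fast tier:", list(cache.fast), "slow tier:", list(cache.slow))
```

In a real deployment the slow tier would involve actual I/O and the promotion and demotion decisions would depend on measured access patterns; the sketch makes no attempt to model latency, block layout, or the Dynamo integration.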