Mohamad El-Batal is Chief Systems Technologist in Seagate's Office of the CTO, focusing on storage systems software and emerging memory solutions. In this role, he helps shape the systems strategy and its future foundational technology roadmaps.
As LLMs scale to billions of parameters and handle complex, multi-turn workloads, inference efficiency is no longer determined by compute power alone, but also by how intelligently the KV cache is managed across memory and storage tiers. This talk explores a novel architecture that places KV caching at the critical junction between GPU memory and hybrid storage. Using Linux volume groups and SPDK for NVMe over Fabrics, we treat SSD and HDD tiers as active memory extensions rather than passive backends. Frequently accessed KV states remain in the fast tiers, while less active data moves to cost-efficient storage, so cached attention state can be reloaded instead of recomputed. Integrated with the Dynamo KV Block Manager and dynamic logical volumes, this approach reduces time-to-first-token and power consumption while easing pressure on GPU memory (HBM). The result is higher concurrency, with more simultaneous users served without sacrificing responsiveness. The system adapts to real-time workload patterns, improving throughput and lowering operational cost, and offers a practical, scalable solution for production LLM deployment.
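The tiering idea in this abstract can be illustrated with a small sketch. This is a toy model only, not the Dynamo block manager API and not SPDK: the class name `TieredKVCache`, the tier names, the block-count capacities, and the least-recently-used demotion policy are all assumptions made for illustration, since the abstract does not specify the placement policy. Blocks enter the fastest tier, the least recently used block is demoted one tier down when a tier overflows, and a block that falls off the slowest tier is dropped and would have to be recomputed.

```python
from collections import OrderedDict

_MISSING = object()  # sentinel so that any value, including None, can be cached


class TieredKVCache:
    """Toy model of KV-block placement across ordered tiers (fastest first).

    Hypothetical illustration: names and the LRU policy are assumptions.
    """

    def __init__(self, capacities):
        # capacities: list of (tier_name, max_blocks), fastest tier first.
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities]

    def _remove(self, block_id):
        # Take the block out of whichever tier currently holds it.
        for _, _, store in self.tiers:
            if block_id in store:
                return store.pop(block_id)
        return _MISSING

    def _insert(self, level, block_id, kv):
        _, cap, store = self.tiers[level]
        store[block_id] = kv  # newest entry sits at the end of the OrderedDict
        if len(store) > cap:
            # Demote the least recently used block of this tier.
            victim_id, victim_kv = store.popitem(last=False)
            if level + 1 < len(self.tiers):
                self._insert(level + 1, victim_id, victim_kv)
            # Otherwise the block leaves the slowest tier and is dropped;
            # a later request for it would require recomputation.

    def put(self, block_id, kv):
        # New or updated blocks always land in the fastest tier.
        self._remove(block_id)
        self._insert(0, block_id, kv)

    def get(self, block_id):
        # A hit in any tier promotes the block back to the fastest tier.
        kv = self._remove(block_id)
        if kv is _MISSING:
            return None  # miss: caller must recompute the KV state
        self._insert(0, block_id, kv)
        return kv

    def tier_of(self, block_id):
        # Report which tier holds the block, or None if it is not cached.
        for name, _, store in self.tiers:
            if block_id in store:
                return name
        return None
```

For example, with `TieredKVCache([("hbm", 2), ("ssd", 2), ("hdd", 2)])`, inserting three blocks pushes the oldest one from the `hbm` tier to the `ssd` tier, and a later `get` on that block promotes it back to `hbm`. A production system would size tiers in bytes, move data asynchronously, and likely use a richer policy than plain LRU.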