Randy Kreiser has over three decades of experience in the high-performance computing (HPC) and storage industry. He is currently Field Chief Technology Officer at Graid Technology, specializing in high-speed storage architecture and file systems. His experience puts him in a unique position to understand the needs of HPC customers. For most of his HPC career, Randy has primarily supported the DoD and Intelligence Community. He has architected some of the world's largest high-performance data-ingest storage systems and file systems. As an early adopter of solid-state disk and software-defined storage, he has continued to focus on delivering the fastest storage systems in the industry.
Explosive AI growth requires us to reinvent the rules of storage. As context windows and concurrent sessions grow, LLM inference is quietly hitting a wall where KV cache, not FLOPs, becomes the real performance bottleneck, and the traditional options (more GPUs, more HBM, shorter prompts) are all painfully expensive.

In this session, Supermicro and Graid Technology present a tiered KV cache design that turns dense NVMe-backed GPU servers into a high-performance KV cache tier, letting you scale context, concurrency, and sessions per node without blowing up your GPU budget. Using Supermicro NVMe-dense GPU platforms with Graid SupremeRAID™, the architecture turns SSDs into a high-throughput, resilient KV cache tier with full enterprise RAID protection (RAID 0/1/5/6/10).

We will also discuss the five tiers of KV cache storage and how a large-scale disaggregated inference workflow partitions the KV cache data:

1. HBM on GPUs
2. CPU DRAM on the storage server
3. Local SSD on the storage server
4. KV cache storage using DPUs
5. Network storage, which can be file or object
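
To make the "KV cache, not FLOPs" claim concrete, here is a rough back-of-the-envelope sketch of how KV cache size grows with context length. It uses the standard estimate for a transformer decoder (one key and one value vector per layer, per KV head, per token). The model shape and the 128,000-token context below are hypothetical values chosen for illustration, not figures from the session.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_value=2):
    """Estimate KV cache size in bytes for one sequence.

    The leading factor of 2 accounts for storing both a key and a value
    tensor in every layer. bytes_per_value=2 assumes FP16/BF16 storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


# Hypothetical model shape (assumption): 32 layers, 8 KV heads, head dim 128.
per_token = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, num_tokens=1)

# Hypothetical long-context session (assumption): 128,000 tokens.
per_session = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, num_tokens=128_000)

print(f"per token:   {per_token / 2**10:.0f} KiB")      # 128 KiB
print(f"per session: {per_session / 2**30:.3f} GiB")    # 15.625 GiB
```

Under these assumptions a single long-context session needs roughly 15.6 GiB of KV cache, so a handful of concurrent sessions can exceed the HBM left over after model weights, which is the pressure that motivates spilling KV cache to lower tiers such as DRAM and NVMe.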