Hyeongseok Gwak, TL at SK hynix Memory Systems Research, leads LLM inference analysis and AI serving platform R&D. He focuses on AI inference optimization and scalable data analytics platforms for efficient, real-world deployment.
As LLMs increasingly handle long-context workloads, memory pressure on the KV cache has emerged as a critical bottleneck for performance and scalability. We propose an architecture that offloads the KV cache from HBM to CMM-Ax and performs sparse attention operations directly on Processing-Near-Memory (PNM) hardware. By exploiting the inherent characteristics of sparse attention, the design maximizes PNM bandwidth utilization and takes full advantage of PNM's scalable capacity. Built atop an Ethernet-based, node-level disaggregated architecture, the end-to-end system integrates real PNM hardware, a RoCE v2 stack, and device-level optimizations. We implement split-batch routing and run PNM attention in parallel with GPU attention to maximize GPU utilization and thereby alleviate Head-of-Line (HoL) blocking during long-context inference, significantly improving overall system efficiency.
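The abstract does not specify the routing policy, so the following is only a minimal sketch of what split-batch routing could look like, assuming requests are partitioned by context length against a threshold. The names `Request`, `split_batch`, and `threshold` are hypothetical and are not taken from the described system.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Request:
    req_id: int
    context_len: int  # number of tokens currently held in the KV cache


def split_batch(batch: List[Request], threshold: int) -> Tuple[List[Request], List[Request]]:
    """Partition a batch into (gpu_part, pnm_part) by context length.

    Assumption: short-context requests keep their KV cache in HBM and run
    attention on the GPU, while long-context requests have their KV cache
    offloaded and run sparse attention on PNM. The real policy may differ.
    """
    gpu_part = [r for r in batch if r.context_len < threshold]
    pnm_part = [r for r in batch if r.context_len >= threshold]
    return gpu_part, pnm_part


# Example: a hypothetical 32K-token threshold.
batch = [Request(0, 512), Request(1, 131072), Request(2, 4096)]
gpu_part, pnm_part = split_batch(batch, threshold=32768)
```

With the two sub-batches dispatched concurrently, a single long-context request no longer stalls the short requests queued behind it, which is the HoL-blocking effect the abstract aims to reduce.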