Hyeongseok Gwak | AI platform engineer
SK hynix

Hyeongseok Gwak, TL at SK hynix Memory Systems Research, leads LLM inference analysis and AI serving platform R&D. Gwak focuses on AI inference optimization and scalable data analytics platforms for efficient, real-world deployment.

Appearances:

Future of Memory and Storage - Day 3 @ 09:15

Enhancing AI Inference Performance via CMM-Ax Acceleration

As LLMs increasingly handle long-context workloads, memory pressure on KV caches has emerged as a critical bottleneck for performance and scalability. We propose an architecture that offloads the KV cache from HBM to CMM-Ax and performs sparse attention operations directly on Processing-Near-Memory (PNM). By exploiting the inherent characteristics of sparse attention, we design an architecture that maximizes PNM bandwidth utilization and fully capitalizes on PNM's scalable capacity. Built atop an Ethernet-based, node-level disaggregated architecture, the end-to-end system integrates real PNM hardware, a RoCE v2 stack, and device-level optimizations. We implement split-batch routing and parallel execution with GPU attention to maximize GPU utilization and, consequently, alleviate Head-of-Line (HoL) blocking during long-context inference, significantly improving overall system efficiency.

Hyeongseok Gwak, AI platform engineer, SK hynix

last published: 23/Jun/26 15:35 GMT
