Harry Kim is the CPO and co-founder of XCENA. He is an experienced system software architect with 10 years of SoC expertise. He has developed hardware and software solutions for memory, storage, and processor systems. Before joining XCENA, he worked at SK Hynix as an SSD controller and firmware architect.
As AI transitions toward long-context models, fixed HBM capacity creates a critical "memory wall" and resource underutilization. This session introduces an architectural solution: scaling KV caches through transparent CXL memory-storage subsystems. We present a hardware-managed framework that unifies DRAM and NAND into a single, byte-addressable "InfiniteMemory" tier. The approach is implemented as a CXL backend for LMCache and compared against RDMA-based frameworks such as NIXL and Mooncake.

Unlike RDMA-based disaggregation, which is often limited by software overhead and network jitter, our CXL architecture manages memory mapping and latency masking directly within the hardware controller. We will explore:
- Performance: CXL as a cost-efficient, high-performance alternative to RDMA networks.
- Latency Management: Hardware-level techniques to mask NAND latency for seamless LLM inference.
- TCO Optimization: Offloading memory from the GPU to enable a multi-terabyte, platform-agnostic KV cache.

This presentation provides architectural insight for building next-generation, memory-centric datacenters for the future of AI infrastructure.