Dongjoo Seo is an engineer and researcher at Samsung, focusing on system-level performance for large-scale AI workloads. His current interests center on KV-cache efficiency for foundation model inference, including memory footprint reduction, bandwidth-aware caching, and end-to-end latency optimization across modern memory and storage hierarchies.
As AI paradigms shift from computing-centric to data-centric, memory architecture innovation has become a critical imperative. Modern LLM services demand not only HBM's extreme bandwidth but also unprecedented capacity expansion for KV Cache and RAG workloads. In this session, we propose a CXL-based Heterogeneous Memory Hierarchy to address these challenges. We introduce a Computing Offloading mechanism at the memory device level to minimize data movement between the CPU and memory, significantly improving latency and effective bandwidth utilization, two common bottlenecks in large-scale AI inference. Furthermore, we present a scalable capacity strategy that uses CXL Memory Pooling to overcome the limits of an individual node through dynamic resource allocation. Moving beyond theory, we provide empirical evaluation data from real-world environments, showing improved AI inference performance and resource efficiency over conventional architectures. Building on our previously published research in IEEE (2025) regarding RAG optimization, we conclude with practical architectural guidelines for the next generation of data-centric AI infrastructure.
Transformer-based generative AI has turned the key/value (KV) cache into one of the largest and most performance-critical working sets in modern AI systems. As context windows grow and request concurrency rises, KVCache capacity and bandwidth increasingly determine latency, throughput, and total cost, often driving decisions around GPU/HBM sizing, host memory, and storage tiering. This session brings together system builders and memory/storage architects to examine KVCache management end to end: data layout and access patterns; paging, allocation, and eviction; compression and quantization; multi-GPU and multi-node sharing; tiering and offload to host DRAM and NVMe/SSD; and reliability, isolation, and security considerations in multi-tenant deployments. We will connect software techniques to emerging hardware directions (e.g., higher-bandwidth memory, pooling/tiering, and disaggregated memory/storage) and highlight where cross-layer co-design is needed. Attendees will leave with a practical taxonomy of KVCache techniques, guidance on when to use each approach, and a set of metrics and workload characteristics for evaluating solutions in production.
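To make the capacity pressure concrete, the following is a minimal back-of-the-envelope sketch of how KV cache size scales with context length and concurrency. It assumes a standard transformer decoder that stores one key and one value vector per layer, per KV head, per token, with no compression or paging overhead. The `ModelShape` class, the `kv_cache_bytes` helper, and the model dimensions are hypothetical and chosen only for illustration; they do not describe any specific model or any system presented in the session.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Illustrative model dimensions (hypothetical, not a specific model)."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    bytes_per_element: int = 2  # assumes FP16/BF16 storage


def kv_cache_bytes(shape: ModelShape, context_tokens: int,
                   concurrent_requests: int) -> int:
    """Estimate the total KV cache footprint in bytes.

    Each token stores one key and one value vector (the factor of 2)
    for every layer and every KV head.
    """
    bytes_per_token = (2 * shape.num_layers * shape.num_kv_heads
                       * shape.head_dim * shape.bytes_per_element)
    return bytes_per_token * context_tokens * concurrent_requests


# Hypothetical shape: 32 layers, 8 KV heads, head dimension 128.
shape = ModelShape(num_layers=32, num_kv_heads=8, head_dim=128)

for context_tokens in (8_192, 131_072):
    gib = kv_cache_bytes(shape, context_tokens, concurrent_requests=16) / 2**30
    print(f"{context_tokens:>7} tokens x 16 requests: {gib:.0f} GiB")
```

Under these assumptions, each token costs 128 KiB, so 16 concurrent requests need about 16 GiB at an 8K context and about 256 GiB at a 128K context. The footprint grows linearly in both context length and concurrency, which is why the larger case exceeds the HBM of a single accelerator and motivates tiering, pooling, or offload.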