Sandeep Kaipu is an Engineering Manager at Broadcom. With over 20 years of experience in enterprise AI infrastructure and engineering leadership, he has driven large-scale platform innovations, holds multiple patents, and is the author of the book “AI Engineering Leadership”. He advises academic programs, speaks at leading technology conferences, and focuses on aligning AI engineering strategies with enterprise business value. His work spans enterprise platforms, AI inference at scale, and strategies for securing the next generation of agentic AI systems.
Modern large language models (LLMs) such as DeepSeek-R1, Llama-3.1-405B, and emerging trillion-parameter architectures are fundamentally constrained by memory capacity, memory bandwidth, and data movement latency. Serving these models increasingly requires distributed inference architectures spanning multiple GPU nodes, where memory hierarchy and interconnect performance become the dominant system bottlenecks.
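As a rough illustration of the capacity constraint, the sketch below estimates the memory needed just to hold model weights and the minimum number of GPUs that implies. It is a back-of-the-envelope calculation, not part of the reference architecture: the function names are hypothetical, the 80 GB per-GPU capacity and the 8-GPU node size are assumptions chosen for illustration, and KV cache, activations, and runtime overhead are deliberately ignored, so real deployments need more memory than shown.

```python
import math


def weight_footprint_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory needed for the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


def min_gpus_for_weights(num_params: float, bytes_per_param: float,
                         gpu_memory_gb: float) -> int:
    """Lower bound on GPU count; ignores KV cache and activations."""
    return math.ceil(weight_footprint_gb(num_params, bytes_per_param) / gpu_memory_gb)


# Assumed values for illustration only.
PARAMS_405B = 405e9      # parameter count of a 405B-parameter model
GPU_MEMORY_GB = 80       # assumed per-GPU memory capacity
GPUS_PER_NODE = 8        # assumed GPUs in a single node

for label, bytes_per_param in [("16-bit", 2), ("8-bit", 1)]:
    size_gb = weight_footprint_gb(PARAMS_405B, bytes_per_param)
    gpus = min_gpus_for_weights(PARAMS_405B, bytes_per_param, GPU_MEMORY_GB)
    fits = "fits in" if gpus <= GPUS_PER_NODE else "exceeds"
    print(f"{label}: {size_gb:.0f} GB of weights, at least {gpus} GPUs, {fits} one node")
```

Under these assumptions, 16-bit weights for a 405B-parameter model occupy about 810 GB, more than the 640 GB available in one 8-GPU node, which is why such a model has to be sharded across nodes or quantized before serving.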
This session presents a deep technical exploration of how distributed GPU memory systems and high-speed interconnect technologies enable large-scale AI inference. Drawing from a production reference architecture deployed on VMware Private AI infrastructure, the talk examines how GPUDirect RDMA over InfiniBand enables direct GPU-to-GPU memory transfers across nodes, bypassing staging copies through CPU memory and minimizing PCIe overhead.
The session analyzes the architectural building blocks required for scalable AI memory systems, including NVIDIA HGX platforms, NVLink/NVSwitch intra-node GPU fabrics, RDMA-enabled inter-node networking, and distributed GPU memory orchestration across Kubernetes clusters.