David Wang is a Technical Director at Silicon Motion, where since 2020 he has led HW/FW co-design and FW architecture for PCIe Gen5, Gen6, and next-generation enterprise SSD controllers. His work focuses on high-performance I/O processing, controller micro-architecture, and optimized HW/FW co-design for data-center workloads. From 2006 to 2019, David held senior technical roles at PMC-Sierra, Microsemi, and Microchip, where he focused on high-performance firmware architecture for enterprise storage silicon, including SAS expanders, SAS HBAs, PCIe storage switches, and NVMe controllers. He has extensive experience spanning host interfaces, storage protocols, and controller firmware design for large-scale data-center systems.
Agentic AI workloads increasingly require near-GPU storage access with ultra-fine-grain I/O, often well below the conventional 4 KiB block size. Emerging GPU-initiated storage models highlight the need to sustain very high IOPS while efficiently utilizing PCIe Gen6/Gen7 bandwidth, potentially with fewer SSDs per GPU domain.
However, mainstream data-center SSDs are architected around 4 KiB host I/O, targeting millions of IOPS per device at higher PCIe generations. When host I/O sizes shrink to 512 B or smaller, overall throughput degrades sharply: each fixed per-command cost is amortized over one-eighth of the payload or less, exposing PCIe command and completion overheads, interrupt pressure, and the limited per-I/O processing capacity of the SSD controller. As a result, achieving even a modest fraction of peak PCIe bandwidth under fine-grain I/O becomes a fundamental architectural challenge.
This presentation introduces a set of multi-level I/O coalescing techniques for next-generation SSD controllers that address these bottlenecks. The approach spans NVMe queue-level aggregation and command-level coalescing, substantially reducing PCIe transaction overhead per effective I/O while preserving host-visible fine-grain I/O semantics.
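To make the overhead argument concrete, here is a minimal toy model, not the presenter's design. It assumes a fixed per-command cost of one 64 B NVMe submission queue entry, one 16 B completion queue entry, and a hypothetical `EXTRA_OVERHEAD_BYTES` standing in for doorbells, TLP headers, and interrupts (an illustrative assumption, not a measured value). The `coalesce` helper is a hypothetical command-level coalescer that merges byte-contiguous requests up to `max_bytes` and records the original request IDs so status could still be reported per request. How that per-request status is conveyed is not modeled here.

```python
from dataclasses import dataclass

# NVMe fixed entry sizes (per the NVMe base specification).
SQE_BYTES = 64  # submission queue entry
CQE_BYTES = 16  # completion queue entry

# Hypothetical extra per-command link cost (doorbells, TLP headers, interrupts).
# Illustrative assumption only; real values depend on the platform.
EXTRA_OVERHEAD_BYTES = 64


@dataclass
class Request:
    req_id: int
    offset: int  # byte offset on the device
    length: int  # bytes


@dataclass
class Coalesced:
    offset: int
    length: int
    members: list[int]  # original request IDs, kept for per-request status


def coalesce(reqs: list[Request], max_bytes: int = 4096) -> list[Coalesced]:
    """Merge byte-contiguous requests into commands of at most max_bytes.

    Sorting by offset assumes no ordering constraints between requests.
    """
    out: list[Coalesced] = []
    for r in sorted(reqs, key=lambda r: r.offset):
        last = out[-1] if out else None
        if (last is not None
                and last.offset + last.length == r.offset
                and last.length + r.length <= max_bytes):
            last.length += r.length
            last.members.append(r.req_id)
        else:
            out.append(Coalesced(r.offset, r.length, [r.req_id]))
    return out


def link_efficiency(cmds: list[Coalesced]) -> float:
    """Payload bytes / (payload + fixed per-command overhead) in this toy model."""
    payload = sum(c.length for c in cmds)
    overhead = len(cmds) * (SQE_BYTES + CQE_BYTES + EXTRA_OVERHEAD_BYTES)
    return payload / (payload + overhead)


if __name__ == "__main__":
    # Sixteen contiguous 512 B requests.
    reqs = [Request(i, i * 512, 512) for i in range(16)]
    uncoalesced = coalesce(reqs, max_bytes=512)  # one command per request
    merged = coalesce(reqs, max_bytes=4096)      # up to 4 KiB per command
    print(len(uncoalesced), round(link_efficiency(uncoalesced), 3))
    print(len(merged), round(link_efficiency(merged), 3))
```

Under these assumed overheads, coalescing the 16 requests into two 4 KiB commands raises the modeled payload fraction from about 0.78 to about 0.97. The point is the amortization trend rather than the specific numbers, which move with the assumed overhead.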