
Registration open throughout the day in the Santa Clara Convention Center, First Floor.
Chair's Remarks
On edge devices, KV cache reuse is a critical optimization for reducing prefill latency and memory bandwidth in large language model (LLM) inference. Existing systems support only longest-prefix reuse, which fails to capture the partial, shifted, suffix, and mid-span overlaps common in real-world workloads such as conversational AI, retrieval-augmented generation (RAG), and templated prompts. This paper presents a hierarchical span indexing architecture that enables efficient detection and validation of arbitrary reusable token intervals without incurring quadratic storage costs.
Our approach combines three components: (1) a segment tree representation with composable rolling hashes for logarithmic-time reconstruction of arbitrary span fingerprints, (2) prefix and suffix tries for fast boundary match discovery, and (3) a coarse block-level hash index for constant-time retrieval of interior span candidates.
Together, these structures enable robust reuse of KV cache segments even when requested spans are shifted or misaligned relative to stored segments. The system requires only linear storage per sequence and achieves 30% lower prefill latency with 2× KV reuse for scalable edge LLM serving.
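The first component above can be illustrated with a minimal sketch, assuming a polynomial rolling hash whose fingerprints compose under concatenation. The names `SpanHashTree`, `combine`, `BASE`, and `MOD` are hypothetical and not taken from the paper; the sketch only shows why a segment tree of composable hashes can rebuild the fingerprint of any token span from O(log n) stored nodes.

```python
# Assumed parameters for illustration only (not from the paper).
MOD = (1 << 61) - 1   # large Mersenne prime modulus
BASE = 1_000_003      # polynomial base

EMPTY = (0, 0)        # (hash, length) of the empty span; identity for combine()


def combine(left, right):
    """Fingerprint of the concatenation of two adjacent spans."""
    left_hash, left_len = left
    right_hash, right_len = right
    # Shift the left hash by the length of the right span, then add the right hash.
    # A production version would cache the powers of BASE instead of calling pow().
    return ((left_hash * pow(BASE, right_len, MOD) + right_hash) % MOD,
            left_len + right_len)


class SpanHashTree:
    """Iterative segment tree whose nodes store (hash, length) pairs."""

    def __init__(self, tokens):
        self.size = 1
        while self.size < max(len(tokens), 1):
            self.size *= 2
        self.nodes = [EMPTY] * (2 * self.size)
        for i, tok in enumerate(tokens):
            # +1 so that token id 0 still contributes to the hash.
            self.nodes[self.size + i] = ((tok + 1) % MOD, 1)
        for i in range(self.size - 1, 0, -1):
            self.nodes[i] = combine(self.nodes[2 * i], self.nodes[2 * i + 1])

    def fingerprint(self, lo, hi):
        """Fingerprint of tokens[lo:hi], built from O(log n) tree nodes."""
        left_acc, right_acc = EMPTY, EMPTY
        lo += self.size
        hi += self.size
        while lo < hi:
            if lo & 1:
                left_acc = combine(left_acc, self.nodes[lo])
                lo += 1
            if hi & 1:
                hi -= 1
                right_acc = combine(self.nodes[hi], right_acc)
            lo >>= 1
            hi >>= 1
        return combine(left_acc, right_acc)


tree = SpanHashTree([5, 9, 2, 7, 9, 2, 7, 1])
# The span [9, 2, 7] occurs at offsets 1 and 4; a prefix-only index would miss the second.
shifted_match = tree.fingerprint(1, 4) == tree.fingerprint(4, 7)
```

Equal fingerprints only indicate a candidate match; a real system would still validate the tokens before reusing the corresponding KV segment.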
Artificial intelligence is reshaping every layer of the computing stack, as well as the storage and memory hierarchy. To meet emerging AI workloads, NVIDIA has introduced the Storage-Next architecture, adding a new tier focused on ultra-high-IOPS random-access performance for small 512-byte I/Os. In this work, we address the key challenges that conventional SSD controllers face in achieving the theoretical maximum PCIe bandwidth with small block sizes. An advanced error-correction scheme tailored for fine-grained, latency-critical storage is essential to transform today’s SSDs into Storage-Next-ready devices. With this advancement, we demonstrate exceptionally low latency while delivering robust reliability. Additionally, we introduce neural network (NN)-assisted LDPC error correction, highlighting the potential of AI-powered approaches to enhance ECC reliability in next-generation AI storage and memory systems.
CXL offers a very broad range of features that simplify computing architectures that have been hard to implement. How are system architects likely to use it? Will it have a beneficial or harmful effect on the memory chip market? In this presentation, Jim Handy takes an in-depth look at the CXL market and estimates its potential adoption across a broad range of applications, pulling together a forecast that argues soundly for rapid market growth through 2030. Attendees will gain insights to help them create a successful CXL strategy.
Soft-input reads in NAND flash memory require multiple read operations, creating significant performance overhead. Consequently, NAND flash memory devices usually operate with a limited number of read operations, and various DSP techniques are applied to maintain high read accuracy. In this work, we introduce a deep neural network (DNN)-based read flow that enhances read performance under a limited number of reads by dynamically adapting read thresholds on a per-row basis. Unlike conventional controllers that apply uniform thresholds across an entire NAND block, the proposed system uses a compact DNN estimator to determine optimal thresholds for each row in real time. This reduces read-retry rates, improves throughput, and operates efficiently in streaming mode. A lightweight threshold table stores per-block indices, minimizing metadata overhead while enabling fast per-row threshold selection. Additionally, the system employs a small number of optimized “mock reads” that provide highly accurate threshold estimation even under severe device stress. The proposed DNN-based read flow offers a scalable, efficient solution for next-generation NAND flash controllers.
Growing AI demand for servers absorbs significant capacity in the global memory industry, with the resulting supply constraints cascading across consumer applications.
This presentation provides an updated overview of the memory market. The session presents a comprehensive analysis of the NAND flash market, encompassing market size projections from 2026 to 2031, along with observations on market dynamics driven by bit supply growth and price fluctuations. In addition, the analysis extends to the current AI accelerator chip market and potential future technological developments, such as High Bandwidth Flash (HBF), to assess the market gap induced by AI computing demand. The presentation is further supported by an examination of process technology advancements and the capacity expansion plans of key industry players, with the objective of analyzing the future supply-demand equilibrium and its broader implications for market applications.
AI infrastructure buildouts face NAND and DRAM supply constraints and demand faster deployment. Storage decisions can no longer assume ideal component availability or long qualification timelines. This session focuses on practical drive characterization and selection strategies that work under real‑world constraints, emphasizing QLC SSDs, limited‑DRAM designs, and large‑scale deployments.
The discussion centers on workload‑aligned metrics—latency consistency, sustained throughput, write behavior, and endurance—rather than peak benchmarks. We also explore how these trade‑offs map to emerging AI system architectures, including NVIDIA‑declared ICMS, where storage behavior directly impacts system efficiency and cost.
The session covers techniques to accelerate drive qualification at scale, including OCP-inspired workload tuning, simplified preconditioning, and faster qualification. These approaches reduce time‑to‑deployment while maintaining confidence in drive behavior across fleets.
The file system is a fundamental and crucial technology for managing digital data, but the current file system stack has multiple drawbacks. Machine learning (ML) technologies need to be adopted in the file system stack, and AI agents are becoming a new customer of digital data. The cognitive file system concept could enhance old and robust file system technologies with the goal of managing data more efficiently and satisfying new types of customers. The cognition feature implies that the file system can “recognize” repeatable patterns and relations in raw data streams. Users can still store raw data in streams or files, but without needing to name them. The cognition subsystem can detect repeatable patterns in raw data streams and build a dictionary. Patterns in the dictionary can work as keywords that build relations among data streams, much as in a relational database. The cognitive file system concept can build a completely new technological foundation for offloading computation into the storage space, adopting ML models for data analysis and processing, and providing a flexible interface for more efficient interaction between AI agents and data.
The rise of Physical AI marks a fundamental shift from systems that merely sense and report data to platforms capable of making real-time decisions in the physical world. As intelligence moves from the cloud to the edge, vehicles and autonomous machines are no longer passive data collectors but active decision-makers operating under strict latency, reliability, and safety constraints. This transition dramatically increases the volume and criticality of data generated at the edge, with next-generation platforms potentially producing terabytes of data per hour that must be processed, stored, and accessed locally without delay.
This presentation examines how Physical AI reshapes storage requirements across emerging domains such as robotaxis, full self-driving systems, humanoid robots, and autonomous mobile robots. While these applications differ in form factor and use case, they increasingly share common storage behaviors, including large local AI model storage, high random read performance for real-time inference, high endurance for continuous operation, and power-efficient design for edge deployment.
KV cache is the attention "memory" LLMs build during prefill, enabling token generation during decode without recomputing prior context. As prompts grow longer and sessions become multi‑turn or agentic, KV cache size expands rapidly to include previous prompts and answers as "memory" for subsequent token generations, putting a strain on scarce GPU HBM and forcing tradeoffs across latency (TTFT), throughput, and cost. This pressure is driving tiered memory system architectures.
In this talk, we will share Micron's measured evidence that capacity‑centric designs, combined with targeted upgrades to lower‑cost tiers of fast, persistent storage, can materially improve both near‑term performance and long‑term returns on CapEx. We will also discuss drive‑level optimizations that complement some random and sequential access patterns of KV cache I/Os. Attendees will take away actionable tiering heuristics grounded in balancing TCO across the memory and storage hierarchy.
Emerging Storage-Next architectures, exemplified by NVIDIA’s vision of 100 M random read IOPS with 512-byte granularity, are redefining the performance requirements of solid-state storage. While such throughput will likely be delivered through multi-device aggregation, maximizing the IOPS capability of each individual SSD remains essential for scalability, efficiency, and tail-latency control.
Achieving this level of fine-grained performance requires fundamental changes across the SSD stack, from host interfaces to controller processing and internal data paths. Many of these challenges can be addressed through architectural scaling. The most stringent constraints, however, arise at the NAND interface, where array read granularity, data transfer unit, and error-correction granularity must be carefully aligned to sustain extreme random access rates without compromising latency predictability.
The primary limitation of BCH lies in its lack of native soft-decision decoding, which raises concerns for long-term reliability under intensive, memory-like access patterns, even when operating with SLC or pseudo-SLC NAND.
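To put the 100 M IOPS at 512-byte figure in perspective, the following back-of-the-envelope sketch converts it to payload bandwidth and to a device count. The per-device IOPS value is a purely hypothetical placeholder, not a number from the abstract, and the function names are illustrative.

```python
def payload_bandwidth_gb_s(iops, io_bytes):
    """Payload bandwidth in GB/s (10^9 bytes/s), ignoring protocol overhead."""
    return iops * io_bytes / 1e9


def devices_needed(target_iops, per_device_iops):
    """Smallest number of devices whose combined IOPS meets the target."""
    return -(-target_iops // per_device_iops)  # integer ceiling division


TARGET_IOPS = 100_000_000          # from the abstract: 100 M random read IOPS
IO_BYTES = 512                     # from the abstract: 512-byte granularity
HYPOTHETICAL_SSD_IOPS = 12_000_000 # assumption for illustration only

aggregate_gb_s = payload_bandwidth_gb_s(TARGET_IOPS, IO_BYTES)
ssd_count = devices_needed(TARGET_IOPS, HYPOTHETICAL_SSD_IOPS)
```

Under these assumptions the aggregate payload is modest in bandwidth terms, which is why the difficulty lies in per-I/O processing cost rather than raw link throughput.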
The talk will focus on the predominant use cases for CXL memory in the marketplace today. It will highlight related announcements made by several major hyperscaler companies and examine their use cases in detail. The talk will also look at the new interest in CXL-based memory pooling in the AI/HPC and distributed database communities, and examine the value proposition of memory pooling for these workloads.
In this work, we review integrated approaches to overcome challenges we have encountered in developing HARC etching process technology. In addition, to push beyond these limits, enabling technologies such as Metal Induced Crystallization (MIC), cell-to-cell bonding, and High Bandwidth Flash (HBF) should be introduced.
This presentation introduces Exceedance Plots as a powerful tool for understanding and debugging SSD performance. By highlighting infrequent large latency events, Exceedance Plots provide a valuable supplement to traditional debug methods. The presentation showcases real-world examples of Exceedance Plots in action, demonstrating their ability to identify complex tenant behavior and inform latency monitoring requirements. Real-world characterization measurements and useful monitoring points for scenarios of concern will be discussed.
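As a minimal sketch of the idea, assuming an exceedance plot here means the fraction of I/Os whose latency exceeds each observed value (the complement of the empirical CDF), the points to plot can be computed as below. The function name and sample data are hypothetical; the abstract does not specify how the presenters build their plots.

```python
def exceedance_points(latencies_us):
    """Return (latency, fraction of samples strictly greater) pairs, sorted by latency."""
    ordered = sorted(latencies_us)
    n = len(ordered)
    points = []
    for i, value in enumerate(ordered):
        # For repeated values, emit only the last occurrence so the fraction
        # counts samples strictly above this latency.
        if i + 1 < n and ordered[i + 1] == value:
            continue
        points.append((value, (n - i - 1) / n))
    return points


# Three fast reads and one rare slow read (made-up numbers, in microseconds).
points = exceedance_points([100, 100, 100, 900])
```

Plotting these fractions on a logarithmic vertical axis is what makes rare, large latency events stand out, since an average or a linear histogram would bury them.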
Fixed Function Compute is a SNIA project to define methods for using SSD computation and memory resources to offload storage-related compute operations. The project started by looking at RAID XOR calculations, hash calculations, and compression/decompression functions. Additional operations may be examined in the future. Bringing these computational operations closer to where the data is stored will reduce host CPU utilization as well as host memory and network bandwidth utilization.
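For readers unfamiliar with the RAID XOR operation mentioned above, here is a minimal host-side sketch of what would be offloaded: parity is the byte-wise XOR of the data blocks, and any single lost block is the XOR of the parity with the surviving blocks. The function name is illustrative and is not part of any SNIA interface.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


data = [b"\x01\x02", b"\x0f\x00", b"\xf0\xff"]
parity = xor_blocks(data)

# Simulate losing data[1] and rebuilding it from the survivors plus parity.
rebuilt = xor_blocks([data[0], data[2], parity])
```

Every byte of every block passes through the host CPU and memory in this version, which is the traffic an in-drive fixed function would avoid.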
Flash memory has become a safety‑critical subsystem in autonomous vehicles, software‑defined vehicles and robotic mobility platforms, storing bootloaders, firmware, calibration data and safety‑relevant applications that directly impact controllability and hazard prevention. To address risks such as corruption, retention loss and latent faults, flash storage must integrate robust safety mechanisms (ECC, parity checks, redundancy) and comply with ISO 26262 hardware metrics. An AI‑enabled framework can automate key functional safety work products, including technical safety concepts, safety requirements, safety analyses (FMEDA, FMEA, DFA) and safety cases, when applied to semiconductor memory and safety‑critical hardware IP. This approach accelerates FMEDA development, improves random hardware fault modeling and delivers measurable benefits: up to 40% reduction in documentation effort, around 30% fewer review/rework cycles, and significantly improved consistency and audit readiness, strengthening ASIL‑oriented competitiveness in automotive and semiconductor markets.
Explosive AI growth requires us to reinvent the rules of storage. As context windows and concurrent sessions grow, LLM inference is quietly hitting a wall where KV cache, not FLOPs, becomes the real performance bottleneck; and the traditional options (more GPUs, more HBM, shorter prompts) are all painfully expensive.
In this session, Supermicro and Graid Technology present a tiered KV cache design that turns dense NVMe-backed GPU servers into a high-performance KV cache tier, letting you scale context, concurrency, and sessions per node without blowing up your GPU budget. Using Supermicro NVMe-dense GPU platforms with Graid SupremeRAID™, the architecture turns SSD into a high-throughput, resilient KV cache tier with full enterprise RAID protection (0/1/5/6/10). We will also discuss the 5 tiers of KV cache storage and how a large-scale disaggregated inference workflow partitions the KV cache data:
1. HBM on GPUs
2. CPU DRAM on the storage server
3. Local SSD on the storage server
4. KV cache storage using DPUs
5. Network storage, which can be file or object
Direct liquid cooling (DLC) is the hot new thing. The E1.S and E3 form factor specifications have both added DLC support, to keep data warm and drives cool running at PCIe 6. But power is expected to increase as SSD bandwidth increases for PCIe 7 and PCIe 8. How will systems cool those higher power devices without tempers flaring? This talk will go through how Micron views the future of liquid and air cooling, discussing their interdependence and the challenges faced when optimizing a product for both cooling solutions.
Compute Express Link (CXL) promises to unlock memory disaggregation and composability at hyperscale, but deploying it in production fleets introduces a new class of system-level challenges. In this talk, we share practical insights from hyperscale environments on where CXL meets reality—covering issues such as latency variability, reliability at scale, firmware/software maturity, and observability gaps. We will discuss how these challenges impact large-scale AI and memory-intensive workloads, and outline the validation and design strategies required to make CXL viable in production data centers.
Despite ongoing claims from AI chip startups, no computing paradigm today bypasses DRAM (or SRAM) for relevant AI workloads. As a result, DRAM effectively defines the performance, energy, and cost ceiling of modern AI systems. Even approaches such as resistive in-memory computing or near-memory computing ultimately depend on DRAM once model sizes become meaningful.
Recognizing these limits, the industry has begun exploring 3D NAND flash as a more scalable memory foundation, particularly in the context of HBF. While 3D NAND offers clear advantages in density and cost, it cannot replace HBM: it is not suited for volatile workloads, suffers from access energy constraints, and does not offer significantly higher bandwidth.
We introduce 3D CapRAM, a capacitive in-memory computing technology that leverages the structural and economic advantages of 3D NAND while avoiding its drawbacks. By enabling both non-volatile and volatile AI workloads without a memory wall, 3D CapRAM defines a new computing paradigm.
We present benchmarking results across multiple models and configurations, demonstrating a consistent performance and efficiency advantage over existing or proposed architectures, including HBM/HBF.
An overview of the architecture and implementation of a high-speed device emulator, as well as early results from testing with NVIDIA's SCADA framework. SCADA significantly raises the I/O expectations for an SSD and for the emulator used to test the application and infrastructure. We cover example emulator use cases and show the value the emulator brings to SCADA.
AI algorithm complexity is increasing rapidly, making the data flow between compute elements and storage devices a bottleneck in keeping the processors fed. Concurrently, heat and power are placing constraints on data center design. These requirements demand more efficient data orchestration and offloading of the main compute elements by providing computation and pre-processing near the storage. MIPS has released highly efficient processors which can be optimized for these requirements. In this talk we will explore how MIPS CPU cores based on the open, extensible RISC-V architecture can provide high-throughput, low-latency data management, enabling tasks such as quantization, compression, and encryption to be executed close to data storage. We explore a highly parallel system in which tightly coupled accelerators are used alongside MIPS RISC-V processors, providing optimized processing with very low latency. The system scales to deliver the needed throughput, providing an effective and power-efficient offload for the main processors so that the data center can meet the increasing demands of AI growth.
When storage fails in systems that cannot be physically accessed, recovery often becomes impossible. Modern platforms increasingly deploy storage in locations where access is limited, costly, or unavailable, including soldered-down devices, sealed systems, automotive platforms, and dense datacenter deployments. While these designs offer clear benefits in integration and robustness, they also introduce a critical challenge: a firmware “brick” or critical failure can leave drives unresponsive, rendering traditional, access-dependent recovery methods ineffective.
This presentation introduces a firmware recovery flow for storage devices operating in inaccessible environments. The approach enables a controlled recovery mode that can function even when normal operation fails, providing two key capabilities: extracting diagnostic logs from otherwise unresponsive drives, and clearing fatal error states to restore drive functionality in place. By treating recovery as a core firmware capability, this work outlines a practical path toward more resilient and serviceable storage systems where physical access cannot be assumed.
Composable fabric infrastructure for AI and cloud disaggregates compute, GPU, memory, and networking into modular, software-defined resource pools. Protocols such as PCIe and CXL eliminate hardware limits, enable memory pooling and sharing, improve GPU utilization, and enable independent scaling of resources while allowing dynamic, real-time allocation tailored to specific AI workloads. This talk presents a security blueprint for fabric-connected memory and storage, focusing on three questions: (1) what must be protected on the link vs. at endpoints, (2) how to manage identities/keys and establish secure sessions across heterogeneous devices, and (3) how to operationalize security enforcement without breaking performance and RAS targets.
Because floating-gate scaling stopped at 28nm for embedded memory [1], a scalable flash memory has become an urgent need. In this presentation, we propose a solution: a 16nm FinFET-based flash memory, i.e., a versatile Resistive-Gate Memory (RG-RAM) supporting both NOR and NAND functionality. The unit cell uses a 3D 1TnR structure, i.e., multiple MIM resistors built on a FinFET platform. The MIM is forming-free with a ~100x window, which is then magnified via the FinFET to provide a wide window with an ION/IOFF ratio of 10^5X. This enables 16-level QLC operation. Low voltage (<2V), high speed (10ns) at the chip level, excellent endurance (>10^8), and retention make it suitable for embedded or standalone solutions. The architecture has a unit cell size comparable to that of FG NAND, with a shorter stacked-layer height at the same density as NAND (e.g., BICS). More importantly, it has the potential to replace the FG in further scaling to 16nm and beyond, i.e., this design can be easily extended to FinFET and nanosheet generations down to 1nm, in either standalone or embedded applications, without a scaling limitation.
KV cache offload to storage has emerged as a key mechanism to accelerate and reduce the cost of LLM inference. By strategically treating high-speed storage as a vast repository of pre-computed model state, inference platforms can instantly recall information for recurring prompts or multi-turn dialogues, drastically reducing the time-to-first-token and cost per token. However, due to the significant KV cache sizes, the massive data volumes generated by long-context models can quickly saturate even the fastest storage during high-concurrency workloads.
In this talk, we discuss the storage challenges involved in maintaining predictable KV cache access time at scale. We also show how a SOTA open-source inference platform can, without any changes, leverage a novel distributed tiered storage system that seamlessly combines a fast, low-latency, and high-endurance storage pool for hot KV cache data with a capacity-optimized pool for cost-effective storage of massive KV cache datasets. The storage system combines an optimized software-defined stack with heterogeneous SSDs and showcases the advantages of a flexible DT-SSD architecture that exposes its internal data tiering.
This presentation introduces the PCIe L0p power state at a clear, high level, explaining what it is, how it works within the PCIe link, and why it matters for power efficiency. The goal is to give attendees a practical understanding without requiring deep protocol knowledge.
The session then focuses on how to measure real power into an SSD or PCIe device, including key measurement considerations and workload selection. A short case study will show measured power savings across different workloads, demonstrating how L0p can translate into meaningful efficiency gains in real systems.
Refreshment Break
AI NeoClouds are racing to differentiate, but storage margins often flow upstream. Hyperscalers increasingly bundle Storage-as-a-Service with GPU compute, capturing performance premiums and tightening ecosystem control. Meanwhile, scale-out storage platforms (Weka, VAST Data, DDN, Hammerspace) power NeoCloud clusters, abstracting flash volatility while competing on throughput and latency guarantees.
Beneath them sit controller, NVMe, and flash suppliers whose economics depend on density, endurance, and bandwidth scaling. Open-source stacks and bare-metal storage clusters offer an alternative path but with operational trade-offs.
This panel maps the full stack: hyperscale services, NeoCloud offerings, software-defined storage layers, component vendors, and flash technology providers, asking who owns margin, who owns performance, and how AI-era storage models evolve.
Chair's Remarks
AI doesn’t just need more storage—it needs the right medium. Flash brings the essentials: density, speed, and performance per watt, with lower heat penalties than spinning media at comparable throughput. The question is whether today’s architectures let flash behave like the AI-optimized resource it actually is.
The Open Flash Platform (OFP) initiative is unlocking those inherent flash advantages at rack scale—reducing unnecessary data-path hops, minimizing CPU and DRAM overhead, and improving determinism for latency-sensitive AI pipelines. In this panel, ecosystem leaders will separate what’s real from what’s hype: where OFP delivers immediate wins (throughput-per-watt, density, and predictable performance) and which workloads and deployment patterns will adopt first—from AI training and inference to high-throughput analytics and content pipelines.
Panel Topics:
* Eliminating overhead: fewer hops, less CPU/DRAM tax, more predictable latency
* AI pressure test: feeding GPUs with consistent throughput and QoS isolation
* Deployment models: hyperscale, enterprise, and hybrid designs that simplify operations
* What must standardize next: observability
Chair's Remarks
This presentation discusses the impact of NVIDIA’s Storage Next (SCADA) workloads on SSDs. We compare CPU-initiated and GPU-initiated I/O and how the difference influences future drive design. Storage Next uses SSDs as an expansion of GPU memory, allowing larger datasets to be processed. GPUs leverage tens of thousands of threads to potentially queue tens of thousands of requests to storage, submitting thousands of random requests (reads or writes) to each SSD. GPU-based applications must also adapt to maximize I/O concurrency for best performance.
Rob Sykes is Director of Technical Product Marketing at Micron Technology, where he focuses on the definition of Micron's SSD Enterprise Controllers. As a technology leader and visionary for over 25 years, he has been a key figure in the development of multiple generations of PCIe/NVMe products. He holds patents in ASIC/FW architecture and has been a presenter at FMS since 2012, covering FTL, Flash Architectures, Futures of Memory Storage and more. He holds an MSc in Computer Science from Bristol University (UK).
As DRAM shortages and rising costs drive the search for alternative memory technologies, low-latency flash memories, especially those attached via CXL, are emerging as cost-effective solutions for large-scale and AI-driven workloads. CXL-attached flash enables tiered and pooled storage, combining the reliability and speed of flash with CXL’s high bandwidth and low latency to overcome traditional storage bottlenecks. This talk highlights recent Linux kernel enhancements to DAMON, focusing on extending the CXL Hotness Monitoring Unit perf driver. The new patch improves DAMON’s ability to monitor and manage access patterns for long-latency CXL-attached flash memory. These optimizations pave the way for more efficient and scalable memory management in high-performance computing environments.
As we enter 2026, the memory market is undergoing a fundamental "decoupling." The traditional cycle driven by PCs and smartphones is being replaced by a structural supercycle centered on AI infrastructure. This session explores the "Structural Gap"—the critical 2026–2028 window where legacy manufacturing is repurposed for HBM4/4e and Ultra-High-Density NAND, creating a persistent supply-side bottleneck.
Key Discussion Pillars:
The Pivot: Analyzing the migration of wafer starts from low-margin consumer DRAM/NAND to high-value AI components.
The Supply Crunch: How the technical complexity of HBM4 (2048-bit) and advanced NAND stacking effectively limits global bit growth despite increased CAPEX.
The 2028 Outlook: Strategic roadmaps for navigating a market defined by capacity constraints rather than demand fluctuations.
Attendees will gain a definitive framework to understand this new memory hierarchy and actionable insights to bridge the gap in an AI-dominant landscape.
Chi-square methodology has traditionally been widely used in the microelectronics industry for planning demonstration tests. While the chi-square distribution is suitable for sample estimation when product reliability is unknown, it tends to significantly overestimate failure rate projections for products with known reliability characteristics. This often results in over-engineering of products or in substantially higher reliability targets than necessary. This paper proposes the use of the Gamma distribution with a Kerman neutral prior for Reliability Demonstration Test (RDT) planning, particularly for products with existing prior information such as prequalification test data, component-level reliability data, etc. This distribution offers a more accurate and statistically appropriate approach, aligning better with point estimates and reliability assessments when using Maximum Likelihood Estimation. Utilizing this methodology in SSD sample planning helps in achieving reliability targets, either by using smaller sample sizes or by reducing the reliability test durations. This can help meet reliability objectives at a lower cost without compromising product quality and reliability and ad
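The comparison can be sketched numerically under stated assumptions. The classical chi-square upper bound on a constant failure rate, χ²(CL; 2r+2) / (2T) for r failures in T device-hours, equals the CL-quantile of a Gamma distribution with shape r + 1 and rate T. The sketch below assumes the Kerman neutral prior corresponds to a Gamma shape of 1/3 (so the posterior shape is r + 1/3); that reading, the function names, and the example numbers are assumptions, not details confirmed by the abstract. It uses only the standard library, with a power series for the Gamma CDF and bisection for the quantile.

```python
import math

CHI_SQUARE_EQUIVALENT_SHAPE = 1.0   # chi-square bound == Gamma(r + 1, T) quantile
KERMAN_NEUTRAL_SHAPE = 1.0 / 3.0    # assumed shape of the Kerman neutral prior


def gamma_cdf(shape, x):
    """Regularized lower incomplete gamma P(shape, x), via its power series."""
    if x <= 0:
        return 0.0
    term = 1.0 / shape
    total = term
    n = 1
    while term > 1e-17 * total:
        term *= x / (shape + n)
        total += term
        n += 1
    return total * math.exp(-x + shape * math.log(x) - math.lgamma(shape))


def gamma_quantile(shape, p):
    """Invert gamma_cdf by bisection (unit rate)."""
    lo, hi = 0.0, 1.0
    while gamma_cdf(shape, hi) < p:
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if gamma_cdf(shape, mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def failure_rate_upper_bound(failures, device_hours, confidence, prior_shape):
    """Upper bound on failures per hour at the given confidence level."""
    return gamma_quantile(failures + prior_shape, confidence) / device_hours


def required_device_hours(target_rate, failures, confidence, prior_shape):
    """Device-hours needed to demonstrate target_rate with the allowed failures."""
    return gamma_quantile(failures + prior_shape, confidence) / target_rate


# Illustrative numbers only: zero failures, one million device-hours, 60% confidence.
chi2_bound = failure_rate_upper_bound(0, 1e6, 0.6, CHI_SQUARE_EQUIVALENT_SHAPE)
kerman_bound = failure_rate_upper_bound(0, 1e6, 0.6, KERMAN_NEUTRAL_SHAPE)
```

With the same data, the smaller posterior shape gives a lower failure-rate bound, which is the mechanism by which such a plan could call for fewer device-hours or samples to demonstrate the same target.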
As generative AI shifts toward hyper-personalization, the demand for "Personal AI Twins" has surged. However, current implementations face a critical trilemma: privacy concerns over raw data leaving the device, latency bottlenecks in software-based vector search on edge devices, and the lack of a secure monetization model for personal data. Traditional storage architecture remains a passive repository, failing to meet the high-dimensional data processing needs of modern embeddings.
This presentation proposes a novel Hardware-Accelerated Personal AI Twin Storage Architecture. We redefine the SSD as an active "Sovereign Identity Vault" that integrates data acquisition, real-time embedding generation, and high-speed vector retrieval within a single encrypted hardware module. By shifting vector database operations (indexing and similarity search) directly onto the storage controller, we eliminate data movement overhead and ensure raw data remains physically isolated from the host OS and cloud.
Modern SSD controller architectures are increasingly shaped by the high demands of AI workloads, which create conflicting performance, capacity, and cost requirements. To address these competing goals, existing NVMe FDP controllers offer performance isolation but cannot provide QoS differentiation. Further, many SSDs leverage the capability of multi-bit flash devices to support a high-speed SLC mode for internal tiering or caching. However, these capabilities remain invisible to the application layer and often result in substantial, unpredictable performance fluctuations.
This presentation introduces DT-SSD, a novel dual-tier controller architecture designed for modern workloads. By enabling simultaneous user-defined capacity and performance tiers within a single drive, the DT-SSD simplifies device inventory management and enables workload-specific configurations. Leveraging the NVMe FDP specification, the DT-SSD ensures predictable QoS through guaranteed bandwidth and I/O prioritization. We analyze the design trade-offs, provide performance benchmarks, and evaluate how dynamic over-provisioning, wear balancing, and heat-tracking algorithms contribute to overall system efficiency.
Due to the rapid scaling of AI models and the explosive growth of inference workloads, memory systems are reaching their limits in terms of capacity and cost. While HBM (High Bandwidth Memory) excels in ultra-high-speed data processing, its limited capacity has become a bottleneck in building large-scale AI infrastructure. This presentation will introduce HBF (High Bandwidth Flash), a next-generation memory technology designed to bridge this gap.
I will explain HBF’s performance, capacity, and fundamental operational principles, along with its technical characteristics optimized for LLM inference. Additionally, I will discuss the features of AI applications best suited for HBF and present relevant use cases.
Through these insights, I will aim to demonstrate how HBF contributes to the scalability of data centers and provide an outlook on the future development direction of HBF.
In this session Luis will present practical demonstrations and performance analysis of using CXL-based disaggregated memory to accelerate large-scale AI inferencing workloads in HPC and datacenter environments. The approach integrates a CXL JBOM (Just a Bunch of Memory) as an offload target for the KV cache, connecting it to NVIDIA’s Dynamo inference stack via Micron’s FAMFS, enabling the JBOM to operate as a warm memory file system. This architecture removes storage bottlenecks and significantly increases the effective memory bandwidth available to the KV cache. Preliminary results indicate a 5–10× speedup over traditional storage-backed KV cache implementations, highlighting the transformative impact of CXL memory pooling for next-generation inference systems. This work presents a scalable, standards-aligned approach for deploying memory-intensive inference pipelines in modern HPC systems and data centers.
This session breaks down DRAM and NAND market trends into insights that help explain today’s environment and what’s likely ahead. We highlight the key signals that have reliably marked past turning points and show how those same indicators are shaping the current cycle, which is being accelerated and reshaped by AI-driven demand.
The discussion also examines China’s expanding role as both a major consumer and an emerging supplier of memory influencing global trends.
Attendees will walk away with an understanding of where the industry stands today, what history tells us about the next phase, and how China’s trajectory factors into the outlook.
The exponential growth of AI model complexity and data volumes is reshaping the architecture of modern data centers. This session explores how innovations in storage technology—particularly SSDs—are enabling scalable, high-performance AI infrastructure. We examine the evolving storage demands across AI workflows, from training to inference, and highlight architectural considerations including memory and storage tiering, local and remote topologies, access path optimizations, and emerging SSD design strategies. Attendees will gain insights into how advanced data storage solutions are not only improving efficiency but also unlocking the capacity to train and deploy increasingly sophisticated AI models—driving the next wave of AI capabilities.
This study investigates the long-term reliability of Solid-State Drives (SSDs) at end of product life. The aim is to understand the performance of SSDs in the field and to quantify the risks associated with extended operation. By evaluating wear-out indicators and failure modes, the work provides insight into how to build a reliability prediction model. The study incorporates physics-based life modeling techniques commonly applied in SSD reliability analysis, such as those described in SSD lifetime modeling research. We also leverage methodologies inspired by life testing, Reliability Demonstration Testing (RDT), and Ongoing Reliability Test (ORT) used in broader SSD qualification frameworks. Various simulation models are applied to represent drive behavior under prolonged stress conditions, enabling prediction of SSD behavior at the end of the product life cycle. These models help identify which factors, such as workloads, endurance rating variance, and environmental stress, most significantly affect the life of an SSD. Ultimately, the study highlights how accurately we can predict SSD usage in the field.
Deduplication can be a useful technique for data reduction in many environments. However, it can come at an increased hardware and system cost. Furthermore, it can severely degrade performance due to the heavy computation and analysis required from the storage controller to find and track duplicate data. By using computational offloads in the SSD, deduplication can be done with minimal added cost and computation for the storage controller. See how IBM FCM SSD has optimized data reduction using computational storage!
Most AI architectures today are shaped by hyperscale assumptions: elastic compute, stable power, and constant connectivity. While effective in cloud environments, these assumptions may not hold in industrial settings. Industrial AI must deliver decade-long reliability, tolerate harsh operating conditions, and recover autonomously from failures. In these environments, silent failure is unacceptable, and hardware replacement is rarely feasible. Industrial AI therefore diverges from the hyperscale playbook. In this paradigm, memory and storage evolve from passive performance components into anchors of system continuity. They are responsible for preserving AI models, operational data, and recovery states when compute, OS, or networks fail. As a result, the fundamental design metric shifts from peak performance to survivability. By reframing memory and storage as strategic system elements rather than commodities, a new framework for availability, trust, and long-term value emerges. This approach enables the design of Industrial AI systems that are resilient, recoverable, and trustworthy by design, ensuring that operational truth is preserved even under the most challenging conditions.
Modern AI inference is increasingly constrained not by compute, but by persistent and heterogeneous state, including control-plane metadata, embedding retrieval, and write-heavy KV cache updates. When these access patterns converge on shared storage nodes, SSD controllers face conflicting latency and throughput demands, sustained write pressure, and contention across NAND channels and dies. General-purpose SSD behavior becomes a first-order limiter of inference scalability. This talk presents a first-principles exploration of SSD controller design for heterogeneous AI workloads. We analyze NAND plane occupation, channel scheduling, and write amplification under mixed small-block and update-heavy traffic. We advocate state-aware static placement for seamless pSLC/TLC coexistence and show why minimizing channel and die utilization conflict is critical in fabric-capped environments. We also examine how transparent intra-SSD compression reduces physical data movement and backend contention, enabling predictable multiplexing of diverse inference state at scale.
CXL enables scalable, cache-coherent memory expansion beyond DRAM. Conventional management treats CXL as a uniform extension, relying on reactive page migration based on access patterns. This identifies hot pages only after performance degradation, failing to prioritize latency-critical data and causing inefficient tiering. We introduce an AI-driven, proactive hot page placement framework that predicts memory demands, enabling early placement of hot pages into DRAM. To ensure real-time operation, we use XGBoost, a lightweight model running on CPU cores, avoiding GPU overhead and enabling nanosecond inference. For dynamic workloads, a feedback loop refines predictions using performance counters, ensuring adaptability without manual tuning. For stability, the framework integrates with kernel tracing and a CXL-aware allocator, using existing OS mechanisms without major modifications. By proactively placing critical pages in DRAM and warm/cold pages in CXL, the system minimizes latency, improves throughput, and stabilizes tail latency. This optimizes resource utilization in latency-sensitive environments like real-time databases, gaming, and AI inference, unlocking CXL's full potential.
Today’s memory shortages and sky-high prices are driving the chip market to astronomical heights, but this spending spree won’t last forever, and when it ends, the chip market will undergo some very dramatic changes. Join this session to see how Objective Analysis’ Jim Handy takes the lessons learned from past semiconductor cycles to plot the direction of tomorrow’s chip market.
High-performance SSDs use onboard DRAM for FTL tables and internal management, with DRAM capacities increasing as SSD densities grow. As SSDs adopt newer DRAM technologies like DDR5, a large portion of DRAM bandwidth, often 40-50%, remains underutilized by internal SSD operations. The NVMe Controller Memory Buffer (CMB) and Subsystem Local Memory (SLM) features allow this unused DRAM bandwidth to be exposed to host applications. Leveraging CMB and SLM can unlock significant aggregate memory bandwidth across millions of SSDs in data centers, supporting high-performance workloads and new use cases. This paper discusses the architectural considerations and performance benefits of tapping into SSD DRAM bandwidth for enhanced resource efficiency in modern storage systems.
Telemetry gathering during system runtime allows systems managers not just to track the health of their systems, but also to predict some future failures before they happen. This trend is entering the memory domain. By combining health metrics with telemetry processing, systems can correlate seemingly disparate factors such as device temperature, access patterns, correctable and uncorrectable errors, and post package repair, and use long-term logging procedures to connect the dots between these factors. This talk examines trends in adding metrology to systems to enhance system health and reduce costs.
Large-scale approximate nearest neighbor (ANN) search for high-dimensional vectors (more than 1k dimensions) on billion-scale datasets faces fundamental system bottlenecks. Distance computation is constrained by limited DRAM capacity and bandwidth, while frequent movement of vectors between storage, memory, and CPU leads to high latency, high energy consumption, and an excessive memory footprint. These challenges restrict scalability and throughput for modern AI and retrieval workloads. We propose a storage-compute co-design approach that offloads distance computation to NVMe devices via an extended compute command. Instead of transferring full vectors to the host, the SSD performs L2 or cosine similarity calculations internally and returns compact results, significantly reducing PCIe traffic and DRAM bandwidth pressure. By minimizing host-side data movement and compute overhead, this architecture achieves a 2-3x search throughput improvement in large-scale ANN workloads while maintaining recall accuracy. Additionally, decoupling vector storage from host-resident processing reduces index memory occupation in graph-based ANN methods such as HNSW.
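To illustrate the data-movement argument in the abstract above, here is a minimal host-side simulation, not the presenters' implementation. The function `device_topk` is a hypothetical stand-in for the extended compute command: it scans vectors where they are stored and hands back only (id, distance) pairs, so the full vectors never cross the simulated host interface. The tiny dataset and all names are assumptions for illustration only.

```python
import heapq
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity. Assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def device_topk(stored, query, k, metric=l2_distance):
    """Hypothetical device-side handler for a distance-offload command.

    Scores every stored vector against the query and returns only the
    k closest (vector_id, distance) pairs, nearest first.
    """
    scored = ((vec_id, metric(vec, query)) for vec_id, vec in stored.items())
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])


# Illustrative data: four 2-dimensional vectors "resident on the device".
stored = {0: [0.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 2.0], 3: [3.0, 3.0]}
result = device_topk(stored, query=[0.9, 0.1], k=2)
print(result)
```

In this sketch the host receives k small pairs instead of every candidate vector, which is the effect the abstract attributes to returning compact results.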
Defense drones lose satellite lock. Robots reboot mid-process. In contested rugged deployments, agentic AI power-cycles and wakes stateless—context erased, reasoning chains broken, mission aborted.
The industry misdiagnoses this as a software issue. It is a memory architecture failure: DRAM provides low-latency retrieval but volatilizes on power loss; NVMe preserves state but imposes latency that cripples real-time RAG.
This session validates a CXL 3.1-native persistent memory fabric for hostile edge AI. By pooling DRAM with byte-addressable pSLC storage, we ensure zero-loss agent state. Telemetry from 2025 defense airborne exercises shows AI agents sustaining reasoning chains through 40+ power events with zero context loss.
Context retrieval latency drops from 340ms to 80ms. FPGA-based computational storage relocates KNN search adjacent to embeddings, achieving 200GB/s throughput and collapsing PCIe saturation from 89% to 31%. The fabric maintains performance in sealed, conduction-cooled systems from -40°C to +85°C.
Attendees receive a validated reference architecture—with CXL topology and thermal models—to deploy resilient agentic AI where conventional systems fail.
CXL memory is emerging as a foundational building block for next-generation data center systems, but scaling capacity cost-effectively requires simultaneous advances in reliability, availability, serviceability (RAS), and total cost of ownership (TCO). This talk presents two complementary innovations targeting CXL memory expansion. First, we introduce a DRAM-oriented Reed–Solomon list decoding architecture that goes beyond minimum-distance decoding without inventing new ECC codes, significantly strengthening DRAM fault tolerance while maintaining ultra-low latency and high throughput.
Second, we present an in-line, data-dependent adaptive compression architecture designed for CXL.mem devices that expands effective memory capacity while keeping read latency well below 250ns and avoiding CPU overhead. By reducing intra-CXL DRAM read/write amplification and improving effective bandwidth utilization, this approach lowers DRAM footprint and system cost without compromising performance. Together, these innovations outline a practical path for improving both RAS and TCO in future CXL memory systems.
Supply chain risk has undergone a significant paradigm shift in 2025. Tariffs, increasing cyberattacks, and unprecedented global interconnectedness are redefining the requirements for resilience. Technological advancements now provide enhanced visibility and speed; however, they also broaden the attack surface, whereby even a minor sub-supplier can compromise an entire OEM.
As organizations increasingly depend on artificial intelligence, cloud infrastructure, and intricate digital networks, each new capability introduces additional vulnerabilities. In this context, siloed or incidental risk management approaches are insufficient. To maintain resilience, enterprises must adopt a comprehensive, organization-wide perspective of risk—integrating supplier, operational, and cyber vulnerabilities across the entire supply network. Failure to adapt accordingly may result in swift cascading disruptions, jeopardizing profit margins, operational continuity, and long-term growth. To ensure sustained business operations, organizations must adapt to the escalating threats posed by a globally interconnected environment. Additionally, it is imperative for companies to implement a novel strategy
Session Reserved for KIOXIA America
Session Reserved for Samsung Semiconductor Inc
Hyatt Regency Hallway, Mission City Ballroom Lobby
Session Reserved for SK Hynix
Session Reserved for Fadu Technology
Session Reserved for Silicon Motion Inc
Session Reserved for Micron Technologies
Exhibition opens in the main exhibition halls.
High Bandwidth Memory (HBM) is a critical technology in many processors running the latest LLMs and Generative AI applications. In this PDS tutorial we will cover the case for both existing and novel techniques for interfacing DRAM such as HBM to the predominantly non-Von-Neumann compute architectures found in GPU/NPU/TPU (collectively, xPU).
Attendees will learn the key aspects of HBM, including its history, architecture, and market trends, as well as a brief comparison to other popular DRAM memory types such as DDR, LPDDR, and GDDR. As much as NDAs allow, we'll cover emerging memories like Standard Package HBM (SPHBM), Custom HBM (cHBM), High Bandwidth Flash (HBF), and new modules like SoCAMM. We'll use this foundation to discuss AI processor architectures and how they use RAM including the impact of arithmetic intensity, quantization and sparsity on DRAM access. Finally we'll cover architectural techniques for improving memory access in AI applications ("breaking the memory wall"), including improving bandwidth, moving compute closer to RAM and moving RAM closer to compute.
The industry has been dealing with “The Memory Wall” for a long time as the advancement of memory has failed to keep up with advancements in processing.
The basic structure of the DRAM core cell is fundamentally unchanged for the last 50 years, so what tricks has the industry pulled to keep DRAM as the basis for all computing?
There are many markets for memory from cell phones to laptops to edge communications nodes to mainframe servers. How is DRAM configured differently so that the same memory chip can be used by all these markets?
Workshop to be determined
Workshop to be determined
Workshop to be determined
Workshop to be determined
Opening Reception to be held in the Exhibition Hall.
Interested in submitting for an award? Please visit: https://www.terrapinn.com/conference/future-memory-storage/Awards.stm
Registration open throughout the day in the Santa Clara Convention Center, First Floor.
Santa Clara Convention Centre, First Floor
Chair's Remarks
Chair's Remarks
Chair's Remarks
Chair's Remarks
Chair's Remarks
In semiconductor manufacturing, many day‑to‑day decisions follow formal rules, yet in practice they still rely heavily on individual engineer experience. Decisions such as interpreting marginal trends or tuning interacting parameters vary across engineers, leading to inconsistency and human error. Our goal is to reduce this variability by defining data‑driven, shared decision criteria, while also capturing how experienced engineers decide in real situations. To achieve this, we have been developing a stepwise roadmap toward an Autonomous Fab. We began with analytical models addressing pain points identified by field engineers, then introduced LLM‑based AI assistants using proprietary, domain‑specific fab databases with natural language access to process, equipment, and operations. Building on this foundation, we are now developing task‑oriented AI agents that combine LLMs, analytical models, and rule‑based logic to support judgment‑intensive tasks such as trouble lot handling and abnormal trend resolution. AI‑generated recommendations are reviewed by engineers, shifting their role toward validation and exception handling, with the long‑term goal of enabling a fully autonomous fab.
Given exponentially growing genomic data volumes, extensive efforts target accelerating genomic analysis. We identify a major bottleneck limiting genomic analysis accelerators: the data preparation bottleneck, where genomic sequence data is stored compressed and needs to be first decompressed and formatted before an accelerator can operate on it. To mitigate this bottleneck, we propose SAGe, an algorithm-architecture co-design for highly-compressed storage and high-performance access of large-scale genomic sequence data. The key challenge is to improve data preparation performance while maintaining high compression ratios (comparable to genomic compression algorithms) at low hardware cost. We address this challenge by leveraging key properties of genomic data to co-design (i) a lossless (de)compression algorithm, (ii) lightweight decompression hardware, (iii) storage data layout, and (iv) interface commands. SAGe integrates seamlessly with diverse genomic analysis accelerators, improving performance and energy efficiency of two state-of-the-art accelerators by 3.0x–32.1x and 13.0x–34.0x, respectively, compared to when relying on state-of-the-art SW and HW decompression tools.
In the high-stakes race of Generative AI, the industry’s spotlight has been fixated on the "engine"—the high-performance GPUs and massive memory arrays. However, even the most powerful supercar is rendered useless if its dashboard fails or its throttle sticks. In the world of 24/7 AI infrastructure, the Boot Drive has evolved from a simple startup accessory into a mission-critical component that functions as both the throttle and the nervous system of the server.
This session will explore the paradigm shift in NAND requirements driven by large-scale AI clusters. We will move beyond the "lowest cost" mindset to discuss why stability and advanced telemetry in boot media are now the primary safeguards against catastrophic downtime. By redefining the Boot Drive’s role, we can ensure that the massive investments in AI compute are protected by a foundation that is as resilient as the intelligence it supports.
Micron and the Department of Energy’s Pacific Northwest National Laboratory (PNNL) have collaborated to build a multi-host, 12-terabyte disaggregated, tiered, and shared memory prototype system with near memory compute capabilities. Recently deployed at PNNL in Richland, Washington, the system is supporting AI and broader research workloads of several DOE labs.
This presentation will describe the system topology and the key technical challenges encountered while building a reconfigurable shared-memory system. In particular, it addresses how CXL 3.x shared-memory features were enabled in a CXL 2.x environment. The presentation will also include the solutions developed to ensure memory coherency and the software required to run applications on the host processors as well as on the processors embedded in the memory devices. Some practical software performance acceleration numbers will also be presented.
Modern Solid State Drives (SSDs) are increasing the Indirection Unit (IU) to meet growing density and capacity demands. The Linux LBS framework already allows applications to fully utilize these devices and avoid performance-limiting read-modify-write operations. This work extends that foundation by enabling Linux to leverage new hardware atomic capabilities in SSDs, providing application-level data integrity guarantees to avoid torn writes without the overhead of software-based solutions. We are specifically targeting optimizations in PostgreSQL. By offloading atomicity to the device, we aim to reduce write amplification, improve data reliability, and increase performance.
As FMS marks its 20th anniversary, embedded ReRAM is reaching an inflection point of its own. With technology qualification complete, the focus has shifted from device validation to manufacturability, yield ramp, and new product introduction. This presentation examines what happens after qualification: array- and product-level yield optimization, variability management, reliability and test screening, and cost-per-bit reduction required for production deployment. Importantly, it looks at the move from array and chip demonstration to new product designs that will be introduced to production on tight schedules. We will discuss how post-qualification ReRAM programs are enabling integration into advanced logic and BCD platforms and supporting new MCU, analog, and edge AI designs where embedded flash no longer scales. The data and lessons presented reflect a transition from feasibility to commercialization, demonstrating that embedded ReRAM is no longer a laboratory technology, but a manufacturable, product-qualified NVM entering mainstream SoC development. ReRAM is no longer ‘emerging’; it is a production-ready replacement for legacy embedded NVM.
AI infrastructure growth is driven by surging computational demand from increasingly large and complex generative AI models as the industry shifts to massive, compute-intensive AI infrastructure buildouts. Memory (DRAM, NAND) technologies are becoming essential drivers of AI infrastructure growth as increasingly complex AI accelerators demand higher performance, thermal efficiency, and bandwidth density. At the same time, shortages across semiconductor manufacturing, HBM/DDR memory, and storage devices (Flash SSD/HDD) are becoming increasingly critical as AI infrastructure demand outpaces supply. This talk will provide key insights into the technology driving forces and key supply chain dynamics enabling and impacting AI infrastructure growth.
A novel three-dimensional (3D) capacitor-less dynamic random-access memory (1T-DRAM) array is proposed as a high-density and high-performance memory solution. The device incorporates a double-gate and stacked-channel architecture to realize a 3D 1T-DRAM array. This paper has been accepted by IEEE Transactions on Electron Devices and will be published in 2026. The major advantages are listed below:
As SSD capacities scale toward 245TB and beyond, traditional architectures that separate SCM and high-capacity flash across different devices face increasing challenges in bandwidth scaling, latency predictability, endurance efficiency, and system cost. This topic explores mixed-media SSD architectures that integrate a small, high-endurance SLC namespace alongside a large, high-density QLC namespace within a single drive, enabling storage systems to better align flash characteristics with real-world access patterns. It examines how modern data platforms can use the SLC tier for latency-sensitive metadata, write buffering, and small-block operations, while directing large, aligned writes to the QLC tier. With this approach, mixed-media designs reduce write amplification, improve sustained write efficiency at high drive utilization, and make high-capacity QLC viable for performance-sensitive environments.
Enterprises are rapidly experimenting with large language models (LLMs), yet many domain initiatives stall between proof-of-concept and reliable production deployment. The primary blockers are not model capability alone, but domain specificity, factual grounding, controllability, evaluation rigor, and operational governance—especially in regulated or high-stakes environments. This talk presents an end-to-end, tested blueprint for domain adoption of LLMs by combining three complementary pillars: (1) prompt engineering for rapid task alignment and controllable behavior, (2) parameter-efficient fine-tuning to encode domain style and reasoning patterns, and (3) agentic Retrieval-Augmented Generation (RAG) to ensure grounded, traceable answers at scale.
As AI services migrate to the network edge, client devices must deliver enterprise-level performance while operating within the constraints of standardized hardware frameworks. This presentation introduces NVMe Dataset Management (DSM) commands as a strategic communication bridge between the host and firmware to unlock this efficiency. By utilizing DSM to provide "hints" regarding data characteristics, standardized client SSDs can achieve the sophisticated data placement and endurance typically reserved for enterprise environments. We will present exclusive insights into the IO patterns of modern AI model transactions and demonstrate how DSM-hinted data informs firmware to optimize background management. Through a real-world PCIe Gen 5.0 SSD implementation, we will show significant improvements in WAF and sustained performance across diverse edge AI workloads. Attendees will gain actionable strategies for leveraging NVMe standards to meet the aggressive endurance and performance requir
As AI paradigms shift from computing-centric to data-centric, memory architecture innovation has become a critical imperative. Modern LLM services demand not only HBM’s extreme bandwidth but also unprecedented capacity expansion for KV Cache and RAG workloads. In this session, we propose a CXL-based Heterogeneous Memory Hierarchy to address these challenges. We introduce a Computing Offloading mechanism at the memory device level to minimize data movement between the CPU and memory, significantly improving latency and effective bandwidth utilization, both common bottlenecks in large-scale AI inference. Furthermore, we present a scalable capacity strategy using CXL Memory Pooling to transcend individual node limitations through dynamic resource allocation. Moving beyond theory, we provide empirical evaluation data from real-world environments, proving enhanced AI inference performance and resource efficiency over conventional architectures. Building on our previously published research in IEEE (2025) regarding RAG optimization, we conclude with practical architectural guidelines for the next generation of data-centric AI infrastructure.
This presentation will provide an updated overview of emerging non-volatile memory (NVM) technologies and their progress toward commercial adoption. I will review the status of leading solutions – including MRAM, ReRAM, and PCM – with a focus on their technical maturity, scalability, and integration at advanced nodes. Particular attention will be given to embedded NVM adoption at technology nodes ≤28nm, especially in microcontrollers, automotive electronics, and edge-AI devices, where endurance, power efficiency, reliability, and security requirements are reshaping memory choices. The discussion will combine Yole Group’s latest market analysis with insights from reverse engineering studies of embedded emerging NVM devices, offering a unique perspective on real silicon implementation, supplier positioning, and key market inflection points over the coming years.
The explosive growth of AI and data-centric computing is pushing DRAM technology to its physical and architectural limits. Future workloads demand orders-of-magnitude increases in memory density, bandwidth, and energy efficiency, requirements that conventional planar 1x/1y/1z nm DRAM scaling can no longer meet. This talk examines emerging device and integration approaches, including 4F² cell architectures, vertical channel transistors (VCT), capacitor-less DRAM, IGZO-based channels, and 3D-stacked DRAM, as candidates to extend scaling. While these technologies offer promising pathways to meet AI-driven memory performance needs, they also introduce challenges in data retention, cell variability, interconnect scaling, thermal constraints in 3D integration, and manufacturability at high yield. The session highlights the critical device, process, and architecture inflections necessary to enable the next generation of DRAM capable of supporting accelerated AI, HPC, and data center workloads.
While the media cost of QLC is still significantly higher than that of HDD, the cost difference at the system level is smaller. One factor often overlooked in the system-level comparison is the cost associated with servicing IO. HDD and QLC have very different levels of performance as a function of IO size, and this difference can measurably reduce the cost of QLC. In this talk we will show the throughput characteristics of both technologies as a function of IO size. IO size histograms will be shown and combined with the throughput curves to compute the storage capacity required to support the histograms. The results will quantify the reduction in QLC cost due to the difference in performance vs. IO size for QLC and HDD.
The storage industry is entering a period of rapid transformation as near line HDDs, QLC SSDs, and emerging high-capacity solid state alternatives respond to unprecedented pressures from AI and tightening global supply. AI inference workloads are generating massive volumes of warm and cold data, driving severe near line HDD shortages: lead times now exceed 52 weeks, and users at times lock in drive supply more than a year in advance. This shortage, coupled with limited HDD manufacturing expansion, is causing hyperscalers to accelerate the adoption of high-capacity SSD solutions even for cold data tiers. As a result, AI-driven demand and supply chain constraints are reshaping the near line segment. HDDs are expected to remain a key part of the landscape for the near term, but as QLC capacity grows, will there be “Near Line Killers” that displace HDDs? We will examine the trade-offs and trends and discuss possible scenarios for the future.
Artificial intelligence has shifted the center of gravity in computing architecture. Performance is no longer defined solely by processor capability, but by how efficiently data moves across accelerators, memory tiers, and storage domains. As models scale from billions to trillions of parameters, traditional boundaries between compute, memory, storage, and networking are dissolving.
This talk examines how emerging AI workloads are reshaping platform design from the rack outward, highlighting the technical and economic forces driving composable memory, accelerator fabrics, and open interconnect ecosystems. It explores why cohesive, multi-vendor system design across silicon, fabrics, and software is essential for sustainable scaling.
Attendees will gain a system-level view of current architectural limits, emerging design patterns, and the critical role open, high-performance interconnects will play in next-generation AI infrastructure.
Agentic AI workloads increasingly require near-GPU storage access with ultra-fine-grain I/O, often well below the conventional 4 KiB block size. Emerging GPU-initiated storage models highlight the need to sustain very high IOPS while efficiently utilizing PCIe Gen6/Gen7 bandwidth, potentially with fewer SSDs per GPU domain.However, mainstream data-center SSDs are architected around 4 KiB host I/O assumptions, targeting millions of IOPS per device at higher PCIe generations. When host I/O sizes shrink to 512 B or smaller, overall throughput degrades sharply due to PCIe command and completion overheads, interrupt pressure, and limited per-I/O processing capacity within the SSD controller. As a result, achieving even a modest fraction of peak PCIe bandwidth under fine-grain I/O becomes a fundamental architectural challenge.This presentation introduces a set of multi-level I/O coalescing techniques in next-generation SSD controllers designed to address these bottlenecks. The proposed approach spans NVMe queue-level aggregation and command-level coalescing, substantially reducing PCIe transaction overhead per effective I/O while preserving host-visible fine-grain I/O semantics.
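The abstract above does not describe how its coalescing works, so the following is only a generic sketch of one possible command-level idea: merging LBA-contiguous 512 B commands into a single backend operation while remembering every original command ID, so each host command can still receive its own completion. The `Cmd` type, the `coalesce` function, and the 8-block (4 KiB) cap are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Cmd:
    cid: int      # host command identifier
    lba: int      # starting logical block, in 512 B units
    nblocks: int  # length in 512 B blocks


def coalesce(cmds, max_blocks=8):
    """Merge LBA-contiguous commands into backend ops of at most max_blocks.

    Each merged op keeps the IDs of the commands it absorbed, so the host
    still sees one completion per original fine-grain command.
    """
    ops = []
    for cmd in sorted(cmds, key=lambda c: c.lba):
        last = ops[-1] if ops else None
        if (last is not None
                and last["lba"] + last["nblocks"] == cmd.lba
                and last["nblocks"] + cmd.nblocks <= max_blocks):
            # Contiguous with the previous op and within the size cap: extend it.
            last["nblocks"] += cmd.nblocks
            last["cids"].append(cmd.cid)
        else:
            ops.append({"lba": cmd.lba, "nblocks": cmd.nblocks, "cids": [cmd.cid]})
    return ops


# Five small host commands; four of them touch one contiguous LBA range.
queue = [Cmd(1, 100, 1), Cmd(2, 101, 1), Cmd(3, 102, 2), Cmd(4, 500, 1), Cmd(5, 104, 1)]
ops = coalesce(queue)
for op in ops:
    print(op)
```

Here five host commands collapse into two backend operations. A real controller would also have to respect ordering and atomicity rules that this sketch ignores.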
As the scale and complexity of enterprise SSD deployments grow, log-based debugging and telemetry support have become increasingly important. Conventional SSD logging solutions are largely “passive”: functional modules rely on a shared logging resource and depend on best-effort servicing. As a result, missing critical log entries is common, forcing developers to rely on speculation to reconstruct behavior, which significantly impacts debug turnaround time.
This presentation first examines common logging approaches used in prior SSD designs, then introduces a novel framework from SanDisk Enterprise SSD development that promotes a “proactive” and “plan-ahead” philosophy. Each key SSD function or module owns a distinct logging identity and registers its requirements at the system level. These identities are grouped by feeding characteristics, and log events are collected and stored in isolation through the underlying debug infrastructure during the drive’s lifetime. This approach balances log collection across differing consumption rates and optimizes overall logging storage requirements. We believe the design can greatly improve the quality and robustness of the telemetry logs.
CXL-SUBLET is a key-value cache management system for multi-user LLM serving that keeps KV blocks as session state across GPU HBM and a shared CXL.mem pool. It uses session leases to move idle-session KV blocks from HBM to CXL.mem, and budgeted hydration to restore only selected blocks within a fixed resume-time budget, avoiding bulk transfer stalls and controlling CXL bandwidth. CXL-SUBLET maintains a CXL-resident directed acyclic graph (DAG) that stores shared prefix KV blocks once and reuses them at block granularity across users and replicas, while isolating each user’s subsequent tokens via copy-on-write. In our 7B FP16 setup, cache cost is 512KB/token (2GB per 4K-token session), enabling ~500 idle 4K-token sessions per TB of CXL.mem, roughly 12× the idle-session retention capacity of an 80GB HBM-only configuration (~40 sessions), while reserving HBM for active compute. Our evaluation is SLO-driven: we report max concurrent sessions per GPU while meeting resume targets (P95≤200ms, P99≤400ms), along with prefix hit rate, KV bytes restored per resume, and CXL data movement (total bytes moved), which together capture transfer overhead under the SLOs.
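The capacity figures quoted above can be checked with a few lines of arithmetic. The sketch below assumes a Llama-style 7B shape (32 layers, hidden size 4096), which the abstract does not state, and uses binary units; the abstract's rounded figures (~500 sessions, ~12×) fall out either way.

```python
# Assumed model shape (not given in the abstract): 32 layers, hidden size 4096.
LAYERS = 32
HIDDEN = 4096
BYTES_PER_VALUE = 2          # FP16
KV_TENSORS = 2               # one key and one value vector per layer

kv_bytes_per_token = LAYERS * KV_TENSORS * HIDDEN * BYTES_PER_VALUE
tokens_per_session = 4096    # "4K-token session"
session_bytes = kv_bytes_per_token * tokens_per_session

GIB = 1024 ** 3
TIB = 1024 ** 4


def sessions_that_fit(capacity_bytes):
    """Whole idle sessions that fit if capacity holds KV state only."""
    return capacity_bytes // session_bytes


cxl_sessions = sessions_that_fit(1 * TIB)
hbm_sessions = sessions_that_fit(80 * GIB)

print(kv_bytes_per_token // 1024, "KiB per token")
print(session_bytes // GIB, "GiB per session")
print(cxl_sessions, "sessions per TiB of CXL.mem")
print(hbm_sessions, "sessions in 80 GiB of HBM")
print(round(cxl_sessions / hbm_sessions, 1), "x retention ratio")
```

Note that the HBM-only figure treats all 80 GB as available for KV state; model weights and activations would reduce it further in practice.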
The AI-induced ramp in flash consumption has significantly increased the QLC footprint in data centers. While QLC provides much better read performance than HDDs, its write performance is significantly lower. Any write amplification, especially at high utilization, further degrades write I/O, bringing it very close to HDD levels.
FDP has been proposed in the past for reducing flash write amplification. It has proven very useful for QLC media, as any saving in WAF translates directly into precious write bandwidth for applications. The presentation will share recent data on the most effective ways to use FDP and how much WAF improvement to expect.
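The link between WAF and application write bandwidth can be written down directly. The sketch below uses the standard definition (bytes written to NAND divided by bytes written by the host); the bandwidth and WAF numbers are illustrative assumptions, not data from the talk.

```python
def write_amplification(nand_bytes_written, host_bytes_written):
    """WAF = bytes written to NAND / bytes written by the host."""
    return nand_bytes_written / host_bytes_written


def host_write_bandwidth(media_write_mb_s, waf):
    """Share of media write bandwidth left for host data at a given WAF."""
    return media_write_mb_s / waf


# Illustrative numbers only: a drive whose media sustains 1000 MB/s of writes.
MEDIA_MB_S = 1000.0
for waf in (1.0, 2.5, 4.0):
    print(f"WAF {waf}: {host_write_bandwidth(MEDIA_MB_S, waf):.0f} MB/s for the host")
```

Under this simple model, lowering WAF from 4.0 to 2.5 raises host-visible write bandwidth from 250 MB/s to 400 MB/s on the same media, which is why WAF savings matter most on write-limited QLC.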
In this work, we experimentally demonstrate that it is possible to generate true random numbers at high throughput and low latency in commercial off-the-shelf (COTS) DRAM chips by leveraging simultaneous multiple-row activation (SiMRA) via an extensive characterization of 96 DDR4 DRAM chips. We rigorously analyze SiMRA's true random generation potential in terms of entropy, latency, and throughput for varying numbers of simultaneously activated DRAM rows (i.e., 2, 4, 8, 16, and 32), data patterns, temperature levels, and spatial variations. Among our 11 key experimental observations, we highlight three key results. First, we evaluate the quality of our TRNG designs using the commonly used NIST statistical test suite for randomness and find that all SiMRA-based TRNG designs successfully pass each test. Second, 2-, 8-, 16-, and 32-row activation-based TRNG designs outperform the state-of-the-art DRAM-based TRNG in throughput by up to 1.15x, 1.99x, 1.82x, and 1.39x, respectively. Third, SiMRA's entropy tends to increase with the number of simultaneously activated DRAM rows.
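As a small illustration of what passing a NIST test means, here is a sketch of the frequency (monobit) test, the simplest test in the NIST SP 800-22 suite. It is not the authors' evaluation code; the 0.01 significance level is the suite's usual default and is treated as an assumption here.

```python
import math


def monobit_p_value(bits):
    """Frequency (monobit) test: are ones and zeros roughly balanced?"""
    n = len(bits)
    # Map 1 -> +1 and 0 -> -1, then sum.
    s = sum(1 if b else -1 for b in bits)
    s_obs = abs(s) / math.sqrt(n)
    return math.erfc(s_obs / math.sqrt(2))


def passes_monobit(bits, alpha=0.01):
    """A sequence passes when its p-value is at least alpha."""
    return monobit_p_value(bits) >= alpha


balanced = [0, 1] * 50   # perfectly balanced, so the p-value is 1.0
stuck_at_one = [1] * 100  # heavily biased, so the p-value is near zero
print(passes_monobit(balanced), passes_monobit(stuck_at_one))
```

The balanced sequence shows why a single test is not enough: it is fully predictable yet passes the monobit test, which is why the suite applies many different tests to the same output.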
Understanding computer memory architectures and tiers starts with analyzing the internal workings of the memory devices to understand how they store information, the challenges with maintaining that information, and how to get that data in and out of the attached system. As data rates have increased, new techniques have been developed to ensure data reliability and keep power under control.
The majority of memory is assembled onto carriers called DIMMs, which come in a variety of configurations based on the application. This training will compare and contrast the families of DIMMs.
Takeaways from this session:
In modern data centers, workloads are highly mixed and multi-tenant, yet mainstream enterprise SSDs remain overly generic: they passively accept host IO requests without fine-grained responses to host-side priorities or access patterns. Meanwhile, existing FDP mechanisms, though more flow-aware, require heavy protocol and software-stack changes on the host, making them hard to adopt at scale. NeoHint Storage/SSD aims to let drives “understand” host behavior and proactively optimize data layout and resource scheduling with minimal integration cost. It introduces a lightweight hint channel and a redesigned firmware architecture that jointly reshape data-flow and control-flow isolation. On the data side, it supports fine-grained placement across mixed media (SLC/TLC/QLC) and physical isolation at the NAND die/chip level; on the control side, it slices controller compute resources, queues and caches to deliver deterministic IO and sellable SLAs under mixed workloads. POC results show metadata throughput gains up to 3–3.5×, with user throughput and tail-latency improvements typically in the 20%–50% range, and over 50% in some scenarios.
Client SSDs are rapidly moving to QLC-based storage, driven by demand for higher capacities and lower costs. It is predicted that within the next few years, over 70% of client SSDs will be powered by QLC.
Currently, the higher latency and lower performance inherent to QLC are mitigated by one of four methods: SLC caching, SLC hybrid modes, host-based hints that can direct data to either the SLC or QLC tier, or a dedicated SLC namespace.
In this paper, we explore the tradeoffs between the methods, and introduce a new mechanism based on NVMe thin provisioning, which enables dynamic and reliable provisioning of SLC for performance-critical applications and page files, without requiring namespaces or a fixed allocation in advance of use. This innovation enables high-performance user experience on demand while retaining the capacity and cost advantages of QLC.
Large‑scale storage programs often rely on deterministic schedules shaped by manager judgment, creating bias and limiting accurate forecasting. As complexity grows across hardware, firmware, and manufacturing, this approach increases schedule risk, drives cost, delays revenue, and reduces alignment with customer expectations. Without a probabilistic method, organizations cannot quantify uncertainty or anticipate delays in long‑duration programs. Monte Carlo simulation replaces guesswork with a data‑driven model that uses best‑case, most‑likely, and worst‑case task durations through a PERT distribution, running thousands of iterations to produce probability‑based completion forecasts. This approach improves schedule accuracy, supports better decision‑making, and enables cross‑functional teams to commit to realistic, dynamic plans that reflect true variability in complex engineering environments.
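The Monte Carlo approach described above can be sketched in a few lines. This is an illustration under stated assumptions, not the presenters' model: the three tasks and their durations are made up, the tasks are assumed to run strictly in sequence, and the PERT distribution is modeled as a scaled beta distribution with the conventional shape weight of 4.

```python
import random


def pert_sample(rng, best, likely, worst):
    """Draw one duration from a PERT (scaled beta) distribution."""
    span = worst - best
    alpha = 1 + 4 * (likely - best) / span
    beta = 1 + 4 * (worst - likely) / span
    return best + span * rng.betavariate(alpha, beta)


def simulate(tasks, iterations=10_000, seed=0):
    """Return sorted total durations for sequential tasks."""
    rng = random.Random(seed)
    totals = [sum(pert_sample(rng, *task) for task in tasks)
              for _ in range(iterations)]
    totals.sort()
    return totals


def percentile(sorted_totals, q):
    """Simple nearest-rank style percentile on a sorted list."""
    return sorted_totals[int(q * (len(sorted_totals) - 1))]


# Hypothetical tasks: (best-case, most-likely, worst-case) durations in days.
TASKS = [(10, 15, 30), (20, 25, 45), (5, 8, 12)]
totals = simulate(TASKS)
p50, p80, p95 = (percentile(totals, q) for q in (0.50, 0.80, 0.95))
print(f"P50 {p50:.1f} d, P80 {p80:.1f} d, P95 {p95:.1f} d")
```

Summing the most-likely estimates alone gives 48 days, while the PERT mean of the same tasks is about 52.3 days, so even this toy example shows how right-skewed task estimates make a deterministic schedule optimistic. Committing to the P80 or P95 date rather than a single-point estimate is the kind of probability-based forecast the abstract refers to.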
As AI agents become increasingly autonomous and context-aware, they must manage dynamic memory, sensitive data, and tool execution — all while remaining secure and reliable. These “agentic” systems introduce new security risks, including memory poisoning, impersonation, context manipulation, and tool misuse. Traditional software-based defenses are insufficient, particularly when agents operate across distributed systems or interface with volatile memory and external APIs in real time.
This session introduces a novel, hardware-anchored security framework where X-PHY’s AI-embedded SSD acts as the foundational guardrail for the Model Context Protocol (MCP) — the structure that governs memory, context sharing, and decision-making in AI agents. We demonstrate how X-PHY’s firmware-based anomaly detection, immutable hardware identity, and real-time response mechanisms can enforce secure context transitions, validate agent provenance, and detect ransomware or data exfiltration attempts before they impact system integrity. Technical content will include MCP architecture patterns, SDK/API integrations with X-PHY, and a red-team-hardened agent design blueprint.
As block sizes in the underlying NAND media in SSDs increase to accommodate higher capacities, there are tradeoffs that significantly change the cost structure for implementing small versus large unit data placement strategies. These tradeoffs have both performance and cost implications that are complex due to changes in WAF, the SLC cache required, and data protection schemes.
In this presentation we examine the current techniques and requirements for small versus large unit data placement and apply them to assumed future NAND block sizes to understand the economics of leveraging the same techniques in the future.
This analysis should help inform the industry of potential changes needed in data placement infrastructure as drive capacities begin to exceed 256TB. This is important for beginning to align host data placement techniques with the data placement available in future SSDs.
As a “Composable Memory Ecosystem Driver,” this presentation outlines a set of modular contributions to promote broad utility and adoption. It uses a jigsaw puzzle model to define input and output (North, South, and sideways) APIs for various building blocks. It forms a Base Specification for an architectural blueprint of how various layers interconnect. It includes discovery and enumeration, CXL Fabric Manager, RAS module, Security module, Telemetry, Diagnostics, Dynamic Allocation Policy Manager, Memory Tiering, Memory Pooling, Memory Sharing “Objects,” Guest OS, and applications. For ease of collaboration and open contribution, it promotes independent building blocks. This model is extensible to scale-up memory fabrics using different underlying Physical, Link, and Protocol layers. Based on this Base Specification, we expect the OCP community to contribute several Design and Product Specifications to streamline market adoption of these fabric technologies.
As high‑speed memory interfaces such as DDR, LPDDR, and HBM become ubiquitous in modern SoCs, increasing I/O speeds, reduced interface voltages, and multiple power rails have made analog behavior a critical aspect of memory verification. While true analog simulation is prohibitively resource‑intensive, purely digital verification is insufficient to capture real device behavior. This work presents an approach for modeling key analog features and trainings within digital memory models to enable early detection of design issues and improve robustness of DDR PHY/memory controller designs. The methodology incorporates digital representations of analog effects such as signal drift, data eye timing, dynamic voltage and frequency scaling, on‑die termination (ODT), decision feedback equalization (DFE), ZQ calibration, signal strength, training algorithms, and temperature‑based derating, implemented within C/C++‑based verification IP using standard VPI/DPI interfaces. These models address limitations in specifications and HDLs by providing configurable, timing‑aware checks that expose critical issues, such as incorrect impedance settings or insufficient eye margin, missed by conventional model
Refreshment Break
Chair's Remarks
Chair's Remarks
Chair's Remarks
Chair's Remarks
Moderator and Panellists To Be Determined
Panellists to be determined.
Hyperscalers are increasingly deploying QLC flash in capacity-oriented tiers, leveraging software intelligence and workload segmentation to maximize efficiency at scale. In large fleet environments, QLC behavior is shaped less by device limitations and more by system-level orchestration, write management, and traffic smoothing. At scale, its performance profile stabilizes under controlled workloads, enabling predictable latency and cost efficiency. This reflects a broader trend where hyperscale architecture, rather than raw media characteristics, defines real-world flash behavior.
Kyungtae Kim is a Quality Engineer at SK hynix, specializing in SSD validation and reliability assessment. With extensive experience in enterprise SSD qualification, he has been evaluating SSD stability and performance in emerging datacenter environments. Recently, he has focused on the impact of alternative cooling solutions, such as liquid cooling and immersion cooling, on SSD reliability. Through comprehensive testing and analysis, he provides key insights into optimizing SSD performance for next-generation data centers.
Efficient memory utilization is critical for scalability and performance in HPC data centers and AI servers. Accurate memory monitoring serves as a foundational capability underpinning nearly all memory management and performance optimization frameworks. Recent revisions of the Compute Express Link (CXL) specification introduce the CXL Hot-range Monitoring Unit (CHMU), a standardized interface that enables detection of frequently accessed memory regions with minimal host performance overhead through configurable granularity, epoch control, and hotlist-based reporting.
In this presentation, we will first provide a concise architectural overview of CHMU, including its counter model, epoch configuration, and hotlist management mechanism. We will then introduce a systematic verification methodology that addresses complex configuration dependencies, corner-case analysis, overflow handling, and validation of each reporting mode. Through practical error scenarios and validation strategies, this session outlines a structured approach to ensuring accurate, robust, and standards-compliant hot memory monitoring in CXL systems.
The rapid growth of artificial intelligence (AI) workloads has created a critical need for high-performance, accelerator-optimized networking. Modern AI clusters generate intense traffic between GPUs and specialized accelerators, where latency, bandwidth, and congestion control directly affect scalability. The Ultra Ethernet Consortium (UEC) advances Ethernet into an AI-ready fabric designed for deterministic, low-latency performance. Building on this, Ultra Ethernet Transport (UET) enhances transport efficiency for accelerator-to-accelerator communication. In addition, ESUN and UALink support efficient scale-up within accelerator domains while preserving seamless Ethernet-based scale-out across larger infrastructures.
This paper explores why Ethernet is emerging as the strategic foundation for next-generation AI infrastructure and how these technologies collectively solve the accelerator connectivity challenge. It examines the performance limitations of conventional Ethernet in large-scale AI clusters and explains how UEC introduces architectural enhancements such as advanced congestion control, telemetry awareness, and deterministic behavior. The paper further analyzes
As AI workloads eat up the global memory chip supply, RAM pricing is surging and supply is tightening like never before—putting pressure on IT and engineering teams who are forced to choose between paying much more for their systems or suffering performance challenges. This talk explains why this market situation will only keep intensifying, and how MEXT Predictive Memory™ provides a timely, practical way to cut DRAM dependence and radically improve price-performance for memory-bound workloads.
In an era defined by AI acceleration and flash memory evolution, the demand for uncompromising, scalable data infrastructure has never been greater. Wiwynn and PEAK:AIO unveil a powerful collaboration: a next-generation, open architecture storage solution for AI and HPC environments.
At the hardware foundation, Wiwynn’s ultra-efficient platform provides a high-throughput, low-latency architecture that maximizes the bandwidth of modern flash. Layered on top is the PEAK:AIO Data Server and Open pNFS, an open-source and open standards-based storage stack, optimized to harness full RDMA throughput with low-latency, line-rate performance that scales linearly with the number of nodes.
Initial test results demonstrate full wire-speed performance, proving that commodity servers, when expertly engineered, can rival and even surpass legacy, high-cost storage appliances.
Trust is earned when systems behave predictably, failures are explainable, and user data remains protected. This presentation explores how telemetry functions as a core trust enabler by delivering human-readable diagnostics without exposing end-user data.
The session highlights how accessible, cross-platform telemetry reduces friction in issue resolution and enables a shift from reactive troubleshooting to proactive failure prediction. We will discuss approaches for pulling and decoding logs across environments and announce ongoing work to port NVMe CLI to Windows, aligning tooling with industry standards and improving consistency across platforms.
Finally, the talk outlines potential expansions to human-readable telemetry, including richer context, trend indicators, and actionable guidance. These enhancements aim to improve transparency, accelerate decision-making, and strengthen trust not only when failures occur but also by helping prevent them, making reliability the default experience.
CXL Memory Pooling and Memory Sharing introduce multi-host, Fabric-Manager-driven dynamic capacity management and, with it, verification challenges that traditional single-host, point-to-point testbenches cannot catch. Issues such as shared-region coherency hazards under concurrent host access, data integrity violations in shared memory, and incorrect handling of dynamic allocation transitions are subtle yet critical. These gaps can silently pass through simulation and surface only in silicon.
This paper presents a verification methodology designed to expose these hard-to-find bugs. We describe a deployment-realistic verification architecture for LD-FAM devices combining CXL Host, CXL Switch, and CXL Fabric Manager VIPs that mirrors actual multi-host data center topologies. We then walk through targeted verification scenarios: for Memory Pooling, dynamic partition allocation and de-allocation across hosts; for Memory Sharing, shared-region data integrity under concurrent multi-host access, rejection of invalid capacity requests, and coherency management flows that validate functional correctness beyond protocol compliance.
As AI datasets continue to scale exponentially, PCI Express® (PCIe®) technology serves as the high-bandwidth, low-latency connectivity backbone of advanced AI platforms. The requirements of AI clusters are shaping PCIe technology evolution as the primary interconnect for memory access between processing elements, such as CPUs, GPUs, or accelerators. PCIe technology offers developers the low-latency, scalability and backwards compatibility needed to support today’s compute-intensive AI applications.
This session will detail the features of the PCIe 8.0 specification, planned for release in 2028 and targeting 256.0 GT/s raw bit rate (up to 1 TB/s bi-directionally via a x16 configuration), and how PCI-SIG’s continued doubling of the data rate allows AI chipset vendors and AI accelerator developers to maintain a clear path for growth today and into the future. Attendees will also learn about the industry’s first standards-based PCIe optical solutions, and how they enable extended reach across racks and pods in AI, cloud and data center applications.
This session will update attendees on how the PCIe 8.0 specification will benefit diverse applications and support data-intensive markets.
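The "up to 1 TB/s" figure follows from the raw bit rate and lane count. The short calculation below is a sketch that deliberately ignores encoding, FLIT, and protocol overhead, so it gives raw rather than delivered bandwidth; the helper name is illustrative.

```python
# Raw per-lane transfer rates in GT/s for recent PCIe generations.
RAW_GT_PER_S = {"6.0": 64, "7.0": 128, "8.0": 256}


def raw_gbytes_per_s(gt_per_s, lanes=16):
    """Raw one-direction bandwidth in GB/s: one bit per transfer per lane."""
    return gt_per_s * lanes / 8


for gen, rate in RAW_GT_PER_S.items():
    one_way = raw_gbytes_per_s(rate)
    print(f"PCIe {gen} x16: {one_way:.0f} GB/s per direction, "
          f"{2 * one_way:.0f} GB/s bidirectional")
```

For PCIe 8.0 this gives 512 GB/s per direction and 1024 GB/s in both directions combined, which is the roughly 1 TB/s quoted above.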
As LLM context windows scale to millions of tokens, the KV cache working set outgrows GPU HBM and spills to storage. Most current deployments treat this as a memory hierarchy problem and reach for DRAM or NVMe as a drop-in buffer — but this approach leaves most of the performance available in modern flash devices on the table. Fundamentally, this is a tiering problem: as the KV cache working set exceeds HBM capacity, the system must tier data across HBM, DRAM, and flash — and the efficiency of that tiering determines end-to-end inference latency at scale.
This talk examines the mismatch between how inference engines access KV cache — fine-grained, latency-sensitive, highly random at the page level but with exploitable locality at the sequence level — and how commodity NVMe is typically driven. We explore a set of optimisations that close this gap: access pattern reshaping to align with flash, optimising the network data path, parallelising across NVMe and reducing I/O amplification through smarter KV cache eviction policies.
We present results using common AI inference frameworks and growing use cases, and outline what a flash-native KV cache fabric should look like in practice.
As we build huge AI clusters spanning multiple cities and several exabytes of storage, managing IO capacity becomes an impossibly complex task. Workloads vary from hundreds of millions of small reads of a few KiB to hundreds of thousands of huge write bursts of several megabytes. Furthermore, media like QLC have imbalanced read-to-write ratios, which makes it even harder to represent I/O uniformly. Several AI teams actively share the same storage clusters, often pushing their limits on both space and IO, which then requires the storage clusters to grow continuously, often leading to an ongoing imbalance between space and IO.
Meta has been operating at the forefront of AI research, leading innovations not just in AI but also in systems and storage design to serve growing AI research needs. Storage clusters at Meta have grown to operate at tens-of-exabytes scale with heterogeneous hardware across both TLC and QLC flash. This presentation will dive into the details of uniform representation of IO capacity and capacity modeling, overload protection, and multi-tenancy at scale.
We take a technical view of strategies for protecting large-capacity SSDs and maximizing their usable lifespan. Large-capacity SSDs may soon approach petabyte capacity. This presentation looks at managing die failures with Depopulation and overall protection strategies. We compare HDD Depopulation, where failed platters represent a contiguous LBA space, with SSDs and their FTL, where a die failure may be scattered across an LBA range. Depopulation is standardized and already used with HDDs. We explore SSD recovery methods, looking for a sensible way to depopulate portions of an SSD. Recovery aspects include reduced time to full protection and smaller depopulations.
As CXL transitions from early adoption into broader deployment, the industry lacks a consistent framework for evaluating device readiness. Without shared validation criteria, the ‘out-of-box’ experience for CXL devices remains ill-defined, with customers forced to establish their own methodologies, complicating interoperability assessments and slowing ecosystem maturation.
In this talk, we evaluate the XCENA MX1 CXL device with the OCP CMSBench workload suite. CMSBench is a (hardware and software) community-driven effort to establish a reproducible, vendor-neutral evaluation platform for native and pooled memory systems (like CXL) via a common, instrumented set of evaluation criteria. We present results from the workload suite across a range of host and device configurations, sharing test logs, validated hardware and software Bills of Materials (BoMs), and configuration guidance developed through this process.
Attendees will leave this talk with two major takeaways: first, that the MX1 is a production-ready CXL device tested over a wide range of scenarios, and second, that CMSBench is well suited to serve as a community-wide baseline for CXL device validation.
Unordered I/O (UIO) in PCI Express introduces a paradigm shift from traditional strong ordering, enabling higher bandwidth and lower latency for demanding applications. While UIO offers significant performance gains, it also presents unique verification and integration challenges.
This presentation delves into essential verification strategies for UIO, such as managing split completions, handling mixed PR-FC/NPR-FC traffic, configuring Virtual Channels (VCs), and ensuring secure operation with Integrity and Data Encryption (IDE). System-level considerations are addressed, including the risks of mixing UIO and non-UIO flows, particularly in peer-to-peer and multi-link environments.
Attendees will learn about the benefits and complexities of UIO, discover practical verification techniques, and receive actionable guidelines for integrating UIO into next-generation PCIe systems, ensuring robust and efficient adoption of this advanced feature.
As Large Language Models (LLMs) push context windows into the millions of tokens and serve growing numbers of concurrent users, memory capacity has become a dominant constraint for scalable inference. This talk presents findings demonstrating how larger‑capacity LPDDR, when paired with HBM in unified memory architectures such as NVIDIA’s GH200 systems, enables efficient key‑value (KV) cache offload. The result is a substantial improvement in inference scalability and responsiveness, driven by LPDDR’s combination of large capacity, low power, and high efficiency.
The talk will present quantitative results highlighting how LPDDR‑backed KV‑cache offload increases achievable context length, boosts concurrent user throughput, and supports a larger number of simultaneous clients, all while benefiting from LPDDR’s inherently superior energy efficiency.
AI data centers are shifting from GPU-centric compute scaling to memory-centric system design. The rise of long-context models, persistent agents, and reinforcement-learning workflows is reshaping infrastructure requirements. Inference is no longer a stateless, token-by-token task—it is a distributed, multi-tier memory problem spanning HBM, host DRAM, NVMe, and network fabrics. Innovations such as disaggregated prefill/decode, KV cache offload, LMCache, and NVIDIA’s Dynamo/SCADA architecture signal a broader shift toward GPU-initiated storage and hierarchical memory orchestration. This session explores how product leaders and architects must rethink memory, storage, and network co-design to support agentic, persistent, and diverse AI communication workloads at scale.
Artificial‑intelligence workloads are rapidly migrating to smartphones, laptops, and other edge devices, and LPDDR6 will provide the high‑speed, low‑power memory needed for today’s models. To meet the ever‑growing demand for bandwidth, Processing‑In‑Memory (PIM) extends LPDDR6 by embedding processing units that specialize in GEMV (General Matrix Vector Multiplication) next to DRAM banks. This reduces data movement and delivers both higher performance and better energy efficiency than an NPU‑only solution.
The session will illustrate the performance and energy‑efficiency advantages of PIM compared with traditional methods and introduce the industry partners (SoC and memory vendors) involved.
Attendees will gain a clear understanding of how PIM, by offering higher memory bandwidth, can serve as an effective solution for the next generation of on-device AI.
At scale, storage failures are inevitable; outages are not. This talk focuses on architectural approaches that reduce operational disruption when drive failures occur, particularly in hyperscale environments supporting mission-critical workloads. We will discuss techniques for fault isolation, background rebuilds, and intelligent data management that allow systems to recover from device or component failures without service interruption or excessive traffic across the network.
Write amplification is the fundamental enemy of SSD performance and reliability. The SSD industry has offered several solutions for managing write amplification through data placement initiatives such as Zoned Namespaces and Flexible Data Placement. These technologies work well for tightly integrated infrastructure. Nonetheless, many application developers cannot rely on the availability of these advanced capabilities. This talk will review the fundamental origins of write amplification and discuss software techniques for managing it that account for SSD internals but do not rely on anything beyond mandatory IO command sets.
Processing‑Near‑Memory (PNM) computing mitigates memory‑bandwidth constraints in heterogeneous systems by attaching a CXL‑enabled PNM accelerator that offloads the vector‑similarity search of Retrieval‑Augmented Generation (RAG) pipelines. The design, implemented on an Intel Agilex 7 I‑Series FPGA‑SoC with a quad‑core ARM Cortex‑A53 CPU, DDR4 memory, and a CXL 2.0 (Gen‑4) interface, follows a dual‑scope execution model: a host‑resident orchestration kernel performs coarse index partitioning, while device‑resident fine‑search kernels execute highly vectorized, memory‑bound inner‑product/L2‑distance calculations directly on the CXL PNM device. This approach leverages a FAISS configuration adapted to the CXL PNM hardware and on-device vector reads to compute the similarity search. Analytical evaluation on representative RAG workloads predicts a 3.32× speedup over CPU + CXL memory‑expander baselines and confirms a 100% F1‑Score for nearest‑vector retrieval, validating the CXL‑based PNM micro‑architecture and its dual‑scope offload strategy for scalable acceleration of memory‑intensive RAG retrieval tasks.
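The dual-scope split described above can be illustrated with a toy two-stage search. This is a plain-Python sketch of the general coarse-then-fine pattern, not the authors' kernels and not the FAISS API; the function names, the tiny dataset, and the `nprobe` parameter are all illustrative assumptions.

```python
def l2_sq(a, b):
    """Squared L2 distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def coarse_select(query, centroids, nprobe):
    """Host-side step: pick the nprobe partitions nearest to the query."""
    order = sorted(range(len(centroids)),
                   key=lambda i: l2_sq(query, centroids[i]))
    return order[:nprobe]


def fine_search(query, partition, k):
    """Device-side step: exact distance scan inside one partition."""
    scored = sorted((l2_sq(query, vec), vid) for vid, vec in partition)
    return scored[:k]


def search(query, centroids, partitions, nprobe=1, k=1):
    """Combine both scopes and return the ids of the k nearest vectors."""
    candidates = []
    for p in coarse_select(query, centroids, nprobe):
        candidates.extend(fine_search(query, partitions[p], k))
    return [vid for _, vid in sorted(candidates)[:k]]


# Two partitions, each a list of (vector id, vector) pairs.
centroids = [(0.0, 0.0), (10.0, 10.0)]
partitions = [
    [(0, (0.0, 1.0)), (1, (1.0, 0.0))],
    [(2, (9.0, 10.0)), (3, (11.0, 11.0))],
]
print(search((9.5, 10.0), centroids, partitions))
```

Only the fine scan touches the full vectors, which is the memory-bound part that benefits from running next to the data; the coarse step works on a handful of centroids and stays on the host.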
This session walks through the architecture and implementation of Windows' new NVMe over Fabrics initiator, covering TCP and RDMA transport support.
We'll cover the design and architecture of how the initiator fits within the modern Native NVMe storage stack to achieve near-local performance, how ANA-based multipathing is implemented, and how the initiator surfaces through PowerShell for discovery and management. We'll also cover interop considerations with existing NVMe-oF targets and what storage developers need to know to validate and optimize their solutions against the Windows initiator.
The session covers the market opportunities for NL-SSDs, what the ecosystem requirements are likely to be, standards, TCO overview, and data protection strategies. Requirements differences compared to mainstream SSD focus on power and NL I/O patterns. Compared to HDD, advantages include significant power savings, higher density per rack unit, and improved performance. Projected capacities represent a large increase compared to existing drives in this space making it necessary to look at data protection strategies.
Refreshment Break
Session Reserved for NEO Semiconductor
Session Reserved for Sandisk Technologies, Inc.
Exhibition opens in the main exhibition halls.
Hyatt Regency Hallway, Mission City Ballroom Lobby
Session Reserved for Microchip Technology
Session Reserved for Longsys
Session Reserved for Marvell Technologies
Chair's Remarks
Chair's Remarks
Speakers to be determined.
Chair's Remarks
Chair's Remarks
Chair's Remarks
The AI ecosystem has successfully scaled past what can be supported with DRAM. We will show that inference performance can be greatly improved and the DRAM footprint reduced with the introduction of a large flash tier. The presentation will highlight the benefits and explain the experiments used to validate this claim. The advantages apply to client solutions all the way up to full enterprise deployments.
DAOS (Distributed Asynchronous Object Storage) is an open source scale-out storage system that is designed to support massively distributed NVMe storage in user space. It is a key component of the Aurora exascale system, delivering high storage throughput and low latency to application users through both traditional filesystem interfaces and powerful key-value based APIs for dataset management. The HPE Cray Supercomputing Storage System K3000 embeds DAOS 2.8 software running on HPE ProLiant DL360 Gen12 servers into a factory-tested and fully integrated storage system. This session highlights the hardware design choices to maximize the achievable I/O performance of the K3000 solution, and it analyzes K3000 bandwidth and IOPS performance on 400 Gbps fabrics for various CPU and NVMe configurations. It also discusses the available options to control power consumption on the node level.
AI-driven data centers impose extremely strict Quality of Service (QoS) and performance requirements on storage systems, creating a significant challenge for SSD vendors tasked with optimizing diverse product portfolios across multiple hyperscalers, each with distinct workloads, hardware configurations, and power constraints. To address this complexity, we present a reinforcement-learning-based mechanism that autonomously optimizes SSD performance and QoS using a Deep Q-Network (DQN). The system iteratively observes the current SSD firmware parameter configuration, proposes targeted adjustments, and applies them before executing representative workloads. We will present how Athena operates in practice and the tangible performance and QoS improvements seen on real SSD hardware: manual SSD tuning time was reduced by more than 50%, read latency improved by up to 67%, and random mixed workloads saw as much as 19% higher IOPS. The discussion will also cover why reinforcement learning is well suited to SSD tuning, what tunable parameters exist, and a forward-looking view on integrating this capability directly into the SSD controller for scalable deployment.
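The observe, propose, apply, and measure loop described above can be sketched with a much simpler learner. The example below uses tabular Q-learning as a stand-in for the Deep Q-Network named in the abstract, a single hypothetical firmware knob with five settings, and a made-up latency table in place of a real workload run; none of these names or numbers come from the talk.

```python
import random

# Hypothetical setup: one firmware knob with five settings (index 0..4) and a
# made-up latency model standing in for "run a representative workload".
LATENCY_US = [10.0, 5.0, 2.0, 1.0, 2.0]
ACTIONS = (-1, 0, +1)  # lower the knob, keep it, raise it

def apply_action(setting, action):
    """Apply an adjustment, clamp to the valid range, and return (new_setting, reward)."""
    new_setting = min(max(setting + action, 0), len(LATENCY_US) - 1)
    return new_setting, -LATENCY_US[new_setting]  # lower latency means higher reward

def train(episodes=400, steps=10, alpha=0.5, gamma=0.5, epsilon=0.3, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(len(LATENCY_US)) for a in ACTIONS}
    for _ in range(episodes):
        setting = rng.randrange(len(LATENCY_US))   # observe the current configuration
        for _ in range(steps):
            if rng.random() < epsilon:             # explore a random adjustment
                action = rng.choice(ACTIONS)
            else:                                  # exploit the best known adjustment
                action = max(ACTIONS, key=lambda a: q[(setting, a)])
            new_setting, reward = apply_action(setting, action)
            best_next = max(q[(new_setting, a)] for a in ACTIONS)
            q[(setting, action)] += alpha * (reward + gamma * best_next - q[(setting, action)])
            setting = new_setting
    return q

def tune(q, setting, steps=10):
    """Follow the learned policy greedily from a starting configuration."""
    for _ in range(steps):
        action = max(ACTIONS, key=lambda a: q[(setting, a)])
        setting, _ = apply_action(setting, action)
    return setting

q_table = train()
tuned_setting = tune(q_table, setting=0)
```

A DQN replaces the lookup table with a neural network so that the same loop can scale to many interacting parameters, where a table would be too large to fill.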
NVMe technology has become the language of storage and is now synonymous with high-performance storage, with widespread adoption in client, cloud, enterprise and even AI applications. Although initially developed for direct-attached PCIe® SSDs, NVMe architecture is now widely used in both direct-attached and fabric-attached applications.
This presentation provides an overview of the NVMe standards roadmap and reviews the newest NVMe features like NVM Subsystem Migration, Quality of Service and more. Finally, we will review how NVMe technology will support emerging applications like AI in addition to how it continues to support Cloud and Enterprise Applications.
A new type of device called a quantum memory is being researched for use in quantum communications and other applications. It is completely different from the memory devices we are used to because it stores qubits instead of classical bits. This presentation will provide an overview of the technology and physics used to build it, an update on the current development status, and a description of the potential application areas where these can be used in the future.
Processing-using-DRAM (PuD) is a promising paradigm for alleviating the data movement bottleneck using DRAM's massive internal parallelism and bandwidth to execute very wide operations. In this paper, we present the first characterization study of read disturbance effects of multiple-row activation-based PuD (which we call PuDHammer) using 316 real DDR4 DRAM chips from four major DRAM manufacturers. Our detailed characterization shows that 1) PuDHammer significantly exacerbates the read disturbance vulnerability, causing up to 158.58x reduction in the minimum hammer count required to induce the first bitflip (HCfirst), compared to RowHammer, 2) PuDHammer is affected by various operational conditions and parameters, 3) combining RowHammer with PuDHammer is more effective than using RowHammer alone to induce read disturbance errors, e.g., doing so reduces HCfirst by 1.66x on average, and 4) PuDHammer bypasses an in-DRAM RowHammer mitigation mechanism and induces more bitflips than RowHammer. To develop future robust PuD-enabled systems in the presence of PuDHammer, we adapt and evaluate the state-of-the-art RowHammer mitigation standardized by industry, called PRAC.
In high-density QLC deployments, maintaining optimal read thresholds becomes increasingly challenging as NAND flash technology scales to higher densities and 3D architectures, because QLC has narrow voltage margins and heightened sensitivity to temperature variation and wear-induced drift. Traditional static threshold schemes fail to adapt to these dynamic conditions. This work introduces Adaptive Read Thresholds (ART), an AI-driven solution that uses compact machine learning models to predict optimal thresholds in real time. ART combines offline training with online inference using Gradient Boosting Trees built from binary symmetrical tree models, achieving per-prediction latency on the order of tens of nanoseconds. Large-scale experiments on SanDisk SSDs demonstrate significant reductions in bit error rates and improved endurance compared to legacy Table & Tagging methods. ART’s lightweight architecture enables ASIC integration, paving the way for state-of-the-art next-generation QLC storage systems with enhanced reliability and performance through real-time inference.
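One reason symmetric trees suit hardware is that every node at a given depth shares the same split, so inference reduces to a few comparisons that build a leaf index. The sketch below assumes that "binary symmetrical tree" means what some gradient boosting libraries call an oblivious tree; the feature list, thresholds, and leaf values (read-threshold offsets in arbitrary steps) are invented for illustration and are not taken from ART.

```python
def predict_oblivious_tree(features, splits, leaf_values):
    """Evaluate one symmetric tree: all nodes at depth d share splits[d]."""
    leaf_index = 0
    for feature_id, threshold in splits:
        bit = 1 if features[feature_id] > threshold else 0
        leaf_index = (leaf_index << 1) | bit   # one comparison per level, no branching on node identity
    return leaf_values[leaf_index]

def predict_ensemble(features, trees, base=0.0):
    """Sum the outputs of all trees, as gradient boosting does at inference time."""
    return base + sum(predict_oblivious_tree(features, s, v) for s, v in trees)

# Hypothetical features: [temperature in C, program/erase cycles, retention days]
trees = [
    ([(0, 50.0), (1, 3000.0)], [0.0, -2.0, -1.0, -4.0]),
    ([(2, 30.0), (0, 70.0)], [0.0, -1.0, -3.0, -5.0]),
]

offset_hot_worn = predict_ensemble([65.0, 4500.0, 10.0], trees)
offset_cool_aged = predict_ensemble([40.0, 1000.0, 60.0], trees)
```

A tree of depth d needs d comparisons and one table lookup regardless of the input, which gives a fixed, short latency that maps well to an ASIC datapath.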
Top-tier IO500 results are typically associated with purpose-built appliances and large, complex deployments. This talk challenges that assumption through the Helma supercomputer at Friedrich-Alexander-Universität (FAU): a production environment that reached #3 on the IO500 Production list using commodity PCIe Gen5 NVMe drives and "cluster-in-a-box" servers, delivering strong performance across a ~5 PB Lustre-based cluster with a comparatively small hardware footprint. We'll walk through the tuning and validation methodology behind the result: BIOS choices, kernel/OS parameters affecting PCIe and NVMe throughput, and Lustre 2.16.1 configuration for both bandwidth and metadata performance. Outcomes include ~1.8 TB/s sequential reads, 800+ GB/s sequential writes, and ~8.2M metadata stat ops/s. Attendees will leave with a replicable checklist for building efficient, high-performing production storage on broadly available hardware and open-source software, plus an honest account of trade-offs and what we'd do differently.
Modern SSD controllers employ a wide range of power management techniques, including clock and activity control, dynamic voltage and frequency scaling, power gating, and adaptive voltage optimization. These techniques are highly effective and essential, and in practice, achieving optimal power efficiency requires applying all of them aggressively. However, as NAND flash operations dominate system power consumption, especially under high parallelism, precise control of NAND power becomes increasingly critical and difficult to achieve using local, reactive mechanisms alone. This talk presents a system-level power shaping approach centered on NAND power control through a centralized power budget engine. In this model, baseline I/O-related power is treated as a steady background load, while array activity power, driven by read and program operations, is explicitly managed as a constrained resource. By shaping how and when NAND operations consume this budget, the controller can regulate total power behavior without interfering with existing circuit- and block-level power-saving techniques.
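A minimal sketch of what a centralized budget could look like, assuming a simple credit scheme: each NAND operation declares a power cost, starts only if the cost fits in the remaining array-activity budget, and is otherwise deferred in arrival order. The class name, the milliwatt costs, and the FIFO policy are assumptions for illustration, not the engine presented in the talk.

```python
from collections import deque

class PowerBudgetEngine:
    """Toy central budget for array-activity power (hypothetical units: mW)."""

    def __init__(self, budget_mw):
        self.budget_mw = budget_mw
        self.in_use_mw = 0
        self.waiting = deque()   # deferred (op_id, cost_mw) pairs, oldest first

    def submit(self, op_id, cost_mw):
        """Start the operation if it fits in the remaining budget, otherwise defer it."""
        if not self.waiting and self.in_use_mw + cost_mw <= self.budget_mw:
            self.in_use_mw += cost_mw
            return True
        self.waiting.append((op_id, cost_mw))
        return False

    def complete(self, cost_mw):
        """Return budget and start deferred operations in FIFO order while they fit."""
        self.in_use_mw -= cost_mw
        started = []
        while self.waiting and self.in_use_mw + self.waiting[0][1] <= self.budget_mw:
            op_id, cost = self.waiting.popleft()
            self.in_use_mw += cost
            started.append(op_id)
        return started

engine = PowerBudgetEngine(budget_mw=1000)
first = engine.submit("program-0", 600)    # fits, starts now
second = engine.submit("program-1", 600)   # would exceed the budget, deferred
third = engine.submit("read-0", 200)       # queued behind program-1 to keep FIFO order
resumed = engine.complete(600)             # program-0 finishes and frees its share
```

Because admission is decided in one place, the peak array power never exceeds the configured budget, whatever the individual dies or channels request.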
This presentation will discuss TP4184, the soon-to-be-ratified Host Addressable SLM NVMe feature. Attendees will learn how the feature works and the details of the specification. The presentation will include example use cases for this feature and the benefits that can be achieved.
This session explores how NVM Subsystems can be fully virtualized using Exported NVM Subsystem Templates. Combined with the Exported NVM Subsystem capability, this flexible concept allows a host to fully control the representation and behavior of controllers and namespaces within an NVM Subsystem. This mechanism enables Virtual Machine Managers and Hypervisors to relinquish control over the Admin Queue, paving the way for improved performance and Confidential Computing.
Quantum computing is moving into data centers, but its memory stack looks fundamentally different. Quantum memories operate at telecom wavelengths, interface with photonic interconnects, and require cryogenic or hybrid packaging environments. Startups are developing telecom-integrated quantum memory modules designed for long-coherence storage and quantum networking, not byte-addressable DRAM semantics.
Should classical memory and storage providers care? Can DRAM, NAND, MRAM, and CXL vendors extend into quantum control, buffering, or hybrid integration, or is this an entirely new materials and fab ecosystem?
This panel debates whether quantum memory becomes a niche scientific layer or a parallel memory hierarchy with its own fabs, packaging, and telecom integration requirements. As AI and quantum converge, the question is strategic: adapt, partner, or risk irrelevance?
Source-synchronous I/O buffers must determine optimal sampling delays to capture strobe-sampled data reliably. Current training algorithms depend on brute-force exhaustive sweeps that offer no analytical insight. This paper derives an intuitive parameterized template for computing the optimal sample-point delay from three timing components: duty cycle centering, channel skew, and receiver offset. The formula uses technology-agnostic symbols and applies to any source-synchronous interface used in typical DRAM-based memory subsystems (DDR5, LPDDR, GDDR, HBM) by substituting the relevant specification parameters. We also present validation examples on DDR5 Data Buffer configurations from DDR5-3200 through DDR5-12800 that demonstrate accuracy within 2 delay units of empirically trained values.
Beyond its theoretical contribution, the framework offers immediate practical value: verification and design engineers can directly program the computed delay to bypass training entirely in simulation or first-silicon bring-up, or use it as a training seed that reduces sweep range by more than half and convergence time by ∼60%.
Benchmarking and characterization of storage for AI continues to be a challenge across the industry. There are broadly available tools for executing benchmarks and a broad array of workload definitions. The problem we face is understanding which workload is important to customers, integrators, and product teams.
To address some of these challenges, SNIA has launched a new Technical Working Group (TWG) -- the AI Data Workloads TWG. This TWG was developed to provide definitions of AI storage workloads and the associated SNIA software to run a workload synthetically. This will enable standardization of workload definitions for suppliers, developers, and architects who are designing the next generation of AI Data Centers.
Attendees will leave this session with an understanding of the AI Data Workloads charter, how they can use the content produced by the TWG, and how to become involved in the TWG.
HDDs continue to be the backbone of data‑center storage, holding roughly 70% of all deployed data today, while SSDs store around 20% and tape the remainder. With global data creation growing at more than 25% annually through 2030, HDDs will remain the only economically scalable capacity tier. However, the rapid increase in drive density is widening the gap between capacity and performance. While most drives shipped today are under 30 TB, the industry is targeting 100 TB by the end of the decade. Yet sustained throughput remains near 200 MB/s, meaning a full‑drive read can exceed six days. Capacity per drive is rising quickly, but bandwidth per terabyte, IOPS per terabyte, and even power per terabyte continue to trend downward. This talk explores what architectural changes are required to keep HDDs viable as multi‑petabyte building blocks. It will introduce how new forms of internal parallelism under development at Marvell, together with advances in ML/DSP‑based signal processing, can offset the growing performance deficit, and how these techniques reshape the design space for next‑generation HDD‑based storage systems.
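The capacity versus bandwidth gap can be checked with simple arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a constant sustained rate; the helper names are illustrative. At exactly 200 MB/s a 100 TB drive takes just under six days to read end to end, and any sustained rate below roughly 193 MB/s pushes it past six.

```python
SECONDS_PER_DAY = 86400

def full_read_days(capacity_tb, throughput_mb_s):
    """Days needed to read the whole drive once at a constant sustained rate."""
    seconds = (capacity_tb * 1e12) / (throughput_mb_s * 1e6)
    return seconds / SECONDS_PER_DAY

def bandwidth_per_tb(capacity_tb, throughput_mb_s):
    """Sustained MB/s available per terabyte of capacity."""
    return throughput_mb_s / capacity_tb

days_30tb = full_read_days(30, 200)      # about 1.7 days
days_100tb = full_read_days(100, 200)    # about 5.8 days
mbps_per_tb_30 = bandwidth_per_tb(30, 200)
mbps_per_tb_100 = bandwidth_per_tb(100, 200)
```

With throughput held flat, bandwidth per terabyte falls from about 6.7 MB/s at 30 TB to 2 MB/s at 100 TB, which is the downward trend the abstract refers to.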
The storage industry has long anticipated the moment when NAND flash could dethrone Nearline HDDs for large‑scale capacity storage, yet the crossover always seems just out of reach. Increasing flash density is one of the strongest accelerators: the higher the share of the BOM devoted to QLC bit cells, the more competitive the cost becomes. This raises a key question: how can mature EDSFF form factors help enable higher density? While the EDSFF E3 family is well established for enterprise SSDs, the E3.L 2T variant remains underused, typically reserved for computational storage with large FPGAs and heavy thermal requirements. But if the full 2T height were dedicated to maximizing QLC NAND volume, it could enable ultra‑dense flash devices with better performance‑per‑watt, reduced datacenter footprint, no vibration sensitivity, and improved endurance relative to Nearline HDDs. This session explores the architectural and datacenter‑level implications of using E3.L 2T for high‑capacity QLC and how dense flash can become a cost‑effective addition to modern datacenters.
Given the fundamental physical challenges in quantum computing that result in frequent and unavoidable errors, significant effort is being put into fault-tolerant quantum computing architectures that enable the execution of complex, long-running, error-free computations. At its core, Quantum Error Correction (QEC), which goes beyond traditional error mitigation, has been introduced to address the high error rates. There is a wide range of error sources, including bit or phase flips, noise, leakage, gate errors, hardware imperfections, and initialization or measurement errors. From a quantum memory perspective, errors can happen on idle data or check qubits. They need to be detected and corrected while instructions are being executed.
In this talk we analyze the major similarities to and differences from error correction in traditional storage and memory devices and give an overview of state-of-the-art QEC algorithms with a focus on a family of low-density parity-check codes. For these codes, about an order of magnitude more physical qubits are required than logical ones. Finally, we give an overview of the tools and infrastructure needed to evaluate the performance of such codes.
LPDDR6 is a breakout DRAM whose advanced features (Efficiency Mode, Meta Data on Data Bus, X6 Mode, System Meta Mode including carved-out memory, PRAC, and dynamic frequency scaling, to name a few) create complex verification challenges beyond traditional command, data, timing, register, and DRAM state machine coverage. Feature interactions and configuration-dependent behaviors generate exponential scenario spaces that conventional approaches inadequately address. This paper presents a feature-centric coverage framework that employs randomization to generate targeted bins across mode transitions, operating speeds, bus widths, and density variations. The framework uses in-house AI tools to parse specifications and detect coverage gaps, which then drive targeted testcase creation. This methodology significantly improves coverage closure and reduces manual effort, providing a scalable solution for validating complex next-generation memory protocols.
Large language model inference is increasingly constrained by per-session KV cache growth in multi-turn and long-context workloads. Many KV-offload approaches treat SSD as a generic spillover tier and optimize for averages, leading to unpredictable tail latency, rehydration read amplification from fragmentation, SSD-unaware packing that turns continuation into many small reads, and read/write interference that destabilizes QoS. We present a storage-centric study of KV-cache persistence and rehydration using a behavioral simulator that models a batched inference pipeline, captures host-staging backpressure, and makes layout/indexing first-class in read planning. We compare request-end flushing vs token-streaming persistence and alternative packing/placement policies that shape the SSD I/O stream. We translate I/O and latency results into SSD requirements (bandwidth, mixed-workload QoS, endurance) and deliver quantified rehydration costs plus actionable KV layout guidelines to reduce fragmentation and stabilize tail latency. Attendees will leave with an actionable framework for reasoning about KV offload tradeoffs and storage design priorities for LLM inference.
Heat Assisted Magnetic Recording (HAMR) marks the next step in magnetic recording as capacities move beyond 30TB and continue to grow. While density gains are well understood, operating HAMR drives in production introduces new qualification, reliability and fleet management considerations. Localized heating affects media behavior, write stability and failure patterns, requiring adjustments in characterization, burn-in strategy, stress validation and telemetry thresholds.
This session shares practical lessons from qualifying and deploying HAMR drives in production environments and the framework used to prepare additional vendors for fleet introduction. We will discuss how reliability modeling, rebuild dynamics and telemetry interpretation evolved and how drive level signals such as FARM/FACT telemetry logs and other vendor diagnostic logs are used to identify emerging failure patterns. The focus is on operational reality and ensuring that higher capacity translates into sustainable fleet scale deployment.
Enterprise SSDs must meet extremely stringent data integrity targets (e.g., UBER < 10⁻¹⁸) under compounded stress conditions (end-of-life cycling, power on/off data retention, temperature variation, and disturb effects) as defined in OCP 2.7. Under these conditions, elevated bit error rates significantly erode the narrow reliability margin of QLC-based drives, often requiring costly excess ECC over-provisioning. Current enterprise SSDs rely on two independent protection mechanisms: LDPC codes for random error correction and XOR-based erasure coding for memory defects, each consuming dedicated redundancy. This paper introduces Joint LDPC & XOR Decoding (JLX), a unified approach that exploits XOR redundancy to generate auxiliary soft information for LDPC decoding, substantially boosting its effectiveness. Simulation and analytical results show that JLX extends the correctable BER range by up to 2.5× without increasing total redundancy, enabling compliance with demanding enterprise reliability requirements under aggressive stress stacking. JLX offers a cost-efficient path toward robust QLC adoption with minimal over-provisioning.
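To give a sense of how an XOR stripe can yield soft information, the sketch below applies the textbook single-parity-check update: if the bits at one position across a stripe (parity page included) must XOR to zero, the other bits' log-likelihood ratios imply an extrinsic LLR for the remaining bit, which can be added to its channel LLR before LDPC decoding. This is a generic illustration, not the JLX algorithm itself; the function names, the clipping value, and the convention that a positive LLR favors bit 0 are assumptions.

```python
from math import atanh, tanh

def xor_extrinsic_llr(other_llrs, clip=20.0):
    """Extrinsic LLR for one bit from the other bits of an even-parity XOR stripe."""
    product = 1.0
    for llr in other_llrs:
        product *= tanh(llr / 2.0)
    # Keep the product strictly inside (-1, 1) so atanh stays finite.
    limit = tanh(clip / 2.0)
    product = max(-limit, min(limit, product))
    return 2.0 * atanh(product)

def combine(channel_llr, other_llrs):
    """Channel LLR plus the auxiliary information derived from the XOR constraint."""
    return channel_llr + xor_extrinsic_llr(other_llrs)

# A weakly wrong read (-0.5 leans toward 1) whose stripe neighbors are all confident zeros.
extrinsic = xor_extrinsic_llr([4.0, 5.0, 6.0])
corrected = combine(-0.5, [4.0, 5.0, 6.0])

# If one neighbor is confidently 1, parity implies this bit is 1 as well.
flipped = xor_extrinsic_llr([4.0, -5.0, 6.0])
```

The extrinsic value is never more confident than the least reliable neighbor, so the XOR constraint helps most when the other pages in the stripe were read cleanly.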
The NVMe Live Migration capability (TP 4159) represents a significant advancement in storage virtualization, enabling transparent migration of NVMe controllers between NVM subsystems while actively processing host commands—a critical requirement for modern cloud and datacenter infrastructure. This presentation will explore the Host-Managed Live Migration architecture, including the newly defined commands: Track Send/Receive for monitoring user data and host memory changes, Migration Send/Receive for suspending, resuming, and transferring controller state, and the Controller Data Queue mechanism for efficient change logging.
Key topics include:
Quantum Computing is already a reality, with multiple companies announcing significant advances in this area. Just as engineers can design very complex SoCs without needing to understand the underlying transistor technology, one can design a very advanced Quantum Computing algorithm to address complex search or encryption problems without having to know the underlying theory. However, engineers are always curious to know more. This presentation will cover a simple overview of the theory behind the most fundamental Quantum concepts responsible for making Quantum Computing possible: Superposition and Entanglement.
Advancing AI infrastructure demands increasingly stringent DRAM quality requirements as technology scales below the 13 nm node, where tighter process controls are required to manage quality and reliability mechanisms. In this presentation, we will discuss key DRAM mechanisms, focusing on cell-to-cell interference, retention variability and contact resistance, gate oxide integrity, and electromigration. High Bandwidth Memory (HBM) introduces additional quality imperatives, including ultra-low defect densities, robust TSV/interposer reliability, and the power/thermal resilience needed for advancing HBM stacked architectures and extreme bandwidth demands. On the flash side, the transition to 500+ layer 3D NAND amplifies challenges such as charge-trap cycling endurance and retention degradation, read disturb amplification, and layer-dependent variability across the stack. We will discuss future enhanced screening methodologies for 3D NAND, and the key architecture, process, and materials innovations that will be needed for future AI storage infrastructure deployments.
SuperWomen at FMS Peer Exchange Happy Hour
All Industry Reception
Full list of tables to be published in due course.
Registration open throughout the day in the Santa Clara Convention Center, First Floor.
Chair's Remarks
Speakers to be determined.
Chair's Remarks
Speakers yet to be determined.
Chair's Remarks
The unrelenting pace of evolution in AI systems continues to present new memory bottlenecks. Various use cases are hitting different memory walls: KV Caches require bandwidth and capacity while being latency tolerant; model offloading requires the lowest latencies at moderate queue depths (usable IOPS); GNN training requires saturating bandwidth at small IO sizes (maximum IOPS); checkpointing requires sustained write bandwidth. NAND provides multiple methods of addressing each of these challenges through various system architectures.
In this session we will explore the various use cases and their specific memory wall problems, the systems being designed and deployed to address the set of walls, and what levers can be used to optimize the Total Cost of Ownership (TCO) for the various solutions and use cases.
As QLC SSD performance increases every generation, the problem of how much power is used for writes becomes increasingly important. QLC NAND typically exhibits slower write performance than TLC NAND (which can mean increased write power). QLC writes are normally a two-step program called “foggy-fine”, but generally the first “foggy” write is not readable and that data must be protected against power failure. The standard protection method is to copy the data into an SLC cache until the data is fully programmed into the NAND, but this has a material write power/performance impact. We present QLC Direct Write, a novel approach that eliminates the need for an SLC cache to protect the host data until the QLC program is complete. With this technology, the data is in a readable state after the first program step, so no additional PLP protection is needed. Direct Write avoids the power impact of writing and reading the SLC cache, allowing better write power efficiency. It also eliminates the SLC cache OP impact (which would consume a portion of the drive NAND capacity), so for random write workloads the SSD write amplification is lower, which can further boost performance.
Data centers are deploying ever increasing capacity Solid State Drives (SSDs). These larger capacity SSDs improve costs, reduce system power consumption, and optimize storage utilization. At the same time, the number of tenants per data center is growing. These end users are maintaining stringent performance expectations based on prior experience with direct access to prior generation SSDs. This presentation will provide an example Virtual Machine (VM) that migrates from a private SSD onto a shared SSD. Building on this example, a proposed emulation method will be discussed using Flexible Data Placement (FDP), rate limiting, and minor extensions. Our goal is to spark a community discussion and gauge industry interest in sustainable, high‑density storage solutions for multi‑tenant data‑center environments.
Generative AI adoption is accelerating inside enterprises, but most organizations are securing infrastructure, not AI behavior. LLMs introduce a fundamentally new attack surface: prompt injection, jailbreak attempts, data exfiltration, malicious URL generation, code injection, toxicity, off-topic drift, and policy bypass. Traditional AppSec and network security controls were never designed to evaluate intent inside natural language interactions.
In this session, Nishank, Senior Staff Software Engineer at Zscaler, breaks down how runtime AI protection works in real-world enterprise environments. Drawing from experience building AI Guard and AI DSPM systems, he will explain how intent-based detectors enforce guardrails on both prompts and responses, how proxy vs sidecar (DaaS) architectures change your threat model, and how organizations operationalize AI governance at scale.
The talk will cover:
- The emerging AI attack surface in LLM-powered applications
- How runtime guardrails prevent prompt injection and jailbreaks
- Architecture trade-offs: Inline proxy vs API-based DaaS deployment
- Designing scalable detection pipelines using model inference
Long-context LLM serving is a persistent-context problem: KV cache grows with context length and concurrency, so state spills beyond GPU memory into a many-tier hierarchy across host DRAM, CXL memory, and SSD. Under load, eviction and refill form a bidirectional KV stream. Keeping GPUs saturated needs overlapping compute and transfer, which depends on pinned DMA staging. That pinned staging budget is fixed by platform constraints, so KV bursts can saturate it first, causing thrash and making many-tier persistence infeasible.
We present a tiering runtime on NVMe-oC hardware that combines CXL memory and SSD, built by extending LMCache and SPDK. We enable persistent-context by adaptively splitting a pinned DMA budget into two non-competing regions: a store buffer for GPU evictions and a prefetch buffer for GPU refills. Hot and cold tagging per KV offload unit guides placement, and very cold blocks bypass staging via direct GPU-to-SSD DMA through our SPDK path. On a 2-GPU system serving 40 to 70B models at high concurrency, we observe 30 to 50% higher throughput on the same hardware.
Now that processors are being broken down into chiplets, each with its own specialty like processing, I/O, and cache, will it make sense to abandon today’s standard electrical I/O like DDR and PCIe along with SRAM caches? The answer is a resounding “Yes!” Attend this session to learn how today’s shift to chiplet-based processors will lead to new architectures, algorithms, and benefits for not only high-end processors, but all through the world of digital electronics, all the way down to the smallest processors used in battery-powered Internet end points. This session presents the architecture of the future, and gives insight into what today’s system developers and business leaders must do to keep pace with the enormous changes that are just around the corner.
HBM has become the dominant cost driver in large-scale AI inference systems. Capacity pressure stems not only from large model weights, particularly in Mixture-of-Experts (MoE) architectures, but also from rapidly growing KV cache during long-context decoding. Meanwhile, strict device-level reliability constraints inflate HBM $/GB. This talk presents a unified architecture to reduce effective HBM cost per token by addressing both reliability overhead and memory residency pressure without changing the standard HBM interface. We treat reliability as a controller-defined resource, enabling relaxed raw BER targets through coarse-grained protection and selective safeguarding of critical data fields. We further position HBM as a high-bandwidth cache backed by larger LPDDR/CXL memory. Dynamic KV placement distributes state across tiers to aggregate bandwidth under capacity constraints, while inactive MoE experts remain compressed in the lower tier to reduce footprint and migration bandwidth. Together, these mechanisms lower HBM cost per token while preserving throughput and correctness.
Enterprise SSDs inevitably undergo performance degradation as they accumulate wear, driven by diminishing available spares, rising write amplification, and block retirement. While modern SSDs expose a variety of health indicators, the industry lacks a predictable, data‑driven approach for identifying when an SSD is about to experience a significant performance drop. Existing telemetry provides raw indicators, but no practical framework connects them to an actionable EOL performance threshold.
By systematically leveraging SMART attributes and OCP‑defined log pages—including available spares, over‑provisioning utilization, and total bytes written (TBW)—operators can establish a reliable model for tracking SSD aging. These parameters enable determining a repurposing point before the device reaches its performance cliff, ensuring that drives transition smoothly to lower‑demand roles rather than causing unexpected service degradation.
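A repurposing rule built on these indicators might look like the sketch below. The field names are simplified stand-ins rather than actual SMART or OCP log page fields, and the thresholds are placeholders that an operator would calibrate against measured performance data; neither comes from the session.

```python
from dataclasses import dataclass

@dataclass
class DriveHealth:
    available_spare_pct: float   # remaining spare capacity, percent
    op_utilization_pct: float    # share of over-provisioned space consumed, percent
    tb_written: float            # total terabytes written so far
    tbw_rating: float            # endurance rating in terabytes written

def repurpose_reasons(h, min_spare_pct=20.0, max_op_util_pct=80.0, max_tbw_frac=0.85):
    """Return the indicators that have crossed their (assumed) thresholds."""
    reasons = []
    if h.available_spare_pct < min_spare_pct:
        reasons.append("available spare low")
    if h.op_utilization_pct > max_op_util_pct:
        reasons.append("over-provisioning nearly consumed")
    if h.tb_written > max_tbw_frac * h.tbw_rating:
        reasons.append("close to rated TBW")
    return reasons

healthy = DriveHealth(available_spare_pct=95.0, op_utilization_pct=30.0,
                      tb_written=1200.0, tbw_rating=7000.0)
aging = DriveHealth(available_spare_pct=15.0, op_utilization_pct=85.0,
                    tb_written=6300.0, tbw_rating=7000.0)

healthy_flags = repurpose_reasons(healthy)
aging_flags = repurpose_reasons(aging)
```

A drive with any flag set would be a candidate for a lower-demand role; returning the reasons instead of a single boolean keeps the decision auditable.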
In 2026, cybercriminals are leveraging AI-driven attacks, living-off-the-land techniques, and multi-stage ransomware campaigns that can remain undetected for months while targeting both primary and secondary data stores. Without built-in cyber resilience at the enterprise storage layer, attackers can compromise recovery paths, encrypt data, and turn enterprise storage into a force multiplier for ransomware, malware, and cyber threats.
To stay ahead, organizations must move beyond perimeter defenses and adopt an end-to-end cyber security resilience strategy, one where enterprise storage is part of mission-critical cyber security control, not just infrastructure. This session challenges IT leaders and CISOs to rethink enterprise storage as a foundational element of an enterprise’s modern cybersecurity strategy.
Attendees will learn:
- How ransomware and malware tactics have evolved and why storage is now a prime target
- Why cyber-secure storage is essential to detection, containment, and rapid recovery
- Practical steps to integrate storage into a zero-trust, resilience-first security framework
Transform your storage infrastructure into a core pillar of cyber resilience.
The NAND market has shifted from cyclical supply swings to a structural constraint driven by the AI boom. Hyperscalers are securing forward capacity for training and inference, tightening global supply and increasing allocation risk. In this environment, efficiency and resilience matter. Pure’s DirectFlash architecture, industry-leading data reduction, multivendor sourcing, and hyperscale purchasing leverage enable predictable, high-performance supply despite market volatility.
AI inference spans diverse workloads, from low‑latency chat to long‑context reasoning and large‑scale recommendations—making single, monolithic accelerator and memory designs increasingly inefficient. This talk explains how inference naturally splits into prefill and decode stages with fundamentally different bottlenecks: prefill is compute‑bound, while decode is dominated by memory bandwidth and latency. By matching memory technologies to each stage, using cost‑efficient GDDR or LPDDR for prefill and reserving premium HBM for decode, with pooled memory for KV offload, operators can significantly reduce cost per token without sacrificing latency. The session outlines emerging disaggregated architectures for AI inference workloads.
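The prefill/decode split can be made concrete with a back-of-the-envelope roofline check. The sketch below assumes a dense layer with 2-byte weights and counts only weight traffic (no attention or KV cache reads), so it is a deliberate simplification; the accelerator figures of 400 TFLOPS and 2 TB/s are hypothetical.

```python
def arithmetic_intensity(tokens_per_pass, bytes_per_param=2.0):
    """FLOPs per byte of weights read when tokens_per_pass tokens share one weight fetch."""
    flops_per_param = 2.0 * tokens_per_pass   # one multiply and one add per token
    return flops_per_param / bytes_per_param

def bound(tokens_per_pass, peak_tflops, mem_bw_tb_s):
    """Classify a pass against the ridge point where compute and bandwidth limits meet."""
    ridge = peak_tflops / mem_bw_tb_s         # FLOPs per byte
    if arithmetic_intensity(tokens_per_pass) >= ridge:
        return "compute-bound"
    return "memory-bound"

# Hypothetical accelerator: 400 TFLOPS peak, 2 TB/s memory bandwidth (ridge = 200 FLOPs/byte).
prefill = bound(tokens_per_pass=2048, peak_tflops=400, mem_bw_tb_s=2)   # whole prompt at once
decode = bound(tokens_per_pass=1, peak_tflops=400, mem_bw_tb_s=2)       # one new token per step
```

Under these assumptions a 2048-token prefill sits well above the ridge point while single-token decode sits far below it, which is the asymmetry that motivates pairing each stage with a different memory technology.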
Every LLM input token creates key-value cache entries in GPU HBM during inference. Today's AI agents routinely push contexts to 100K+ tokens, consuming gigabytes of KV cache per request. The industry response has been hardware-centric: more HBM, KV cache offloading to flash, CXL-attached memory. These approaches address the symptom. Context compression addresses the cause.
This talk presents Headroom, an open-source system (460+ GitHub stars) that compresses LLM input tokens by up to 80% before inference — directly shrinking KV cache footprint by the same ratio. A 128K-token agent context drops from ~4GB to ~800MB, enabling 5x concurrency on identical hardware. The compression is deterministic, model-agnostic, adds <100ms latency, and is fully complementary to hardware-layer solutions — paged attention and flash-based offloading both benefit when operating on already-compressed context.
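The relationship between token count, KV cache footprint, and concurrency can be sketched with the standard per-token KV sizing formula. The model shape below (32 layers, 4 KV heads, head dimension 64, 16-bit values) is a hypothetical configuration chosen only so that a 128K-token context lands near the ~4GB figure quoted above; it does not describe Headroom or any specific model.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """KV cache size for one request: keys and values for every layer and token."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return per_token * tokens


def concurrency_gain(fraction_removed: float) -> float:
    """How many compressed requests fit in the memory of one uncompressed request."""
    return 1.0 / (1.0 - fraction_removed)


# Hypothetical model shape (an assumption, not taken from the abstract).
SHAPE = dict(layers=32, kv_heads=4, head_dim=64, bytes_per_value=2)

full_tokens = 128 * 1024
kept_tokens = int(full_tokens * (1 - 0.80))  # 80% of input tokens removed

full = kv_cache_bytes(full_tokens, **SHAPE)
compressed = kv_cache_bytes(kept_tokens, **SHAPE)

print(f"full context:       {full / 2**30:.2f} GiB")
print(f"compressed context: {compressed / 2**20:.0f} MiB")
print(f"concurrency gain:   {concurrency_gain(0.80):.1f}x")
```

Because the KV cache grows linearly with token count, removing a fraction of the input tokens shrinks the cache by the same fraction, independent of the model shape assumed here.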
AI workloads are driving unprecedented power demand in modern data centers, with rack densities exceeding 100 kW and rapidly rising global consumption. While GPUs dominate most power discussions, NVMe SSDs remain a significant and often under-monitored contributor to rack-level energy use. The NVMe 2.3 specification addresses this gap through two complementary capabilities: TP4199, which standardizes 1-second power measurement, lifetime energy reporting, programmable thresholds, and persistent power logs; and TP4210, which adds rail-level voltage measurement, threshold-triggered asynchronous events, and persistent voltage history to expose power integrity issues invisible in aggregate data.
This session demonstrates these features in practice. A TP4199-based case study compares measured versus estimated SSD power under high-power workloads, highlighting reporting differences. A complementary TP4210-aligned scenario shows how per-rail voltage telemetry and logged events can reveal issues hidden when relying solely on aggregate measurements. Together, these examples show how standardized SSD-level telemetry improves visibility for managing power-intensive AI deployments.
The AI/ML era is driving an unprecedented explosion of data generation, making confidentiality and integrity of data mission critical. Hence, the industry has embraced security protocols such as IDE and SPDM that provide authentication and attestation capabilities across the storage/memory stack. Leading-edge solutions such as NVMe/CXL-based devices are being designed with these protocols, necessitating comprehensive validation of the security mechanisms. However, current test tools face limitations in modularity and debuggability, restricting validation scope. This paper presents a Python‑based SPDM SDK for establishing a mutually authenticated SPDM (Security Protocol and Data Model) session and secured sideband management. The framework leverages cryptographic libraries and defines a seamless flow to validate the SPDM protocol. It provides strong cryptographic guarantees, enabling safe and reliable management functions over sideband channels. We will also discuss a demonstrative use case for secure SPDM session establishment and firmware download over PLDM. This work contributes a practical, open-source solution for securing sideband management in emerging datacenter ecosystems.
As LLMs scale to billions of parameters and handle complex, multi-turn workloads, inference efficiency is no longer determined solely by compute power but also by how intelligently the KV cache is managed across memory and storage tiers. This talk explores a novel architecture that situates KV caching at the critical junction between GPU memory and hybrid storage. Using Linux volume groups and SPDK for NVMe over Fabrics, we treat SSD/HDD tiers as active memory extensions, not passive backends. Frequently accessed KV states remain in fast layers, while less active data moves to cost-efficient storage, eliminating redundant attention recomputation. Integrated with the Dynamo KV Block Manager and dynamic logical volumes, this reduces time-to-first-token and power consumption while easing GPU memory (HBM) pressure. The result is higher concurrency and more simultaneous users without sacrificing responsiveness. The system adapts to real-time workload patterns, improving throughput and lowering operational cost. It is a practical, scalable solution for production LLM deployment.
The traditional approach to storage security is based on a monolithic ASIC that contains all the functions on a single die. Security solutions like Caliptra are evolving rapidly and on a wholly different cadence from the drivers that push storage solutions. An SSD controller that picks up the current Caliptra release at “design freeze” will often be 18-24 months behind the latest IP release when the product comes to market. We will show how a chiplet-based solution that is cryptographically bound to the storage controller solves this problem and allows a mid-cycle security refresh on a product that is already in the market. The functional isolation means that minimal effort is needed to qualify the refreshed product.
As SSD capacities grow to 1PB, how can we keep the SSD indirection unit (IU) from growing? An IU is the unit at which the SSD tracks host data. Host writes smaller than an IU typically cause the SSD to perform read-modify-write (RMW) operations. Those RMWs impose expensive overhead on performance and NAND endurance. Therefore, it is best when the SSD host stack is designed to align with the SSD IU. Typically, SSDs around 32TB or smaller support a 4K IU, but larger SSDs support a 16K or even higher IU. In this talk we explore an SSD architecture that preserves the 16K IU as SSD capacity scales, and the potential tradeoffs of doing so. Further, this architecture would enable small form factor drives, such as E3.S, to scale to 256TB. The key change in this architecture is moving the host-data mapping table from DRAM to NAND. This approach breaks the linear growth of SSD DRAM with SSD capacity, with the benefits of a consistent 16K IU as drive capacity scales and more PCB space for NAND because less DRAM is needed. We will review measured data from these reduced-DRAM drives to show which workload tradeoffs exist in this architecture.
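The pressure that capacity puts on mapping-table DRAM, and the cost of sub-IU writes, can be estimated with simple arithmetic. This is a rough sketch under stated assumptions: a flat table with one entry per IU, byte-aligned entries just wide enough to address every IU, no overprovisioning, and binary units (TiB, KiB). Real controllers may pack or compress entries differently.

```python
def mapping_table_bytes(capacity_bytes: int, iu_bytes: int) -> int:
    """Size of a flat logical-to-physical table with one entry per IU."""
    entries = capacity_bytes // iu_bytes
    entry_bits = (entries - 1).bit_length()   # bits needed to address any IU
    entry_bytes = (entry_bits + 7) // 8       # assume byte-aligned entries
    return entries * entry_bytes


def rmw_write_amplification(host_write_bytes: int, iu_bytes: int) -> float:
    """Bytes the drive rewrites per host byte for one aligned write.

    A write smaller than the IU forces a read-modify-write of the whole IU.
    """
    if host_write_bytes >= iu_bytes:
        return 1.0
    return iu_bytes / host_write_bytes


TiB, GiB, KiB = 2**40, 2**30, 2**10

for capacity, iu in [(32 * TiB, 4 * KiB), (256 * TiB, 4 * KiB), (256 * TiB, 16 * KiB)]:
    size = mapping_table_bytes(capacity, iu)
    print(f"{capacity // TiB:>4} TiB at {iu // KiB:>2} KiB IU: {size / GiB:.0f} GiB table")

print("4 KiB write into a 16 KiB IU:", rmw_write_amplification(4 * KiB, 16 * KiB), "x")
```

Under these assumptions the table grows linearly with capacity at a fixed IU, which is the growth that moving the table from DRAM to NAND is meant to decouple from DRAM size.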
As data center workloads diversify across storage, compute, and AI platforms, thermal management strategies must evolve to meet distinct operational demands. Traditional air cooling remains viable for conventional storage with moderate power envelopes, while compute-intensive and AI environments increasingly demand direct liquid cooling and immersion solutions for targeted heat extraction at extreme power densities. This presentation quantifies thermal dissipation budgets across cooling technologies, comparing air, direct liquid, and immersion cooling capabilities for current and next-generation SSDs operating at PCIe Gen6 and Gen7 power levels. We examine how each cooling approach aligns with specific platform requirements and power roadmaps. For newer applications like network-attached storage used for inference, high-density SSDs are not just nice to have but a requirement. With interface speeds moving from Gen5 to Gen6 and Gen7, even for network-attached storage, thermal management becomes a priority, especially when storage must be deployed at massive scale to feed data to AI compute. The session will also cover the impact of fan-less AI servers on SSD thermal architecture.
Key Per IO is an NVMe standard that enables confidential computing, allowing hosts to control security keys. However, multi-tenant environments often need more features, including effective key management, optimized memory utilization, and simplified processes for tenant on-boarding and off-boarding. In this paper, we propose a scalable approach to deploying Key Per IO, including techniques for key distribution and ownership, per-I/O encryption handling, and scaled tenant management. By integrating these enhancements, the industry can fully realize strict key separation through zero-trust storage models, tenant-controlled cryptography, and device-independent key ownership, driving more robust and scalable security solutions.
High Bandwidth Memory (HBM) has emerged as a critical enabler for artificial intelligence (AI) workloads, offering the massive bandwidth and low power consumption necessary to meet the growing computational demands of deep learning and high-performance computing. HBM is pivotal for AI training as well as AI inference because its high bandwidth and comparatively low latency enable fast data access and processing, which helps in handling large datasets and performing complex calculations. With AI models continuously expanding in complexity, efficient verification methodologies for HBM devices are essential to ensure reliability and performance across various configurations. To meet this rising need, advanced verification methodologies must cover a wider operating range, more use cases, and more complex error scenarios.
Our approach significantly reduces Turn-Around Time (TAT) by meticulously generating testcase scenarios derived from both extensive in-house knowledge and real-time customer usage data, thereby making them exceptionally close to real-world operational conditions.
The presentation will showcase IBM’s breakthrough in content‑aware storage (CAS) through the development of a 100‑billion‑vector database running on a single server, enabled in large part by Samsung’s advanced enterprise solid‑state drive technology. By integrating Samsung’s high‑density PM9D3a PCIe® Gen5 NVMe™ SSDs—each delivering up to 30.72TB of capacity and exceptional sequential read/write performance—IBM Research achieves unprecedented vector density and throughput within the IBM Storage Scale System 6000. Samsung’s cutting‑edge, mass‑produced SSDs form the foundation of a storage architecture capable of supporting extreme‑scale semantic search with over 90% recall precision and sub‑700‑millisecond query latency. Combined with IBM’s hierarchical GPU‑accelerated indexing strategy and partnerships with NVIDIA and LanceDB, this collaboration demonstrates how Samsung’s leadership in memory and storage technologies plays a critical role in enabling AI‑driven retrieval‑augmented generation (RAG) at enterprise scale. The result is a storage‑centric AI platform that empowers organizations to unlock value from proprietary data while minimizing infrastructure complexity and cost.
As QLC NAND densities reach 61TB and beyond, the storage industry is rapidly approaching a strategic crossover point with high-capacity HDDs. However, raw acquisition cost remains a significant hurdle for many enterprise users. This session presents a comprehensive joint study by Scality and the Samsung Memory Research Center (SMRC) introducing the "Performance TCO" (PTCO) model. We move beyond simplistic $/GB metrics by benchmarking real-world datasets across hundreds of diverse HDD platforms globally and comparing them directly against lab results obtained on SMRC QLC platforms. By examining 16KB cell alignment, Write Amplification Factor (WAF) in sequential SDS workloads, and high-throughput use cases like Veeam Backup & Replication targets, we illustrate how QLC is positioning itself as a vital economic and performance-driven alternative for modern high-density object storage environments.
Confidential Compute (CC) completes the trifecta of data and code protection by protecting them while in use via Trusted Execution Environments (TEEs). It ensures the ‘CIA’ (Confidentiality, Integrity, and Authenticity) during data processing for secure and privacy-preserving computing. The key enablers of CC include Secure Boot, Attestation (DMTF SPDM), and Memory Encryption. The data center industry’s goal is to make CC ubiquitous and to minimize performance hit and friction, enabling wider adoption and seamless, reliable FW/SW updates. CC using HW-based, attested TEEs protects sensitive data and code against threats during execution. It allows for the protection of data in use, even against an adversarial platform owner. This is achieved through hardware-based isolation (e.g., Intel SGX, AMD SEV, CXL TE bits, ARM Realms CCA) and attestation to verify the integrity of the TEE before use. CC is also orchestrated over the memory fabric. CXL TSP defines mechanisms to include CXL memory devices within the TEE. CXL tracks cache coherence at the cache line level, 64 bytes. The talk will go into the security requirements and behaviors that support CC use cases and cover the architecture and design of CC.
As LLMs increasingly handle long-context workloads, the memory pressure on KV caches has emerged as a critical bottleneck for performance and scalability. We propose an architecture that offloads the KV cache from HBM to CMM-Ax and performs sparse attention operations directly on Processing-near-Memory (PNM). By exploiting the inherent characteristics of sparse attention, we design an architecture that maximizes the PNM's bandwidth utilization and fully capitalizes on its scalable capacity. Built atop an Ethernet-based node-level disaggregated architecture, the end-to-end system integrates real PNM hardware, a RoCE v2 stack, and device-level optimizations. We implement split-batch routing and parallel execution with GPU attention to maximize GPU utilization and consequently alleviate Head-of-Line (HoL) blocking during long-context inference, significantly improving overall system efficiency.
The contemporary landscape of multi-die systems is characterized by rapid advancements and intricate integrations, encompassing a wide array of protocols (e.g., PCIe, CXL, AXI, CHI, UALink, Memories) and interfaces (e.g., C2C, CXS). The strategic adoption of on-package integration, coupled with the performance advantages afforded by UCIe, necessitates a comprehensive and holistic system-level verification strategy for chiplet designs. This approach is critical for meticulously addressing synchronization issues, mitigating timing variations, and ensuring robust protocol interoperability, thereby guaranteeing optimal performance, fortified security, and inherent system stability. We will discuss the architecture of a versatile adapter engineered to facilitate the seamless integration of disparate protocol layers within a UCIe verification environment. This solution streamlines the development and execution of system-level verification testbenches. Empirical validation through multiple design implementations will be presented, demonstrating the architecture's efficacy in enabling the reuse of existing protocol testbenches with minimal adaptation for UCIe-specific verification requirements.
The current generation of SSDs operates on 4 KB block boundaries, so reads and writes to Flash devices are organized around this block size. AI workloads operate on smaller granularities than the traditional 4 KB SSD FEC block, often using 512 B or 1 KB blocks of information. At the same time, read latency for AI operations must be as small as possible. These smaller blocks must have low latency without sacrificing error correction strength. We present a novel block structure that supports low-latency reading and decoding for AI workloads, along with a decoding procedure that also provides excellent error correction performance. The forward error correction method supports fast small-block reads and decodes, a second layer of hard-decision decoding, and multi-read soft decoding to recover any high-error blocks from the Flash.
Placing high-IOPS SSDs near GPUs is a viable solution to the memory capacity challenges faced by GPUs today. As problem sizes and data sets grow, the GPU’s existing memory space proves to be insufficient, and data must be fetched from other storage layers. SSDs optimized for GPU-initiated access can solve this problem by enabling a scalable tier of Storage Next to HBM. These SSDs are optimized for ultra-high performance with novel, highly parallel access models and can deliver never-before-seen levels of NVMe performance for GPU-initiated I/O.
Refreshment Break
Chair's Remarks
Chair's Remarks
Chair's Remarks
Speakers to be determined.
Transformer-based generative AI has turned the key/value (KV) cache into one of the largest and most performance-critical working sets in modern AI systems. As context windows grow and request concurrency rises, KVCache capacity and bandwidth increasingly determine latency, throughput, and total cost, often driving decisions around GPU/HBM sizing, host memory, and storage tiering. This session brings together system builders and memory/storage architects to examine KVCache management end to end: data layout and access patterns; paging, allocation, and eviction; compression and quantization; multi-GPU and multi-node sharing; tiering and offload to host DRAM and NVMe/SSD; and reliability, isolation, and security considerations in multi-tenant deployments. We will connect software techniques to emerging hardware directions (e.g., higher-bandwidth memory, pooling/tiering, and disaggregated memory/storage) and highlight where cross-layer co-design is needed. Attendees will leave with a practical taxonomy of KVCache techniques, guidance on when to use each approach, and a set of metrics and workload characteristics to evaluate solutions in production.
Chair's Remarks
A precise abstract will be provided after approvals from the participating companies, estimated by the end of the first week of March. Below is a placeholder: As AI large language models (LLMs) continue to scale up in size, the industry has collided with the "Memory Wall": limits on memory bandwidth and capacity growth are leading to severe bottlenecks in AI inferencing performance, energy efficiency, and total cost of ownership (TCO). This panel explores the emergence of High Bandwidth Flash (HBF) as a disruptive architectural shift designed to bridge the gap between volatile HBM (High Bandwidth Memory) and traditional NAND storage. The session will discuss how HBF technology aims to redefine the memory hierarchy by providing near-memory speeds with the density and persistence of NAND flash. The panel will bring together experts from HBM/HBF solution providers and hyperscale AI infrastructure providers to dissect HBF usage and the development needed for success, including architectural integration, technical challenges, standardization timelines, performance, and economics.
As the AI landscape shifts from training-heavy models to massive-scale inference deployment, the industry faces a critical challenge: How do we store and manage the unprecedented volume of AI-generated data? This session provides a data-driven roadmap for infrastructure architects and industry leaders. We begin with a comprehensive forecast of data volumes produced by AI inference services, illustrating the shift in storage demand. Central to this discussion is a refined Data Tiering Strategy, where we analyze the evolving requirements across "Hot," "Warm," and "Cold" storage categories to optimize performance and cost.
Furthermore, we will examine Market Share Allocation, specifically focusing on the accelerating transition of AI workloads toward NAND-based storage solutions. To conclude, we provide a Strategic Supply Outlook, offering a long-term forecast of the NAND supply landscape to help organizations future-proof their AI infrastructure.
Update on the things that hyperscalers care about for storage.
Update on the things that hyperscalers care about for storage.
Update on the things that hyperscalers care about for storage.
The rapid proliferation of on-device AI and the expanding frontier of edge computing are placing unprecedented demands on mobile flash storage architectures. Existing storage interfaces have become a critical bottleneck in realizing the full potential of next-gen AI platforms. This proposal presents a comprehensive deep dive into the recently released MIPI UniPro v3.0 and M-PHY HS-G6, and the design verification challenges they raise. UniPro and M-PHY have served as the interconnect layer for JEDEC UFS, enabling high-performance, low-power flash storage across a broad range of devices. For the first time in mobile storage history, PAM4 signaling comes to M-PHY with HS-G6, bringing 46 Gbps of per-lane bandwidth. This makes UFS 5.0 not just an incremental update but a generational shift, unlocking speeds previously confined to datacenter SSDs and now delivered within the power and thermal envelope of a smartphone SoC. Sitting above the M-PHY is UniPro v3.0, with its redesigned Transport Frame Structure (TFS), Reed-Solomon forward error correction (RS-FEC), 64-bit cyclic redundancy check (CRC), TFS data scrambling, gray coding, precoding, and lane alignment features, designed to operate at a bit error rate (BER) of less than 10^-22 at the application layer.
The rapid progress of quantum computing poses an imminent risk to classical cryptographic methods, compelling the storage industry to transition toward quantum‑safe security architectures. Recent standardization milestones—such as NIST’s finalized post‑quantum cryptography (PQC) algorithms including ML-KEM, ML-DSA, LMS and XMSS—established the foundational tools required for securing future HDDs and SSDs against quantum‑enabled attacks. Industry organizations are now aligning storage‑security specifications with these PQC requirements.
This talk will outline the quantum‑safe readiness of each major storage‑security standard, examine cross‑industry migration timelines, and highlight the strategic role of HDD/SSD vendors and System Integrators in ensuring cryptographic resilience before the arrival of large‑scale quantum systems.
The introduction of the UCIe manageability architecture has transformed the way chiplet-based System-in-Package (SiP) platforms are configured, controlled, and debugged. UCIe introduces a robust manageability framework, enabling structured configuration, communication, and monitoring between management entities embedded in chiplets. The flexibility of the manageability architecture enables the modeling of a management fabric with diverse topologies, including point-to-point, mesh, and daisy chain configurations. This diversity of topologies introduces non-trivial challenges in developing a robust verification solution for UCIe-based designs. This presentation explores two critical challenge areas in depth: management discovery along with routing, and security-enforced access mechanisms. Our goal is to identify strategic algorithms that simplify their implementation, ultimately streamlining the verification process for UCIe systems. The proposed algorithm ensures acyclic routing and avoids deadlocks in management topologies with redundant or cyclic links. It also configures elements to restrict unauthorized access in security-sensitive scenarios.
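One common way to obtain acyclic routes in a topology with redundant or cyclic links is to restrict routing to a spanning tree discovered by breadth-first search. The sketch below illustrates that general technique only; it is not the presenters' algorithm, and the node names (`director`, `c1` to `c3`) and function names are hypothetical.

```python
from collections import deque


def spanning_tree_routes(links, root):
    """Breadth-first discovery from `root`.

    Returns {node: parent}. Only parent/child links are used for routing,
    so redundant links that would close a cycle are left unused.
    """
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    parent = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neighbor in sorted(adjacency.get(node, ())):  # sorted for determinism
            if neighbor not in parent:                    # first visit wins
                parent[neighbor] = node
                queue.append(neighbor)
    return parent


def path_to_root(parent, node):
    """Hop-by-hop route from `node` up to the root of the tree."""
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


# Hypothetical ring of one management root and three chiplets (contains a cycle).
ring = [("director", "c1"), ("c1", "c2"), ("c2", "c3"), ("c3", "director")]
tree = spanning_tree_routes(ring, root="director")

print(tree)
print(path_to_root(tree, "c2"))
```

A tree over N nodes uses exactly N - 1 links, so no route can loop. The cost is that redundant links carry no traffic unless the tree is recomputed after a failure.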
AI is accelerating growth in data center scale, power demand, and infrastructure complexity. As AI workloads expand, terrestrial limits (including grid capacity, physical security, environmental risks, and geopolitical exposure) are shaping long-term infrastructure planning. Space is emerging as an extension of global compute infrastructure. Google’s Suncatcher, NVIDIA-backed Starcloud, and SpaceX with xAI have outlined distributed, satellite-based AI architectures. Instead of a single orbital station, these models rely on satellite swarms built for scalability, fault tolerance, and strategic resilience across commercial and defense use cases. Storage is central to this shift. AI systems require high-capacity, high-performance NAND that operates reliably in space. Traditional rad-hard designs are capacity-limited and cost-prohibitive. Advances in controllers, LDPC error correction, firmware mitigation, and system-level recovery now enable commercial TLC and QLC NAND to meet space reliability demands without fully custom rad-hard components. With launches targeted as early as Q4 2026, the industry is advancing toward TRL-9 storage platforms validated in orbit.
Exhibition opens in the main exhibition halls.
In 1987, Jim Gray’s Five-Minute Rule provided a simple economic guideline for deciding when data should reside in DRAM versus storage. Revisited multiple times over four decades, the break-even interval consistently remained on the order of minutes, reinforcing flash as a secondary storage tier. This talk reexamines the rule from first principles in the AI era. We introduce a feasibility-aware framework that integrates host processor cost, DRAM bandwidth and capacity, device-level NAND timing, channel parallelism, and realistic SSD IOPS/$ scaling. We show that when GPU-centric hosts are paired with Storage-Next SSDs delivering 50M+ small-block IOPS, the DRAM↔flash caching threshold collapses from minutes to seconds. This shift promotes NAND flash from a passive capacity layer to an active extension tier of memory, with GPUs emerging as high-throughput I/O engines. We will present analytical insights, device-level modeling results, and system-level implications for AI infrastructure, including vector databases, recommender systems, and large-scale inference. The result is a new provisioning framework that redefines memory-storage balance for modern AI workloads.
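For reference, the classical break-even interval behind the Five-Minute Rule is the ratio of pages per unit of DRAM to device accesses per second, multiplied by the ratio of device price to DRAM price. The sketch below implements only that classical formula, not the feasibility-aware framework of the talk, and every price and IOPS figure in it is a placeholder assumption rather than market data.

```python
def break_even_seconds(page_bytes: int, device_iops: float,
                       device_price: float, dram_price_per_gib: float) -> float:
    """Classical five-minute-rule break-even interval.

    A page re-referenced more often than this interval is cheaper to keep
    in DRAM; a page touched less often is cheaper to fetch from the device.
    """
    pages_per_gib = 2**30 / page_bytes
    return (pages_per_gib / device_iops) * (device_price / dram_price_per_gib)


DRAM_PRICE = 3.0  # placeholder: dollars per GiB of DRAM

# Placeholder conventional SSD: 4 KiB pages, 500K IOPS, $1000.
conventional = break_even_seconds(4096, 500e3, 1000.0, DRAM_PRICE)

# Placeholder high-IOPS SSD: 512 B pages, 50M IOPS, $2000.
high_iops = break_even_seconds(512, 50e6, 2000.0, DRAM_PRICE)

print(f"conventional SSD: {conventional:.0f} s")
print(f"high-IOPS SSD:    {high_iops:.0f} s")
```

The interval scales inversely with IOPS per dollar, so a device that delivers far more small-block IOPS for a similar price shortens the interval proportionally. That is the lever the abstract refers to.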
High Speed Link Startup Sequence (HS‑LSS) is an emerging UFS capability that reduces storage bring‑up latency and can improve end‑to‑end boot KPIs. Ecosystem readiness varies, so SoCs must support both HS‑LSS and legacy LS‑LSS during a multi‑generation transition to maintain compatibility. This presentation describes Qualcomm’s production enablement of HS‑LSS with deterministic early‑boot selection across mixed UFS device configurations. Because devices may power up expecting HS‑LSS or LS‑LSS depending on hardware configuration, the host must select the correct startup sequence early, before any stage that depends on storage access. Our architecture makes this choice OEM‑selectable while reusing existing boot‑configuration mechanisms (no dedicated GPIO and no one‑time fuse/SKU lock‑in). Results demonstrate measurable improvements in time‑to‑UFS‑ready and boot-path determinism while maintaining backward compatibility, with a clear path to retire LS‑LSS support as adoption matures. Since link startup can occur multiple times (cold boot, warm reboot, recovery), the startup-time gains can compound linearly over a platform’s operational lifetime.
Data security is paramount for modern data centers, both to protect sensitive data from cyberattacks and to prevent exfiltration after devices are decommissioned. Several industry and government regulations mandate a strict end-to-end data security framework. Data-at-rest encryption in an SSD is a key component of this framework. The implementation of these technologies in SSDs faces several challenges. One of them relates to fundamental limitations of the AEAD (Authenticated Encryption with Associated Data) algorithms used. This talk presents an overview of some of the challenges and the mitigation strategies that SSD designs can deploy.
Artificial intelligence and machine learning pipelines impose diverse, demanding storage access patterns that traditional SSD benchmarks fail to capture. We present a trace-driven, end-to-end benchmarking methodology reflecting true AI/ML workloads: from random-read data loading and checkpointing to burst feature writes and high-concurrency inference. By reconstructing empirical I/O traces as synthetic fio workloads, we benchmark multiple NVMe SSDs across every ML pipeline stage. Our findings reveal that no single device is best for all tasks: drive performance varies widely by workload, exposing differences in concurrency handling, write endurance, and real-world latency. We offer practical insights and recommendations for AI practitioners, infrastructure engineers, and SSD vendors, enabling evidence-based storage selection and system tuning for modern ML applications.
As AI workloads become more complex and data-intensive, the need for increased computational density and memory bandwidth is paramount. Chiplet-based architectures offer a solution by disaggregating monolithic systems-on-chip (SoCs) into smaller, specialized dies. This talk will analyze the needs and benefits of chiplet technology. A significant portion of the presentation will be dedicated to the architectural requirements and development of novel chiplet-based memory solutions for AI-driven applications in data centers and mobile devices, using CXL memory, cHBM, and HBF (High Bandwidth Flash). It will also give an update on HBS (High Bandwidth Storage), SK hynix's mobile chiplet solution based on the MOSAIC Platform and the UCIe interface.
This presentation proposes a cross-layer framework that integrates physical-layer mitigation with adaptive system-level recovery. To address radiation-induced Vth shifts in 3D NAND, the approach proactively adjusts read reference voltages and triggers data scrubbing before errors surpass LDPC correction thresholds. To combat catastrophic firmware corruption where multi-backup ISP codes fail, the framework implements autonomous firmware recovery via an onboard MCU design; for space-constrained HS BGA SSDs, recovery is achieved through a System-Level mechanism requiring integrated support across SSD and host hardware, firmware, and software.
A panel on overcoming flash storage constraints and supply scarcity limiting enterprise AI adoption. The discussion will bring together leaders from VAST Data and other ecosystem partners to examine the real-world impact of capacity shortages, cost pressures, and performance demands driven by large-scale AI and GPU infrastructure.
The session will explore how architectural innovation can extract more usable capacity, improve endurance economics, and extend flash supply without compromising performance. Panelists will provide practical guidance for enterprises seeking to scale AI responsibly and profitably, aligning storage strategy with the next wave of data-intensive workloads.
As Generative AI (GenAI) model parameters continue to expand (7B+), edge devices face immense memory bandwidth pressure. This session analyzes the I/O characteristics of Edge AI inference and explores how UFS controller features such as Host Performance Booster (HPB) and smart prefetching algorithms can drastically reduce Model Load Time and Time-to-First-Token (TTFT), delivering a seamless AI user experience on edge devices.
The transition to post-quantum cryptography (PQC) is critical for securing firmware in memory and storage systems, as future quantum computers threaten classical public-key algorithms. In response, NIST has released its PQC standards and published a multi-year migration timeline to guide the transition away from quantum-vulnerable cryptography. In parallel, the NSA’s CNSA 2.0 suite mandates quantum-resistant algorithms and recommends the immediate adoption of PQC mechanisms for software and firmware signing.
This talk presents a performance-driven evaluation of PQC for storage systems by benchmarking CNSA 1.0 and CNSA 2.0 firmware image encryption and authentication algorithms on embedded CPUs. Benchmark results highlight key metrics such as signature verification latency and AES-256 firmware decryption performance. The talk also introduces a hybrid deployment model in which CNSA 1.0 and CNSA 2.0 algorithms operate in parallel, providing a built-in fallback to classical CNSA 1.0 methods if a newly standardized PQC algorithm is later found to be vulnerable. This approach enables long-term resilience as systems transition toward fully quantum-resistant security.
We look at the move towards liquid-cooled drives, the state of the standards, and the implications for future drive designs. The push for high-power-density data centers and high-performance AI workloads is accelerating the move towards liquid-cooled drives. We look to answer these questions: What are the new power limits? What is the state of standards? How do form factor differences impact liquid cooling? How will the same drive operate in air-cooled and liquid-cooled environments?
As AI and HPC workloads demand ever-higher memory bandwidth, traditional DIMM architectures face significant scaling challenges. This talk explores how integrated memory interface chipsets — including multiplexed registered clock drivers, data buffers, and dedicated PMICs — enable DDR5 MRDIMMs to deliver up to 16 GT/s while maintaining signal integrity and efficient power delivery. We examine module-level design tradeoffs, system implications for data centers, and a forward look at interface requirements for future DDR6 modules.
AI infrastructure is shifting from training-heavy, compute-bound systems to inference-dominated deployments constrained by memory bandwidth, capacity, data movement, and power. As model weights, embeddings, and KV caches expand, architectural innovation is moving beyond traditional GPU scaling toward wafer-scale systems, memory-first dataflow accelerators, chiplet-based inference ASICs, and inference-specialized designs.
The rapid rise of on-device generative AI across multiple platforms, from mobile and laptops to edge devices, is redefining storage requirements. These platforms can run large language models (LLMs), multimodal assistants, and persistent AI agents directly in the system without needing network connectivity. However, these systems require higher storage bandwidth, lower latency, and improved power efficiency to enable faster model loading, real-time inference, and seamless background AI operation.
UFS 5.0 is designed for this new AI-first era. With next-generation interface speeds that significantly increase throughput over prior generations, along with architectural enhancements for improved signal integrity and energy efficiency, UFS 5.0 provides the performance headroom required for sustained GenAI workloads while preserving battery life and thermal limits. This session will explain the new UFS 5.0 interface, which leverages the MIPI M-PHY 6.0 and UniPro 3.0 standards, and examine how UFS 5.0 addresses performance bottlenecks at the system level.
Enhancing the security and management capabilities of next-generation SSDs is a key priority for Microsoft Azure. This talk shares the motivation and specific direction Microsoft Azure is taking for new SSD capabilities to more comprehensively secure SSD firmware and protect SSD data-at-rest, while efficiently managing Azure’s very large multi-vendor fleet of SSDs. Capabilities addressed include I3C, Streaming Boot / Secure Firmware Recovery, Dual Firmware Signing, OCP LOCK for at-rest encryption key management, PQC, and Caliptra RoT integration.
Speakers to be determined.
Hyatt Regency Hallway, Mission City Ballroom Lobby
Chair's Remarks
Chair's Remarks
As AI workloads grow exponentially, traditional interconnects—designed for scale-out or proprietary deployments—are struggling to keep pace with the bandwidth, latency, and scalability demands of next-generation AI systems.
UALink is an open, memory-semantic fabric purpose-built for scale-up architectures, enabling direct memory access and atomic operations across up to 1,024 GPUs. The 200G 1.0 specification, released in April 2025, established a vendor-neutral foundation; since then, continued specification work has added features critical for production deployment in hyperscale environments. However, deployment requires broad ecosystem collaboration across hardware, software, and systems.
In this session, Astera Labs—a founding Board Member of the UALink Consortium—will join fellow panelists to discuss the practical challenges and breakthrough solutions behind deploying UALink at rack scale. Attendees will gain insight into recent specification updates, UALink’s security features and compliance considerations, memory semantics, protocol optimization, power and cost efficiencies, and the multi-vendor ecosystem driving this industry-wide transformation.
Panellists to be determined.
Chair's Remarks
Chair's Remarks
Chair's Remarks
Chair's Remarks
JEDEC specifications define interoperability and reliability—but innovation often happens at the margins. This panel convenes leading industry reviewers to examine what occurs when DRAM and SSDs operate beyond standardized parameters.
We’ll analyze scaling behavior, controller firmware response, error rates, thermals, signal integrity constraints, and endurance trade-offs. Panelists will share testing frameworks, failure modes observed in the lab, and the challenges of comparing results across platforms and BIOS revisions.
This interactive panel brings together respected hardware reviewers to discuss:
- How they design fair and reproducible overclocking tests
- The impact of timing adjustments, voltage changes, and thermal management
- PCIe and controller constraints in SSD overclocking
- Stability validation and data integrity risks
- How overclocking affects AI, gaming, and content-creation workloads
Modern large language models (LLMs) such as DeepSeek-R1, Llama-3.1-405B, and emerging trillion-parameter architectures are fundamentally constrained by memory capacity, memory bandwidth, and data movement latency. Serving these models increasingly requires distributed inference architectures spanning multiple GPU nodes, where memory hierarchy and interconnect performance become the dominant system bottlenecks.
This session presents a deep technical exploration of how distributed GPU memory systems and high-speed interconnect technologies enable large-scale AI inference. Drawing from a production reference architecture deployed on VMware Private AI infrastructure, the talk examines how GPUDirect RDMA over InfiniBand enables direct GPU-to-GPU memory transfers across nodes while bypassing CPU memory copies and minimizing PCIe overhead.
The session analyzes the architectural building blocks required for scalable AI memory systems, including NVIDIA HGX platforms, NVLink/NVSwitch intra-node GPU fabrics, RDMA-enabled inter-node networking, and distributed GPU memory orchestration across Kubernetes clusters.
The rapid proliferation of AI-driven workloads has led to ever more complex requirements for modern storage subsystems. To meet these demands, SSD controller architectures have evolved into a fragmented landscape, offering a spectrum of types tailored to specific workloads. SSD classes include SLC, TLC, or QLC-based flash devices, alongside emerging multi-tier designs. Although some emerging drives expose tiering directly to applications, integrating these heterogeneous drives remains disruptive to existing software stacks.
We introduce DT-RAID, a novel, tier-aware software RAID that eliminates this friction. By transparently monitoring I/O access patterns, our system makes real-time data-placement decisions without requiring application changes. It maps "hot" data to fast (SLC) flash and "cold" data to denser TLC/QLC media and dynamically resizes the storage tiers to fit the workload. Our approach increases throughput, reduces latency, and markedly cuts device wear, while significantly mitigating latency tails and workload interference. The DT-RAID architecture allows systems to finally harness the full performance and lifetime benefits of heterogeneous drives.
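As a rough illustration of the hot/cold placement idea described above, the sketch below counts accesses per block and maps frequently accessed blocks to a fast tier. This is not the DT-RAID implementation: the class name `TierPlacer`, the `hot_threshold` parameter, and the tier labels are all hypothetical, and a real system would also age counters and resize tiers.

```python
from collections import Counter


class TierPlacer:
    """Toy hot/cold classifier (hypothetical, not DT-RAID itself)."""

    def __init__(self, hot_threshold=3):
        # Assumed policy: a block is "hot" once it has been
        # accessed at least hot_threshold times.
        self.hot_threshold = hot_threshold
        self.access_counts = Counter()

    def record_access(self, block_id):
        # Observe one I/O to the given block.
        self.access_counts[block_id] += 1

    def tier_for(self, block_id):
        # Hot blocks go to the fast tier, everything else to dense media.
        if self.access_counts[block_id] >= self.hot_threshold:
            return "SLC"
        return "QLC"


placer = TierPlacer(hot_threshold=3)
for block in [7, 7, 7, 42]:
    placer.record_access(block)

print(placer.tier_for(7), placer.tier_for(42))
```

A fixed threshold is the simplest possible policy; it is used here only to make the monitoring-then-placement loop concrete.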
AI training exchanges massive amounts of data with SSDs and then stores it on NL-HDDs. A ransomware attack on such storage devices risks holding the AI hostage. In Japan in 2024, 49% of cases could not be recovered within one month. Is a one-month interruption of business operations acceptable? It might appear easy to recover manipulated data by using timestamps and a supervising list that records which data is stored on which storage device. Attackers, however, manipulate not only the data but also the supervising list. If the manipulated supervising list has to be fixed manually, the time to recovery grows as the number of storage devices increases.
Device identification in each storage device may protect the supervising list from manipulation. The hierarchy of device identification consists of device certificates, manufacturing records in OTP, and PUF, from the bottom. Because holding AI to ransom is so lucrative, a simple OTP-based solution is not sufficient. Implementing a PUF in each storage device can be costly. We propose Physical Cyber Authentication (PCA), which provides the same security level as a PUF with an implementation as easy and inexpensive as software, and without sacrificing robustness to temperature changes.
Reliability testing is a cornerstone of flash memory qualification, but traditional Reliability Demonstration Testing (RDT) requires large sample sizes, long test durations, and high cost, limiting qualification speed as NAND technologies scale. This paper presents a practical framework for Sample Size Reduction (SSR) in reliability testing for modern NAND‑based storage devices.
The approach replaces pass/fail assessment with parametric‑driven testing using continuous health indicators such as endurance behaviour, error rate trends, wear‑leveling efficiency, latency stability, and performance degradation. Statistical confidence is preserved through binomial and parametric analysis with risk‑based allocation across life phases, while adaptive stress profiling removes redundant testing without reducing failure sensitivity. Applied across reliability and performance phases, this methodology enables meaningful sample size reduction without compromising qualification requirements, lowering cost and timelines while improving tester utilization.
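To give a sense of why pass/fail demonstration testing needs large samples, the sketch below computes the classical zero-failure binomial bound, n ≥ ln(1 − C) / ln(R), for a target reliability R and confidence C. This is a generic textbook relation offered as background, not the SSR method of the talk; the function name `zero_failure_sample_size` is hypothetical.

```python
import math


def zero_failure_sample_size(reliability, confidence):
    """Smallest n such that n units passing with zero failures
    demonstrates `reliability` at the given `confidence`
    (binomial model; solves reliability**n <= 1 - confidence)."""
    if not (0.0 < reliability < 1.0 and 0.0 < confidence < 1.0):
        raise ValueError("reliability and confidence must be in (0, 1)")
    return math.ceil(math.log(1.0 - confidence) / math.log(reliability))


for r, c in [(0.90, 0.90), (0.95, 0.95), (0.99, 0.90)]:
    print(f"R={r:.2f}, C={c:.2f}: n={zero_failure_sample_size(r, c)}")
```

The sample count grows quickly as the reliability target approaches 1, which is the cost pressure that parametric, health-indicator-based approaches aim to relieve.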
We are developing technologies to eliminate the memory bottleneck in AI inference on AI‑DC infrastructure. The problem is that even when a large remote memory pool solves storage and cost issues, the data must still be moved to local GPUs, leaving a persistent bottleneck. Demand for remote‑transfer performance keeps rising.
In practice, much of AI‑serving cost and latency comes from the GPU‑memory‑network path rather than from computation. In large, highly concurrent services, remote data movement becomes the primary performance and cost bottleneck.
Our solution is a new memory system that fundamentally cuts data movement. By computing where data resides and transmitting only the results, we dramatically reduce transfers using Processing‑in‑Memory (PIM) and Processing‑Near‑Memory (PNM). We have built a prototype large‑language‑model (LLM) inference system on PIM/PNM and are expanding it into a next‑generation memory platform. We plan to share our early results.
The circular economy for storage devices—centered on reuse, refurbishment, and recycling—has become increasingly critical as global data demand accelerates. Explosive growth in cloud computing, AI workloads, and edge systems continues to drive unprecedented consumption of both solid state and magnetic storage. As device volumes expand, so does the urgency to mitigate environmental impact and reduce reliance on finite raw materials. In particular, the scarcity and geopolitical vulnerability of rare earth elements and critical minerals used in storage components are pushing the industry toward more sustainable, resource efficient models. We will discuss recent advancements in circularity, including rare earth recovery, and ways that the industry can partner to be more efficient in all aspects of circularity.
Roger Cummings is CEO and President of PEAK:AIO, a platform purpose-built to help enterprise organisations scale, govern, and secure their AI and HPC workloads by transforming commodity hardware into high-performance, software-defined storage systems.
Roger brings a proven record of building and scaling technology companies, having led five early-stage businesses through successful acquisitions and raising over $1 billion in combined funding. His expertise spans application infrastructure and AI/ML technologies, with previous executive roles including CEO of FitStack, a University of Wisconsin DevOps intelligence spin-out, and Evidence IQ, an AI/ML company specialising in evidence-based data intelligence.
A recognised thought leader in AI application infrastructure, Roger has co-authored multiple papers on go-to-market strategy and operational excellence, and continues to shape how enterprises approach the challenges of deploying AI at scale.
As data centers increasingly rely on NVMe SSDs in virtualized and multi-tenant environments, the choice between SR-IOV (Single Root I/O Virtualization) and software-based virtualization significantly impacts storage performance. Despite SR-IOV's potential to deliver near-native I/O performance by enabling direct hardware-level access from virtual machines, there is a notable absence of standardized testing methodologies and publicly available performance data for SR-IOV-enabled SSDs.
This presentation addresses that gap by introducing a comprehensive testing methodology for evaluating SR-IOV performance on NVMe SSDs. We present a systematic comparison between SR-IOV and software virtualization approaches, demonstrating measurable performance advantages of SR-IOV in key metrics including IOPS, latency, and throughput across various workload profiles. Our methodology covers test environment setup, configuration parameters, workload generation, and result analysis procedures.
Attendees will gain a clear understanding of SR-IOV benefits for NVMe storage, a replicable testing framework, and data-driven insights to guide virtualization strategy decisions for their storage infrastructure.
The rise of data lakehouses and hybrid cloud has created new challenges for storing and managing large amounts of data. This talk will delve into the architecture, features, and best practices of Ceph Rados Gateway (RGW), an open-source, software-defined object storage solution. We will explore how RGW enables organizations to build scalable, resilient, and cost-effective object storage systems tailored to their unique requirements.
The presentation will begin with an overview of Ceph RGW, highlighting its key components and the benefits of using an open-source object storage solution. We will then discuss upcoming features in the next release along with the RGW architecture, focusing on its multi-site capabilities, data durability, and seamless integration with other Ceph services.
Chiplet-based architectures are transforming SoC design, but they also upend long-standing security assumptions that were implicitly guaranteed by a monolithic die. By disaggregating a monolithic die into multiple, often multi-vendor chiplets, the implicit silicon trust boundary disappears, expanding the attack surface to include chiplet substitution, weak chiplet compromise, and exposed die-to-die interconnects. This discussion explores why traditional SoC security models do not scale to chiplet systems and introduces a system-level security paradigm based on distributed trust with centralized authority.
Following our previous presentation at OCP, we have advanced our ASIC-based CXL-PNM initiative through collaboration with SoC partners, evolving the architecture toward a general-purpose core–based ASIC CXL-PNM platform. This approach extends beyond fixed-function acceleration by embedding programmable compute within a CXL-attached device, enabling flexible offload for memory-intensive and data-centric workloads.
In this talk, we will present the architectural evolution and representative use cases spanning data analytics, AI data processing, and memory-bound applications. More importantly, we aim to expand collaboration with researchers and industry partners in areas such as workload co-design, system software enablement, and standards-aligned innovation. Through sharing validated results and open research challenges, we seek to foster a broader ecosystem around CXL-PNM.
PCIe Gen5 NVMe SSDs are now fast enough that parity RAID frequently becomes CPU-bound, pushing teams toward high-frequency, power-hungry processors just to keep pace. Meanwhile, power has become one of the tightest constraints in modern data centers. This talk shares results from a joint Xinnor & Intel validation asking whether energy-efficient CPU cores (Intel Xeon 6 E-cores) can realistically drive high-end NVMe storage performance without compromising latency or predictability.
We present benchmark results and the reasoning behind them: how different I/O processing approaches shift the trade-off between peak IOPS and tail latency, why interrupt coalescing can dramatically boost random read throughput while introducing large latency penalties, and how a user-space polling datapath avoids that compromise. Using an 8x PCIe Gen5 NVMe setup, we compare RAID6 and RAID10 across Intel Xeon E-cores and P-cores, translating findings into practical design guidance for building performance-per-watt-optimized NVMe servers and NVMe-oF building blocks.
As AI models grow to ever-larger training datasets and parameter counts, the GPU’s memory is no longer sufficient and storage requirements continue to increase. New solutions are needed to build larger clusters of SSDs that are modular, flexible, and high-performing.
This presentation explores how large port-count PCIe switches unlock entirely new storage architectures for AI. By consolidating large numbers of SSDs under a single switching fabric, we can improve the current architecture with modular flexibility, unified management, dynamic drive allocation, and reduction of multi-hop latencies.
This presentation introduces Adaptive System-Level Reliability Testing (ASLRT), a dynamic validation framework for improving qualification efficiency in advanced flash storage systems. As memory scaling and firmware complexity increase, traditional pass/fail validation models fail to expose latent degradation mechanisms and condition-dependent firmware behaviors throughout the storage lifecycle. The proposed framework integrates continuous memory health monitoring (e.g., BER, FBC, and error trends), structured system-level corner-case checkers, and AI-driven unsupervised anomaly analytics to quantify device-level anomaly potential during validation. Rather than applying uniform lifecycle stressing, the framework adaptively steers validation depth, stress intensity, and sample allocation toward behaviorally marginal devices identified through system-level indicators. This closed-loop approach concentrates validation resources on marginal units while allowing nominal devices to complete qualification without unnecessary over-testing, enabling earlier exposure of degradation-related reliability risks and improved qualification efficiency without requiring silicon modification.
As SSDs and the storage stack become more advanced, attack surfaces morph and grow. Securing data down to the component level, protecting against supply chain attacks, cyber-attacks, and physical theft, has long been a concern in the industry. To that end, support for AES encryption and TCG standards is ‘table stakes’ for SSD controllers to protect your data, but changes are emerging on the horizon as the industry seeks to design in security and stay ahead of the evolving landscape of cyber threats. In this talk, we examine these new features and upcoming changes, including Caliptra, changes to AES, supply chain risk mitigations, and how new storage initiatives may extend your threat models.
A few years ago, some industry reports prematurely claimed that CXL was "dead" in the AI era, primarily citing its bandwidth limitations compared to Ethernet-based interconnects or NVLink. However, such assessments were narrowly focused on single-turn LLM inference, specifically the decoding stage. In the current landscape of Agentic AI, driven by reasoning, RAG (Retrieval-Augmented Generation), and multi-modal LLMs, the industry is witnessing explosive growth in KV cache and an urgent need for its efficient storage and sharing.
CXL is now emerging as a critical solution, leveraging its low-latency characteristics alongside large-capacity memory pooling and sharing technologies. This presentation explores how CXL memory enhances the performance of Agentic AI inference. Furthermore, we will introduce additional performance gains and energy efficiency achieved through Processing Near Memory (PNM) technologies. Finally, we will share Samsung’s latest System & SW innovations and empirical evaluation results that demonstrate the future of CXL-based AI infrastructure.
In 2026, enterprise IT teams are expected to reduce infrastructure costs, improve operational efficiency, and meet aggressive sustainability goals — all while supporting exponential data growth and uncompromising performance requirements. Storage decisions now have direct financial and environmental consequences.
This session explores:
- How modern storage architectures reduce CAPEX and OPEX through consolidation, simplified management, and higher utilization, delivering over 50% efficiency gains, faster ROI, and greater financial predictability without compromising performance or availability.
- How storage design lowers power, cooling, and hardware footprint, enabling enterprises to cut environmental impact while maintaining enterprise-class reliability.
We encourage you to challenge traditional storage growth models and look at practical guidance for aligning IT, financial, and ESG priorities through smarter storage strategy, supported by real-world enterprise outcomes.
Emerging AI applications cover a broad spectrum of usage models, each with its own characteristic interaction with storage. The trend toward a low drive-to-GPU ratio in compute nodes is pushing storage out of the compute rack and into nearby storage servers. Storage servers are relied upon to deliver capacity, bandwidth, IOPS, and serviceability. What's the right storage architecture? How many different kinds of storage servers do we need? Is there an opportunity for convergence? This talk will frame and address these questions that the broader community is working through.
Object storage has long been the domain of HDD arrays, as storing and retrieving large contiguous objects can maximize the utilization of an HDD's available performance while minimizing the cost per TB. Each GPU and CPU generation increases performance dramatically, resulting in significantly increased requirements for data ingest performance. Additionally, enabling RDMA for object storage moves data lakes closer to the compute.
Historically, the latency of object storage systems has been ignored. However, as GPUs are now accessing object storage directly, we find that HDDs can no longer sustain the requirements of large-scale object storage solutions.
In this session we explore the performance characteristics of an object storage system on HDDs and QLC SSDs. Attendees will leave this session with an understanding of when to deploy QLC SSDs for their object storage needs and when HDDs are still able to offer "enough" performance.
Refreshment Break
Chair's Remarks
Chair's Remarks
Chair's Remarks
CXL memory is evolving beyond simple capacity expansion for specific workloads toward a pooled architecture that enables more flexible and scalable memory disaggregation. By decoupling memory from individual hosts, CXL-based pooling introduces a shared, dynamically allocatable memory layer across heterogeneous compute nodes. This architectural shift provides a practical foundation for tiered memory systems that can transparently support KV cache offloading in large-scale AI inference workloads. In particular, CXL memory pooling enables intermediate-latency tiers between local DRAM and remote storage, optimizing both cost efficiency and performance. As a result, CXL-based tiering emerges as a key enabler for scalable and memory-efficient deployment of next-generation large language models.
Standard SSDs abstract physical data placement, restricting the host to purely logical LBA management. NVMe Flexible Data Placement (FDP) relaxes this barrier, enabling the host to segregate multiple data streams into physical Reclaim Units. This reduces device-internal write amplification, yielding tangible benefits: extended drive endurance, predictable QoS, and improved energy efficiency.
Exposing this hardware-software co-design requires architectural evolution within the OS. This talk will explore how the Linux storage stack is being adapted to support spatial data placement. We will cover the new I/O paths and user-space interfaces that are being introduced for application developers and infrastructure architects, enabling alignment between software data layouts and hardware-level isolation to reduce TCO and mitigate latency spikes.
In-memory computing (IMC) paradigms use memory elements as compute resources, allowing operations such as matrix-vector multiplication to execute directly within the memory array. By eliminating repeated data transfers between the memory and CPU, IMC architectures dramatically reduce latency and energy consumption. As AI models proliferate into edge, industrial, automotive, and secure embedded systems, the demand for compact, non-volatile, and scalable memory technologies becomes critical.
Emerging non-volatile memories, particularly Resistive RAM (ReRAM), offer a structurally and physically well-suited platform for analog and mixed-signal in-memory computation. ReRAM’s two-terminal cell structure, scalability below advanced nodes where embedded flash isn’t viable, and compatibility with BEOL integration can enable dense arrays capable of highly parallel computation. This talk will examine market and architectural trends driving IMC adoption, the technical requirements for commercially viable solutions, and why ReRAM presents a practical path from research prototypes to production-ready AI systems.
The semiconductor industry is projected to consume 237 TWh of electricity by 2030, intensifying pressure on already strained power resources. Cooling water systems can account for up to 40% of this energy use, making thermal management essential for sustainability and cost control. By optimizing these systems, manufacturers can recover and repurpose waste heat, reducing environmental impact.
Industry leaders are implementing strategies like heat recovery from ultrapure and process cooling water systems, chiller bypasses, and sector coupling. These actions lower operational expenses and carbon emissions. The expansion of district heating networks in Europe creates new opportunities to supply recovered heat to local communities and industries, transforming waste into clean energy.
Adapting successful practices from data centers and heavy industry, the semiconductor sector is leveraging cross-industry collaboration to drive innovation. Collectively, these approaches offer a holistic solution for efficient thermal management, helping the industry stay competitive, resilient, and environmentally responsible in a resource-constrained future.
Gaming SSDs are a popular high-end segment and typically offer very high synthetic benchmark performance as well as additional features. However, gaming workloads are not always optimized for storage. To provide the best user experience for games, we characterized the way games use storage and developed a new methodology for benchmarking storage for games and game performance.
Our research shows how typical QLC SSDs can be used for games, and what features matter to user experience in a gaming workload.
We also present an A/B study with gamers on popular PC games, demonstrating that modern QLC SSDs can provide a comparable gaming experience for frame rate, heavy graphics, and level transitions.
This presentation explores how CXL pooling/sharing could enable KV cache sharing across a memory hierarchy that includes VRAM, local DRAM, local SSD, and an ICMS-like tier. We focus on latency-sensitive and memory-capacity-hungry inference patterns (e.g., multi-turn serving, multi-adapter workloads) where KV reuse and prefix overlap are prominent.
The talk is concept-driven and grounded in published literature and public reports. We summarize expected benefits, outline deployment constraints (ecosystem maturity, correctness/coherence boundaries, software support), and discuss how to prioritize a deployable subset of CXL capabilities rather than assuming “full spec implementation” is always optimal.
Inference devices for the edge with low power, high performance, and small footprint are in high demand. Recent advances in analog compute-in-memory (a-CIM) to solve the data communication bottlenecks in neural networks have reignited significant research interest in using various types of non-volatile memory for a-CIM. SST’s split-gate flash (ESF) has been successfully developed and deployed in numerous stand-alone and embedded NOR products such as microcontrollers (MCUs), programmable logic devices, internet of things devices, and smart and secure cards. In this work, the implementation of a 2Mb ESF neuromorphic memory array as an analog vector matrix multiplier (VMM) will be discussed. The silicon results of the VMM fabricated using an ESF 28nm process will be presented along with design concepts and testing techniques.
Successful implementations of ESF neuromorphic memory in customer products will be showcased.
This talk covers the design and architectural opportunities for a sustainable and efficient AI Factory at scale:
- Achieving Net Zero: Define and evaluate key categories of greenhouse gas emissions
- AI Factory Overview: Share the AI full-stack spectrum and strategies that shape hardware and software choices
- In-Deployment Refactoring: Innovative examples for prolonging AI Factory power, cooling, and architecture
There’s no one-size-fits-all. You’ll leave with the practical know-how to maximize your AI Factory, delivering efficiency and sustainability at gigawatt scale.
This presentation proposes a novel approach on how performance and scale of such distributed inference models can be significantly improved with use of CXL attached memory instead of traditional storage. We present an architecture that can enables
ANAFLASH presents a time-domain compute-in-memory IP based on a proprietary mixed signal technology. The architecture uses a weight-stationary array to compute deep neural network operations with very high parallelism and power efficiency. The fabricated mixed-signal technology achieves 50 fJ per 4-bit MAC operation for dense activity networks in 22 nm CMOS. It can operate near threshold supply voltage and save even more energy. The measured silicon results are less than 100 microwatts of power consumption for real-time person detection. The IP is suitable to be embedded in a standard logic process with marginal area overhead near NAND flash memory for low power and high bandwidth AI inference.
While QLC NAND has become a key enabler for high-capacity SSDs, its slower read/write performance versus SLC and TLC remains a barrier for many real-world use cases. To bridge this gap, modern controllers increasingly use TLC mode on QLC NAND for critical portions of the workload.
This presentation reviews the role of QLC in today’s storage landscape and explains why relying solely on native QLC behavior leads to inconsistent user experience. We then introduce a TLC-mode–based architecture that selectively promotes hot or latency-sensitive data to TLC operation, with different TLC-mode optimization strategies tailored to specific application scenarios. Use cases and measurement examples will illustrate how this approach improves system responsiveness in boot, application launch, content creation, and light AI workloads, while preserving the cost and capacity advantages that make QLC attractive.
As AI transitions toward long-context models, fixed HBM capacity creates a critical "memory wall" and resource underutilization. This session introduces an architectural solution: scaling KV caches through transparent CXL memory-storage subsystems. We present a hardware-managed framework that unifies DRAM and NAND into a single, byte-addressable "InfiniteMemory" tier. Implemented via a CXL backend for LMCache, this approach is compared against RDMA-based frameworks such as NIXL and Mooncake.
Unlike RDMA-based disaggregation, which is often limited by software overhead and network jitter, our CXL architecture manages memory mapping and latency masking directly within the hardware controller. We will explore:
- Performance: CXL as a cost-efficient, high-performance alternative to RDMA networks.
- Latency Management: Hardware-level techniques to mask NAND latency for seamless LLM inference.
- TCO Optimization: Offloading memory from GPU to enable multi-terabyte, platform-agnostic KV cache.
This presentation provides architectural insight for building next-generation, memory-centric datacenters for the future of AI infrastructure.
Today’s computing systems are facing a number of challenging roadblocks, and these will bring new memory types (MRAM, ReRAM, FRAM, and PCM) into widespread use at all levels of computing systems. Join this session to learn how these technologies promise to change the way that tomorrow’s computers will approach data processing as they provide persistent storage not only near the processor, but even within the processor itself. Learn the effect this will have on processor price/performance and how system architecture will take advantage of these new technologies to manage data in altogether new and better ways.
Speaker to be determined.
Close of conference.