Tong Zhang | Chief Scientist
ScaleFlux

Dr. Tong Zhang is a recognized leader in data storage systems and VLSI signal processing, with more than 25 years of research experience and significant technical contributions to the field. He is a co-founder and Chief Scientist of ScaleFlux, where he leads technology pathfinding for next-generation memory and storage devices supporting AI and data-intensive computing. Dr. Zhang is also a Professor at Rensselaer Polytechnic Institute (RPI), where he has graduated 21 Ph.D. students and mentored a broad range of research in storage systems, signal processing, and computer architecture. He has authored or co-authored over 160 technical papers, with more than 7,000 citations and an h-index of 45, and has served in leadership roles for major IEEE/ACM conferences and journals. His research includes pioneering contributions to enabling the widespread adoption of low-density parity-check (LDPC) codes in commercial HDDs and SSDs, as well as establishing the research area of flash memory signal processing. His work has had lasting impact on both academic research and industry practice in modern data storage technologies. Dr. Zhang received his B.S. and M.S. degrees in Electrical Engineering from Xi’an Jiaotong University in 1995 and 1998, respectively, and his Ph.D. in Electrical and Computer Engineering from the University of Minnesota in 2002. He is a Fellow of the IEEE.

Appearances:



Future of Memory and Storage - Day 1 @ 10:35

Reinventing CXL Memory for RAS and TCO

CXL memory is emerging as a foundational building block for next-generation data center systems, but scaling capacity cost-effectively requires simultaneous advances in reliability, availability, serviceability (RAS), and total cost of ownership (TCO). This talk presents two complementary innovations targeting CXL memory expansion. First, we introduce a DRAM-oriented Reed–Solomon list decoding architecture that goes beyond minimum-distance decoding without inventing new ECC codes, significantly strengthening DRAM fault tolerance while maintaining ultra-low latency and high throughput.

Second, we present an in-line, data-dependent adaptive compression architecture designed for CXL.mem devices that expands effective memory capacity while keeping read latency well below 250ns and avoiding CPU overhead. By reducing intra-CXL DRAM read/write amplification and improving effective bandwidth utilization, this approach lowers DRAM footprint and system cost without compromising performance. Together, these innovations outline a practical path for improving both RAS and TCO in future CXL memory systems.

Future of Memory and Storage - Day 3 @ 08:35

Breaking the HBM Cost Wall for AI Inference

HBM has become the dominant cost driver in large-scale AI inference systems. Capacity pressure stems not only from large model weights, particularly in Mixture-of-Experts (MoE) architectures, but also from rapidly growing KV cache during long-context decoding. Meanwhile, strict device-level reliability constraints inflate HBM $/GB. This talk presents a unified architecture to reduce effective HBM cost per token by addressing both reliability overhead and memory residency pressure without changing the standard HBM interface.

We treat reliability as a controller-defined resource, enabling relaxed raw BER targets through coarse-grained protection and selective safeguarding of critical data fields. We further position HBM as a high-bandwidth cache backed by larger LPDDR/CXL memory. Dynamic KV placement distributes state across tiers to aggregate bandwidth under capacity constraints, while inactive MoE experts remain compressed in the lower tier to reduce footprint and migration bandwidth. Together, these mechanisms lower HBM cost per token while preserving throughput and correctness.

Future of Memory and Storage - Day 3 @ 10:05

When Flash Becomes Memory: Rethinking the Five-Minute Rule for AI Infrastructure

In 1987, Jim Gray’s Five-Minute Rule provided a simple economic guideline for deciding when data should reside in DRAM versus storage. Revisited multiple times over four decades, the break-even interval consistently remained on the order of minutes, reinforcing flash as a secondary storage tier. This talk reexamines the rule from first principles in the AI era. We introduce a feasibility-aware framework that integrates host processor cost, DRAM bandwidth and capacity, device-level NAND timing, channel parallelism, and realistic SSD IOPS/$ scaling. We show that when GPU-centric hosts are paired with Storage-Next SSDs delivering 50M+ small-block IOPS, the DRAM↔flash caching threshold collapses from minutes to seconds.

This shift promotes NAND flash from a passive capacity layer to an active extension tier of memory, with GPUs emerging as high-throughput I/O engines. We will present analytical insights, device-level modeling results, and system-level implications for AI infrastructure, including vector databases, recommender systems, and large-scale inference. The result is a new provisioning framework that redefines memory-storage balance for modern AI workloads.
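
For readers unfamiliar with the rule, the classic break-even calculation can be sketched as below. This is a minimal illustration of the original formulation only (pages per MB of DRAM divided by device accesses per second, multiplied by device price divided by DRAM price per MB); it is not the feasibility-aware framework described in the talk. The function name and every input value are hypothetical placeholders chosen for illustration, not figures from the talk.

```python
def break_even_seconds(page_bytes, device_iops, device_price, dram_price_per_mb):
    """Classic Five-Minute Rule break-even interval, in seconds.

    A page re-accessed more often than this interval is cheaper to keep in
    DRAM; a page accessed less often is cheaper to fetch from the device.
    """
    pages_per_mb = (1024 * 1024) / page_bytes
    return (pages_per_mb / device_iops) * (device_price / dram_price_per_mb)


# Hypothetical disk-era inputs: 8 KB pages, 64 accesses/s, $2000 drive,
# $15 per MB of DRAM. These are illustrative assumptions.
disk_era = break_even_seconds(8192, 64, 2000.0, 15.0)

# Hypothetical high-IOPS SSD inputs: 512 B pages, 50M IOPS, $2000 device,
# $0.005 per MB of DRAM. These are illustrative assumptions, not measurements.
fast_ssd = break_even_seconds(512, 50_000_000, 2000.0, 0.005)

print(f"disk-era break-even: {disk_era:.1f} s")
print(f"high-IOPS SSD break-even: {fast_ssd:.1f} s")
```

With these placeholder inputs, the first case lands at a few minutes and the second at well under a minute, which shows how the interval shrinks as device IOPS per dollar grows faster than DRAM gets cheaper. The talk's framework adds constraints this simple ratio ignores, such as host processor cost and DRAM bandwidth.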

last published: 19/May/26 18:25 GMT
