Shubham Subhash Deshmukh | Staff Engineer
Samsung Semiconductor India Research

Shubham Subhash Deshmukh is a Staff Engineer (MLE) at Samsung Semiconductor India Research (SSIR), Bengaluru. He has over 8 years of experience in machine learning, LLMs, inference optimization, and large-scale AI systems. Shubham is the first author of 6+ research papers across ACM and IEEE venues (including INDICON, APSLOPS, SNPD, and WINTECHCON), and he contributes to the open-source Scalable Memory Development Kit (SMDK) for scalable, disaggregated AI infrastructure and inference efficiency. He completed his master's in Data Science and Engineering at BITS Pilani and holds a bachelor's degree in Computer Science and a diploma in Computer Engineering. Previously, he worked as an NLP Engineer at Buddhimed Technologies, developing clinical NLP and SOTA OCR-based document pipelines, and as an Analyst at Concentrix Global Analytics, building NLP pipelines for large enterprise datasets. His skills include CXL (Compute Express Link), disaggregated memory systems, tiered memory architectures, memory-aware scheduling, GPU/CPU memory optimization, PyTorch, performance analysis, large-scale data processing, and experimental benchmarking.

Appearances:

Future of Memory and Storage - Day 2 @ 09:05

CXL-SUBLET: A Key-Value cache management system for multi-user LLM serving

CXL-SUBLET is a key-value cache management system for multi-user LLM serving that keeps KV blocks as session state across GPU HBM and a shared CXL.mem pool. It uses session leases to move idle-session KV blocks from HBM to CXL.mem, and budgeted hydration to restore only selected blocks within a fixed resume-time budget, avoiding bulk transfer stalls and controlling CXL bandwidth. CXL-SUBLET maintains a CXL-resident directed acyclic graph (DAG) that stores shared prefix KV blocks once and reuses them at block granularity across users and replicas, while isolating each user’s subsequent tokens via copy-on-write. In our 7B FP16 setup, the KV cache costs 512KB per token (2GB per 4K-token session), so 1TB of CXL.mem holds ~500 idle 4K-token sessions, roughly 12× the idle-session retention capacity of an 80GB HBM-only configuration (~40 sessions), while HBM stays reserved for active compute. Our evaluation is SLO-driven: we report the maximum number of concurrent sessions per GPU that still meets the resume targets (P95≤200ms, P99≤400ms), along with prefix hit rate, KV bytes restored per resume, and CXL data movement (total bytes moved), which together capture transfer overhead under the SLOs.
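
The capacity figures in the abstract can be reproduced with a short back-of-the-envelope calculation. The sketch below is illustrative only: the layer count (32) and hidden size (4096) are assumptions typical of a 7B model with full multi-head attention and are not stated in the abstract, all function and constant names are hypothetical, and 1TB is taken as 1000GB to match the abstract's rounding. It also follows the abstract in counting the full 80GB of HBM toward KV storage, ignoring model weights.

```python
# Back-of-the-envelope KV cache sizing for a 7B FP16 model.
# ASSUMPTIONS (not from the abstract): 32 layers, hidden size 4096,
# full multi-head attention (no grouped-query sharing of K/V heads).
NUM_LAYERS = 32
HIDDEN_SIZE = 4096
BYTES_PER_VALUE = 2          # FP16
TOKENS_PER_SESSION = 4096    # "4K-token session"

KB = 1024
GB = 1024 ** 3
TB = 1000 * GB               # rounded as in the abstract (~500 sessions per TB)


def kv_bytes_per_token(layers=NUM_LAYERS, hidden=HIDDEN_SIZE,
                       dtype_bytes=BYTES_PER_VALUE):
    """One key vector and one value vector are cached per layer per token."""
    return 2 * layers * hidden * dtype_bytes


def sessions_that_fit(capacity_bytes, tokens=TOKENS_PER_SESSION):
    """Whole idle sessions whose KV cache fits in the given capacity."""
    return capacity_bytes // (kv_bytes_per_token() * tokens)


per_token = kv_bytes_per_token()
per_session = per_token * TOKENS_PER_SESSION
hbm_sessions = sessions_that_fit(80 * GB)
cxl_sessions = sessions_that_fit(1 * TB)
ratio = cxl_sessions / hbm_sessions

print(f"KV per token:   {per_token // KB} KB")
print(f"KV per session: {per_session // GB} GB")
print(f"HBM-only (80GB): {hbm_sessions} sessions")
print(f"CXL.mem (1TB):   {cxl_sessions} sessions ({ratio:.1f}x)")
```

Under these assumptions the numbers line up with the abstract: 512KB per token, 2GB per session, 40 sessions in 80GB, and 500 sessions per TB. A model using grouped-query attention or a quantized KV cache would have a smaller per-token footprint and correspondingly different session counts.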

Shubham Subhash Deshmukh, Staff Engineer, Samsung Semiconductor India Research

Paulami Das, Staff Engineer, Samsung Electronics

last published: 19/May/26 18:25 GMT
