Vishwas Saxena is a senior engineering leader and Distinguished Engineer at Sandisk with deep expertise in AI, machine learning, security, and system‑level software. He has architected and led multiple products across emerging technology areas and brings over 25 years of industry experience spanning engineering, architecture, and innovation.
On edge devices, KV cache reuse is a critical optimization for reducing prefill latency and memory bandwidth in large language model (LLM) inference. Existing systems support longest-prefix reuse, which fails to capture the partial, shifted, suffix, and mid-span overlaps common in real-world workloads such as conversational AI, retrieval-augmented generation (RAG), and templated prompts. This paper presents a hierarchical span-indexing architecture that enables efficient detection and validation of arbitrary reusable token intervals without incurring quadratic storage costs.
Our approach combines three components: (1) a segment-tree representation with composable rolling hashes for logarithmic-time reconstruction of arbitrary span fingerprints, (2) prefix and suffix tries for fast discovery of boundary matches, and (3) a coarse block-level hash index for constant-time retrieval of interior span candidates.
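As an illustration of component (1), the following is a minimal sketch of a segment tree over composable polynomial hashes, not the authors' implementation. The names `SpanHashTree`, `combine`, and `span_hash`, and the constants `MOD` and `BASE`, are assumptions made for this example. Each node stores a (hash, length) pair, and two adjacent fingerprints compose by shifting the left hash by the right length, so the fingerprint of any interval can be rebuilt from O(log n) nodes while the tree itself uses linear storage.

```python
# Illustrative sketch only: all names and constants here are hypothetical.
MOD = (1 << 61) - 1   # large Mersenne prime modulus (assumed choice)
BASE = 1_000_003      # polynomial base (assumed choice)
EMPTY = (0, 0)        # identity fingerprint: (hash, length)


def combine(left, right):
    """Fingerprint of the concatenation of two spans (order matters)."""
    lh, ll = left
    rh, rl = right
    return ((lh * pow(BASE, rl, MOD) + rh) % MOD, ll + rl)


class SpanHashTree:
    """Iterative segment tree whose nodes hold composable span fingerprints."""

    def __init__(self, tokens):
        self.n = len(tokens)
        self.size = 1
        while self.size < self.n:
            self.size *= 2
        # Leaves beyond n stay EMPTY, which is neutral under combine().
        self.tree = [EMPTY] * (2 * self.size)
        for i, tok in enumerate(tokens):
            self.tree[self.size + i] = ((tok + 1) % MOD, 1)
        for i in range(self.size - 1, 0, -1):
            self.tree[i] = combine(self.tree[2 * i], self.tree[2 * i + 1])

    def span_hash(self, lo, hi):
        """Fingerprint of tokens[lo:hi], built from O(log n) tree nodes."""
        left, right = EMPTY, EMPTY
        l, r = lo + self.size, hi + self.size
        while l < r:
            if l & 1:                       # l is a right child: take it
                left = combine(left, self.tree[l])
                l += 1
            if r & 1:                       # r-1 is a left child: take it
                r -= 1
                right = combine(self.tree[r], right)
            l >>= 1
            r >>= 1
        return combine(left, right)


stored = SpanHashTree([5, 6, 7, 8, 9])
request = SpanHashTree([1, 2, 6, 7, 8, 3])
# The span [6, 7, 8] sits at different offsets in the two sequences,
# yet its fingerprint is identical in both.
same = stored.span_hash(1, 4) == request.span_hash(2, 5)
```

Because the fingerprint depends only on the tokens in the span and not on its offset, equal spans at shifted positions produce equal hashes. A matching fingerprint is still only a candidate and would need token-level validation before any reuse.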
Together, these structures enable robust reuse of KV cache segments even when requested spans are shifted or misaligned relative to stored segments. The system requires only linear storage per sequence and achieves 30% lower prefill latency with 2× KV reuse, enabling scalable edge LLM serving.
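To make the handling of shifted spans concrete, here is a hedged sketch of how a coarse block-level index might surface interior candidates. It is an assumption-laden illustration rather than the system described above: `BlockIndex`, `candidates`, `extend`, and the block size of 4 are hypothetical, and raw token tuples serve as dictionary keys for clarity where a real system would presumably store compact fingerprints. Stored sequences are indexed only at block-aligned offsets, which keeps storage linear, while the query is scanned at every offset so that misaligned overlaps are still found.

```python
from collections import defaultdict

BLOCK = 4  # hypothetical block size, chosen only for this example


class BlockIndex:
    """Coarse index over block-aligned chunks of stored token sequences."""

    def __init__(self, block=BLOCK):
        self.block = block
        self.table = defaultdict(list)  # block tokens -> [(seq_id, start)]
        self.seqs = {}

    def add(self, seq_id, tokens):
        """Index every full, block-aligned chunk of a stored sequence."""
        self.seqs[seq_id] = list(tokens)
        for start in range(0, len(tokens) - self.block + 1, self.block):
            key = tuple(tokens[start:start + self.block])
            self.table[key].append((seq_id, start))

    def candidates(self, query):
        """Slide over every query offset; each lookup is one dict probe."""
        found = []
        for q in range(len(query) - self.block + 1):
            key = tuple(query[q:q + self.block])
            for seq_id, s in self.table.get(key, ()):
                found.append((seq_id, s, q))
        return found

    def extend(self, query, seq_id, s, q):
        """Validate a candidate by growing it token by token in both
        directions; returns (stored_start, query_start, length)."""
        stored = self.seqs[seq_id]
        lo_s, lo_q = s, q
        while lo_s > 0 and lo_q > 0 and stored[lo_s - 1] == query[lo_q - 1]:
            lo_s -= 1
            lo_q -= 1
        hi_s, hi_q = s + self.block, q + self.block
        while (hi_s < len(stored) and hi_q < len(query)
               and stored[hi_s] == query[hi_q]):
            hi_s += 1
            hi_q += 1
        return lo_s, lo_q, hi_s - lo_s


index = BlockIndex()
index.add("a", [10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
query = [1, 2, 3, 12, 13, 14, 15, 16, 17, 18, 99]
hits = index.candidates(query)        # [("a", 4, 5)]
span = index.extend(query, *hits[0])  # (2, 3, 7)
```

In this example the shared run `[12, ..., 18]` starts at offset 2 in the stored sequence and offset 3 in the query, so it is neither a prefix nor block-aligned. One aligned block inside it is enough to anchor the match, and extension recovers the full seven-token span.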