PHOTON: Hierarchical Autoregressive Modeling for Lightspeed and Memory-Efficient Language Generation
By: Yuma Ichikawa, Naoya Takagi, Takumi Nakagawa, and more
Potential Business Impact:
Makes AI read and respond to long documents much faster while using less memory.
Transformers operate as horizontal, token-by-token scanners: at each generation step, the model attends to an ever-growing sequence of token-level states. This access pattern increases prefill latency and makes long-context decoding increasingly memory-bound, since KV-cache reads and writes, rather than arithmetic, dominate inference throughput. We propose Parallel Hierarchical Operation for Top-down Networks (PHOTON), a hierarchical autoregressive model that replaces flat scanning with vertical, multi-resolution context access. PHOTON maintains a hierarchy of latent streams: a bottom-up encoder progressively compresses tokens into low-rate contextual states, while lightweight top-down decoders reconstruct fine-grained token representations. This design reduces decode-time KV-cache traffic, yielding up to $10^{3}\times$ higher throughput per unit memory. Experimental results show that PHOTON surpasses competitive Transformer-based language models in the throughput-quality trade-off, with significant advantages on long-context and multi-query tasks.
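As a rough illustration of the vertical access pattern described above, the sketch below implements a single-level version of the idea in PyTorch: chunks of tokens are compressed bottom-up into one latent each, a causal stream attends only over those latents, and a lightweight top-down decoder reconstructs token-level states. The class name PhotonSketch, the fixed chunk size, the mean pooling, and the stock nn.TransformerEncoderLayer / nn.TransformerDecoderLayer blocks are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class PhotonSketch(nn.Module):
    """Single-level sketch of hierarchical autoregressive generation (assumed design)."""

    def __init__(self, vocab_size=32000, d_model=512, chunk_size=8, n_heads=8):
        super().__init__()
        self.chunk_size = chunk_size
        self.embed = nn.Embedding(vocab_size, d_model)
        # Bottom-up encoder: mixes tokens within a chunk before pooling.
        self.bottom_up = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # Latent stream: causal attention over chunk latents only.
        self.latent_stream = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # Lightweight top-down decoder: token states attend to a chunk latent.
        self.top_down = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        # tokens: (batch, seq_len), seq_len assumed divisible by chunk_size.
        b, t = tokens.shape
        n = t // self.chunk_size
        x = self.embed(tokens)                        # (b, t, d)
        chunks = x.view(b * n, self.chunk_size, -1)   # (b*n, chunk, d)

        # Bottom-up compression: one low-rate latent per chunk (mean pooling
        # is a stand-in for whatever compression the paper actually uses).
        latents = self.bottom_up(chunks).mean(dim=1).view(b, n, -1)

        # Causal attention over the latent sequence: the decode-time cache
        # grows with the number of chunks rather than the number of tokens.
        causal = torch.triu(
            torch.ones(n, n, dtype=torch.bool, device=tokens.device), diagonal=1
        )
        ctx = self.latent_stream(latents, src_mask=causal)  # (b, n, d)

        # Top-down reconstruction: tokens in chunk i condition on the contextual
        # latent of chunk i-1 (zeros for the first chunk) to stay causal; label
        # shifting for the next-token loss is assumed to happen outside this sketch.
        prev_ctx = torch.cat([torch.zeros_like(ctx[:, :1]), ctx[:, :-1]], dim=1)
        mem = prev_ctx.reshape(b * n, 1, -1)
        tok_mask = torch.triu(
            torch.ones(self.chunk_size, self.chunk_size,
                       dtype=torch.bool, device=tokens.device),
            diagonal=1,
        )
        tok_states = self.top_down(chunks, mem, tgt_mask=tok_mask)
        return self.lm_head(tok_states).view(b, t, -1)       # (b, t, vocab)
```

The point of the sketch is the cache shape: the only state that grows with context length during decoding is the latent stream over chunks, which is what the abstract credits for the reduced KV-cache traffic.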
Similar Papers
Hierarchical Autoregressive Transformers: Combining Byte- and Word-Level Processing for Robust, Adaptable Language Models
Computation and Language
Computers understand any text, even with mistakes.
Fast Inference via Hierarchical Speculative Decoding
Machine Learning (CS)
Makes AI write stories much faster.