EdgeInfinite-Instruct: Bridging SFT-Based Optimization and NPU-Level Efficiency for Edge Devices
By: Jiyu Chen, Poh Seng Lim, Shuang Peng, and more
Potential Business Impact:
Makes smart AI work faster on phones.
Deploying Transformer-based large language models (LLMs) on resource-constrained edge devices for long-sequence tasks remains challenging due to the quadratic time complexity of self-attention and growing Key-Value (KV) cache demands. While existing KV cache optimizations improve memory efficiency, they often fail to reduce time to first token (TTFT) and may degrade performance through token pruning. Alternative sequence-modeling architectures address some of these limitations but typically require full retraining and lack infrastructure support. EdgeInfinite offers an efficient solution by fine-tuning only a small subset of parameters, maintaining quality while reducing both computational and memory costs, including improved TTFT. However, its instruction-following ability is limited, and it lacks mobile-specific optimizations. To address these issues, we propose EdgeInfinite-Instruct, which introduces a Segmented Supervised Fine-Tuning (S-SFT) strategy tailored to long-sequence tasks such as summarization and question answering. We further optimize EdgeInfinite-Instruct for efficient deployment on edge NPUs by applying fine-grained post-training quantization (PTQ) to reduce computational demands while preserving accuracy, and by implementing a fixed-shape computation graph that balances memory usage and on-device efficiency through scenario-specific customization of input-token and cache sizes. Experiments on long-context benchmarks and real-world mobile tasks show that our approach improves domain-specific performance while maintaining efficiency on NPU-accelerated edge devices.
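To make the fixed-shape idea concrete, here is a minimal sketch of segment-wise prefill with static tensor shapes, which is the general pattern such NPU deployments rely on. It is illustrative only, not the paper's implementation: the constants `SEG_LEN` and the `model.forward(segment, memory)` / `model.init_memory()` interface are hypothetical stand-ins, chosen to mirror the abstract's description of fixed input-token sizes and a bounded memory state.

```python
import torch

# Hypothetical scenario-specific constant fixed at graph-export time,
# so the NPU computation graph never needs dynamic-shape support.
SEG_LEN = 512  # fixed input-token (segment) size per forward pass


def prefill_in_segments(model, input_ids):
    """Process a long prompt as a sequence of fixed-shape segments.

    `model` is assumed (hypothetically) to expose init_memory() and a
    forward(segment, memory) -> (logits, updated_memory) call, loosely
    mirroring a compressive-memory design; the real EdgeInfinite-Instruct
    API may differ.
    """
    memory = model.init_memory()  # bounded memory state, fixed size
    logits = None
    n = input_ids.size(0)
    for start in range(0, n, SEG_LEN):
        seg = input_ids[start:start + SEG_LEN]
        if seg.size(0) < SEG_LEN:
            # Pad the final segment so every forward pass sees the
            # same static shape expected by the compiled NPU graph.
            pad = torch.zeros(SEG_LEN - seg.size(0), dtype=seg.dtype)
            seg = torch.cat([seg, pad])
        logits, memory = model(seg.unsqueeze(0), memory)
    return logits, memory
```

Because each forward pass consumes exactly `SEG_LEN` tokens and the memory state has a fixed size, peak memory is constant in prompt length, which is what lets a single precompiled graph serve arbitrarily long inputs on device.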
Similar Papers
Eliciting Fine-Tuned Transformer Capabilities via Inference-Time Techniques
Machine Learning (CS)
Lets computers learn without retraining them.
Infinity Instruct: Scaling Instruction Selection and Synthesis to Enhance Language Models
Computation and Language
Makes free AI understand and talk better.
InvertiTune: High-Quality Data Synthesis for Cost-Effective Single-Shot Text-to-Knowledge Graph Generation
Computation and Language
Teaches computers to build knowledge maps from text.