Architectural Trade-offs in Small Language Models Under Compute Constraints
By: Shivraj Singh Bhatti
Potential Business Impact:
Makes small AI models smarter with less computing power.
We present a systematic empirical study of small language models under strict compute constraints, analyzing how architectural choices and training budget interact to determine performance. Starting from a linear next-token predictor, we progressively introduce nonlinearities, self-attention, and multi-layer transformer architectures, evaluating each on character-level modeling of Tiny Shakespeare and word-level modeling of Penn Treebank (PTB) and WikiText-2. We compare models using test negative log-likelihood (NLL), parameter count, and approximate training FLOPs to characterize accuracy-efficiency trade-offs. Our results show that attention-based models dominate MLPs in per-FLOP efficiency even at small scale, while increasing depth or context without sufficient optimization can degrade performance. We further examine rotary positional embeddings (RoPE), finding that architectural techniques successful in large language models do not necessarily transfer to small-model regimes.
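As a rough illustration of the comparison described in the abstract, the sketch below uses the common 6·N·D approximation for training FLOPs (N = parameter count, D = training tokens) and compares models by test NLL at a similar compute budget. The heuristic, the model names, and all numbers are assumptions for illustration only, not taken from the paper.

# Minimal sketch (not the authors' code) of NLL-vs-FLOPs comparison.
# Assumption: training compute is approximated as C ~= 6 * N * D.

def approx_train_flops(n_params: int, n_tokens: int) -> float:
    # ~6 FLOPs per parameter per training token (forward + backward pass).
    return 6.0 * n_params * n_tokens

# Hypothetical small models trained on the same token budget (illustrative numbers).
models = {
    "mlp":         {"params": 1_200_000, "tokens": 5_000_000, "test_nll": 1.62},
    "transformer": {"params": 1_100_000, "tokens": 5_000_000, "test_nll": 1.48},
}

for name, m in models.items():
    flops = approx_train_flops(m["params"], m["tokens"])
    # Lower test NLL at comparable approximate FLOPs indicates better per-FLOP efficiency.
    print(f"{name:12s} test NLL={m['test_nll']:.2f}  params={m['params']:,}  ~train FLOPs={flops:.2e}")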
Similar Papers
Scaling Intelligence: Designing Data Centers for Next-Gen Language Models
Hardware Architecture
Builds faster, cheaper data centers for giant AI.
System-performance and cost modeling of Large Language Model training and inference
Hardware Architecture
Makes big AI models cheaper to train and run.
Speed Always Wins: A Survey on Efficient Architectures for Large Language Models
Computation and Language
Makes AI smarter and faster to use.