Architectural Trade-offs in Small Language Models Under Compute Constraints
By: Shivraj Singh Bhatti
Potential Business Impact:
Makes small AI models smarter with less computing power.
We present a systematic empirical study of small language models under strict compute constraints, analyzing how architectural choices and training budget interact to determine performance. Starting from a linear next-token predictor, we progressively introduce nonlinearities, self-attention, and multi-layer transformer architectures, evaluating each on character-level modeling of Tiny Shakespeare and word-level modeling of Penn Treebank (PTB) and WikiText-2. We compare models using test negative log-likelihood (NLL), parameter count, and approximate training FLOPs to characterize accuracy-efficiency trade-offs. Our results show that attention-based models dominate MLPs in per-FLOP efficiency even at small scale, while increasing depth or context without sufficient optimization can degrade performance. We further examine rotary positional embeddings (RoPE), finding that architectural techniques successful in large language models do not necessarily transfer to small-model regimes.
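As a rough illustration of the comparison described in the abstract, the sketch below uses the common 6·N·D approximation for training FLOPs (N = parameter count, D = training tokens) and compares models by test NLL at a similar compute budget. The heuristic, the model names, and all numbers are assumptions for illustration only, not taken from the paper.

# Minimal sketch (not the authors' code) of NLL-vs-FLOPs comparison.
# Assumption: training compute is approximated as C ~= 6 * N * D.

def approx_train_flops(n_params: int, n_tokens: int) -> float:
    # ~6 FLOPs per parameter per training token (forward + backward pass).
    return 6.0 * n_params * n_tokens

# Hypothetical small models trained on the same token budget (illustrative numbers).
models = {
    "mlp":         {"params": 1_200_000, "tokens": 5_000_000, "test_nll": 1.62},
    "transformer": {"params": 1_100_000, "tokens": 5_000_000, "test_nll": 1.48},
}

for name, m in models.items():
    flops = approx_train_flops(m["params"], m["tokens"])
    # Lower test NLL at comparable approximate FLOPs indicates better per-FLOP efficiency.
    print(f"{name:12s} test NLL={m['test_nll']:.2f}  params={m['params']:,}  ~train FLOPs={flops:.2e}")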
Similar Papers
Scaling Intelligence: Designing Data Centers for Next-Gen Language Models
Hardware Architecture
Builds faster, cheaper data centers for giant AI.
System-performance and cost modeling of Large Language Model training and inference
Hardware Architecture
Makes big AI models cheaper to train and run.
Speed Always Wins: A Survey on Efficient Architectures for Large Language Models
Computation and Language
Makes AI smarter and faster to use.