
Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate

Published: July 8, 2025 | arXiv ID: 2507.07129v1

By: A. Bochkov

Potential Business Impact:

Builds smarter AI by combining and growing parts.

Business Areas:

Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

The prevailing paradigm for scaling large language models (LLMs) involves monolithic, end-to-end training, a resource-intensive process that lacks flexibility. This paper explores an alternative, constructive approach to model development, built upon the foundation of non-trainable, deterministic input embeddings. In prior work [1], we established that high-level semantic reasoning can emerge in Transformers using frozen embeddings derived from the visual structure of Unicode glyphs. Here, we demonstrate that this fixed representational substrate acts as a universal "docking port," enabling two powerful and efficient scaling paradigms: seamless modular composition and progressive layer-wise growth. First, we show that specialist models trained on disparate datasets (e.g., Russian and Chinese text) can be merged into a single, more capable Mixture-of-Experts (MoE) model, post-training, with zero architectural modification. This is achieved by simply averaging their output logits. The resulting MoE model exhibits immediate performance improvements on reasoning benchmarks like MMLU, surpassing its constituent experts without catastrophic forgetting. Second, we introduce a layer-wise constructive training methodology, where a deep Transformer is "grown" by progressively stacking and training one layer at a time. This method demonstrates stable convergence and a clear correlation between model depth and the emergence of complex reasoning abilities, such as those required for SQuAD. Our findings suggest a paradigm shift from monolithic optimization towards a more biological or constructive model of AI development, where complexity is built incrementally and modules can be composed freely. This opens new avenues for resource-efficient scaling, continual learning, and a more democratized ecosystem for building powerful AI systems. We release all code and models to facilitate further research.
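
The logit-averaging merge described in the abstract can be illustrated with a minimal sketch. This is not the authors' released code: the function names (`average_logits`, `softmax`) and the toy four-token vocabulary are assumptions made for illustration. The sketch only assumes that each specialist emits one logit per token of the same shared vocabulary, which the frozen embedding substrate is said to make possible.

```python
import math
from typing import List, Sequence


def average_logits(expert_logits: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of per-expert logit vectors over a shared vocabulary."""
    if not expert_logits:
        raise ValueError("need at least one expert")
    vocab_size = len(expert_logits[0])
    if any(len(row) != vocab_size for row in expert_logits):
        raise ValueError("all experts must share the same vocabulary size")
    n = len(expert_logits)
    # zip(*rows) walks the vocabulary one token position at a time.
    return [sum(column) / n for column in zip(*expert_logits)]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical next-token logits from two specialists (values are made up).
russian_expert = [2.0, 0.0, -1.0, 0.5]
chinese_expert = [0.0, 2.0, -1.0, 1.5]

merged = average_logits([russian_expert, chinese_expert])
probs = softmax(merged)
print(merged)
print(probs)
```

Because the average is taken over logits rather than weights, the experts need no architectural changes, only an identical output vocabulary. Whether such a merge improves benchmark scores is the paper's empirical claim, which this toy example does not test.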

Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations

Computation and Language

Lets computers understand words without learning their meaning.

7 Jul 2025 1

88%

DeepInsert: Early Layer Bypass for Efficient and Performant Multimodal Understanding

CV and Pattern Recognition

Makes AI learn faster and use less power.

27 Apr 2025 0

88%

Scaling Reasoning without Attention

Machine Learning (CS)

Makes computers think smarter and faster.

28 May 2025 2


Repos / Data Links

github.com github.com

Page Count

8 pages
