Score: 2

Pianist Transformer: Towards Expressive Piano Performance Rendering via Scalable Self-Supervised Pre-Training

Published: December 2, 2025 | arXiv ID: 2512.02652v1

By: Hong-Jie You, Jie-Jing Shao, Xiao-Wen Yang, and more

Potential Business Impact:

Makes computer-rendered piano music sound like a real person played it.

Business Areas:
Musical Instruments, Media and Entertainment, Music and Audio

Existing methods for expressive music performance rendering rely on supervised learning over small labeled datasets, which limits the scaling of both data volume and model size, despite the availability of vast unlabeled music, as in vision and language. To address this gap, we introduce Pianist Transformer, with four key contributions: 1) a unified Musical Instrument Digital Interface (MIDI) data representation for learning the shared principles of musical structure and expression without explicit annotation; 2) an efficient asymmetric architecture, enabling longer contexts and faster inference without sacrificing rendering quality; 3) a self-supervised pre-training pipeline with 10B tokens and a 135M-parameter model, unlocking data and model scaling advantages for expressive performance rendering; 4) a state-of-the-art performance model that achieves strong objective metrics and human-level subjective ratings. Overall, Pianist Transformer establishes a scalable path toward human-like performance synthesis in the music domain.
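
The abstract does not spell out the token scheme behind the unified MIDI representation, so the sketch below is purely illustrative: a common MIDI-like event encoding (NOTE_ON / NOTE_OFF / TIME_SHIFT / VELOCITY bins) that interleaves timing, dynamics, and pitch in one stream, which is the kind of representation that lets a single model learn structure and expression from unlabeled MIDI. All names, bin sizes, and time resolutions here are assumptions, not the paper's actual vocabulary.

```python
# Hypothetical sketch of a MIDI-like event tokenization; token names and
# bin sizes are illustrative assumptions, not taken from the paper.
from dataclasses import dataclass


@dataclass
class NoteEvent:
    onset: float    # seconds
    offset: float   # seconds
    pitch: int      # MIDI pitch, 0-127
    velocity: int   # MIDI velocity, 1-127


def tokenize(notes: list[NoteEvent], time_step: float = 0.01,
             max_shift_steps: int = 100, velocity_bins: int = 32) -> list[str]:
    """Flatten note events into one token stream so timing, dynamics,
    and pitch can be modeled jointly without explicit annotation."""
    # Build (time, priority, tokens) triples; offsets sort before onsets at ties.
    points = []
    for n in notes:
        v_bin = min(n.velocity * velocity_bins // 128, velocity_bins - 1)
        points.append((n.onset, 1, [f"VELOCITY_{v_bin}", f"NOTE_ON_{n.pitch}"]))
        points.append((n.offset, 0, [f"NOTE_OFF_{n.pitch}"]))
    points.sort(key=lambda p: (p[0], p[1]))

    tokens, clock = [], 0.0
    for t, _, toks in points:
        steps = round((t - clock) / time_step)
        while steps > 0:  # emit elapsed time as bounded TIME_SHIFT chunks
            chunk = min(steps, max_shift_steps)
            tokens.append(f"TIME_SHIFT_{chunk}")
            steps -= chunk
        clock = t
        tokens.extend(toks)
    return tokens


if __name__ == "__main__":
    notes = [NoteEvent(0.0, 0.48, 60, 80), NoteEvent(0.5, 0.95, 64, 96)]
    print(tokenize(notes))
```

Under this kind of encoding, a score and its expressive performance share one vocabulary: micro-timing and dynamics show up as shifted TIME_SHIFT and VELOCITY tokens rather than as separate annotations, which is what makes large-scale self-supervised pre-training on raw MIDI possible.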

Country of Origin
🇨🇳 China

Repos / Data Links

Page Count
24 pages

Category
Computer Science: Sound