Information-Theoretic Greedy Layer-wise Training for Traffic Sign Recognition
By: Shuyan Lyu, Zhanzimo Wu, Junliang Du
Potential Business Impact:
Trains AI faster and with less memory.
Modern deep neural networks (DNNs) are typically trained end-to-end in a supervised manner with a global cross-entropy loss: every neuron must store its outgoing weights, and training alternates between a forward pass (computation) and a top-down backward pass (learning), which is biologically implausible. Greedy layer-wise training, by contrast, eliminates the need for a global cross-entropy loss and end-to-end backpropagation. By avoiding the computation of intermediate gradients and the storage of intermediate outputs, it reduces memory usage and helps mitigate issues such as vanishing or exploding gradients. However, most existing layer-wise training approaches have been evaluated only on relatively small datasets with simple architectures. In this paper, we first systematically analyze the training dynamics of popular convolutional neural networks (CNNs) trained by stochastic gradient descent (SGD) through an information-theoretic lens. Our findings reveal that networks converge layer by layer from bottom to top and that the flow of information adheres to a Markov information bottleneck principle. Building on these observations, we propose a novel layer-wise training approach based on the recently developed deterministic information bottleneck (DIB) and the matrix-based Rényi α-order entropy functional. Specifically, each layer is trained jointly with an auxiliary classifier that connects it directly to the output, enabling the learning of minimal sufficient task-relevant representations. We empirically validate the effectiveness of our training procedure on CIFAR-10 and CIFAR-100 using modern deep CNNs, and further demonstrate its applicability to a practical traffic sign recognition task. Our approach not only outperforms existing layer-wise training baselines but also achieves performance comparable to SGD.
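To make the recipe concrete, here is a minimal PyTorch sketch of the two ingredients the abstract names: the matrix-based Rényi α-order entropy, estimated from the eigenvalues of a unit-trace Gram matrix, and a greedy layer-wise loop in which each block is trained with an auxiliary classifier under a DIB-style objective (keep the representation predictive, penalize its entropy). The function names, the Gaussian kernel choice, and the hyperparameters (sigma, alpha, beta) are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

def renyi_entropy(z, alpha=1.01, sigma=1.0):
    """Matrix-based Renyi alpha-order entropy of a batch of representations.

    z: (n, ...) tensor, flattened per sample. Builds a Gaussian Gram matrix,
    normalizes it to unit trace, and returns
    S_alpha(A) = 1/(1-alpha) * log2(sum_i lambda_i(A)^alpha).
    """
    z = z.flatten(1)
    n = z.shape[0]
    dist = torch.cdist(z, z) ** 2                # pairwise squared distances
    K = torch.exp(-dist / (2 * sigma ** 2))      # Gaussian kernel, K_ii = 1
    A = K / n                                    # unit-trace normalization
    eigvals = torch.linalg.eigvalsh(A).clamp(min=1e-8)
    return (1.0 / (1.0 - alpha)) * torch.log2((eigvals ** alpha).sum())

def train_layerwise(blocks, aux_heads, loader, beta=0.01, epochs=10, lr=1e-3):
    """Hypothetical greedy loop: train one block at a time, earlier blocks frozen."""
    frozen = nn.Sequential()                     # already-trained prefix
    for block, head in zip(blocks, aux_heads):
        params = list(block.parameters()) + list(head.parameters())
        opt = torch.optim.SGD(params, lr=lr)
        ce = nn.CrossEntropyLoss()
        for _ in range(epochs):
            for x, y in loader:
                with torch.no_grad():            # no gradients through the prefix
                    x = frozen(x)
                t = block(x)                     # current layer's representation T
                # DIB-style surrogate: auxiliary cross-entropy keeps T sufficient
                # for the task; the entropy penalty keeps T minimal (compressed).
                loss = ce(head(t), y) + beta * renyi_entropy(t)
                opt.zero_grad()
                loss.backward()
                opt.step()
        frozen.append(block.eval())              # freeze the block and move on
    return frozen
```

Freezing the trained prefix and detaching its output is what removes the global top-down backward pass: gradients never flow past the block currently being trained, so no intermediate activations of earlier blocks need to be stored for learning.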
Similar Papers
A Generalized Information Bottleneck Theory of Deep Learning
Machine Learning (CS)
Helps computers learn better by understanding feature connections.
Stabilizing Information Flow Entropy: Regularization for Safe and Interpretable Autonomous Driving Perception
Machine Learning (CS)
Makes self-driving cars see problems better.
Mixture of Balanced Information Bottlenecks for Long-Tailed Visual Recognition
Computer Vision and Pattern Recognition
Helps computers recognize many things, even rare ones.