This document summarizes the challenges of training deep feedforward neural networks, specifically investigating why standard gradient descent with random initialization performs poorly. The authors examine how non-linear activation functions, such as the sigmoid, hyperbolic tangent, and softsign, affect network performance and unit saturation. They further analyze how activations and gradients evolve across layers and over the course of training, which motivates their proposal of a "normalized" initialization scheme designed to accelerate convergence. The findings suggest that appropriate activation functions and initialization techniques are crucial for improving the learning dynamics and overall effectiveness of deep neural networks.
Source: https://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf
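As a rough illustration of the ideas above, the sketch below (NumPy; the helper names `normalized_init` and `softsign` are my own, not from the paper) shows the normalized initialization the paper proposes, which draws weights uniformly from ±sqrt(6 / (fan_in + fan_out)), together with the softsign activation x / (1 + |x|) discussed as a gentler-saturating alternative to tanh. This is a minimal sketch under those assumptions, not the authors' implementation.

```python
import numpy as np

def normalized_init(fan_in, fan_out, rng=None):
    """Normalized ("Xavier/Glorot") initialization: weights drawn uniformly
    from +/- sqrt(6 / (fan_in + fan_out)), chosen so that activation and
    back-propagated gradient variances stay roughly constant across layers
    at the start of training."""
    rng = np.random.default_rng() if rng is None else rng
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def softsign(x):
    """Softsign activation x / (1 + |x|): saturates polynomially rather
    than exponentially like tanh, so units stay out of the flat regions
    of the non-linearity for longer."""
    return x / (1.0 + np.abs(x))

# Example: initialize a single 784 -> 256 layer and pass a random batch through it.
rng = np.random.default_rng(0)
W = normalized_init(784, 256, rng)
b = np.zeros(256)
x = rng.standard_normal((32, 784))
h = softsign(x @ W + b)
print(h.shape, float(h.std()))
```

In practice, the point of the uniform bound sqrt(6 / (fan_in + fan_out)) is that it balances the forward-pass variance (which depends on fan_in) against the backward-pass gradient variance (which depends on fan_out), which is why the layer widths of both sides appear in the denominator.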