Is Temperature Sample Efficient for Softmax Gaussian Mixture of Experts?

2024 | Huy Nguyen¹, Pedram Akbarian², Nhat Ho¹
This paper investigates the sample efficiency of the temperature parameter in the dense-to-sparse gating Gaussian mixture of experts (MoE). The dense-to-sparse gate uses a temperature to control the softmax weight distribution and the sparsity of the gate during training, promoting expert specialization. However, previous theoretical analyses of sparse MoE have not fully explored this gating mechanism, so the paper analyzes the impact of dense-to-sparse gating on maximum likelihood estimation under the Gaussian MoE.

The paper shows that, due to interactions between the temperature and other model parameters via partial differential equations, the convergence rates of parameter estimation are slower than any polynomial rate, potentially as slow as $ \mathcal{O}(1/\log(n)) $, where $n$ is the sample size. To address this, the paper proposes a novel activation dense-to-sparse gate, which routes the output of a linear layer through an activation function before delivering it to the softmax function. By imposing linear independence conditions on the activation function and its derivatives, the paper shows that the parameter estimation rates improve significantly, to polynomial rates. A simulation study empirically validates these theoretical results.

The main contributions are: (1) establishing the convergence rate of density estimation under the Total Variation distance, and (2) proposing the activation dense-to-sparse gate to improve parameter estimation rates. Under exact-specified settings, the estimation rates for $ \beta_{1i}^{*}, \tau^{*} $ are slower than any polynomial rate, while those for $ a_{i}^{*}, b_{i}^{*}, \nu_{i}^{*} $ are significantly faster. Under over-specified settings, the rates for $ \beta_{1i}^{*}, \tau^{*} $ remain unchanged, while those for $ a_{i}^{*} $ become slower than any polynomial rate due to the PDE (4). The estimation rates for $ b_{i}^{*}, \nu_{i}^{*} $ depend on the solvability of a system of polynomial equations.
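To make the two gating mechanisms concrete, the sketch below contrasts a standard temperature-scaled softmax gate with an activation dense-to-sparse gate that applies an activation to the linear layer's output before the softmax. This is an illustrative NumPy implementation, not the paper's code: the names `B1`, `b0`, and `tau` are assumed stand-ins for the gating parameters $\beta_{1i}^{*}$, $\beta_{0i}^{*}$, and the temperature $\tau^{*}$, and `tanh` is a placeholder choice of activation.

```python
import numpy as np

def dense_to_sparse_gate(x, B1, b0, tau):
    """Temperature-scaled softmax gating weights for one input x.

    Illustrative only: B1 (experts x dim), b0 (experts,), and tau map
    loosely onto the paper's gating parameters and temperature.
    """
    logits = (B1 @ x + b0) / tau
    logits -= logits.max()          # subtract max for numerical stability
    w = np.exp(logits)
    return w / w.sum()

def activation_gate(x, B1, b0, tau, sigma=np.tanh):
    """Activation dense-to-sparse gate (sketch of the proposed variant):
    route the linear layer's output through an activation sigma before
    the temperature-scaled softmax. sigma=tanh is a placeholder choice.
    """
    logits = sigma(B1 @ x + b0) / tau
    logits -= logits.max()
    w = np.exp(logits)
    return w / w.sum()

rng = np.random.default_rng(0)
x = rng.normal(size=4)
B1 = rng.normal(size=(3, 4))        # 3 experts, 4-dimensional input
b0 = rng.normal(size=3)

# Lowering the temperature concentrates the softmax weights on fewer
# experts, which is how the gate moves from dense toward sparse routing.
for tau in (10.0, 1.0, 0.1):
    print(tau, np.round(dense_to_sparse_gate(x, B1, b0, tau), 3))
```

As `tau` shrinks, the gating distribution sharpens toward a one-hot selection of the top expert; the activation variant differs only in the extra nonlinearity applied before the softmax, which is what the paper's linear independence conditions are imposed on.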