24 Jun 2024 | Huy Nguyen, Pedram Akbarian, Nhat Ho
This paper investigates the impact of the dense-to-sparse gating mechanism on the convergence rates of maximum likelihood estimation in Gaussian mixture of experts (MoE) models. The authors find that the density estimation rate is parametric with respect to the sample size, but the parameter estimation rates are significantly slower, potentially as slow as \( \mathcal{O}(1/\log(n)) \), due to interactions between the softmax temperature and other model parameters through partial differential equations. To address this issue, they propose an activation dense-to-sparse gate, which routes the output of a linear layer through an activation function before applying the softmax. By imposing linear independence conditions on the activation function and its derivatives, they show that the parameter estimation rates improve to polynomial rates. The theoretical findings are validated through numerical experiments, demonstrating the effectiveness of the proposed activation dense-to-sparse gate in enhancing sample efficiency and improving parameter estimation rates. The paper also discusses practical implications for expert selection, expert estimation, misspecified settings, and model design, highlighting the benefits of using sophisticated activation functions in gating mechanisms.
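The two gates described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the parameter shapes, the choice of `tanh` as the activation, and the temperature values are all illustrative assumptions. The standard dense-to-sparse gate applies a tempered softmax to linear scores \( x^{\top}\beta_i / \tau \); the activation variant passes the linear output through an activation before the tempered softmax.

```python
import numpy as np

def dense_to_sparse_gate(x, B, tau):
    """Dense-to-sparse gate: tempered softmax over linear scores x @ B / tau.
    As tau -> 0 the weights concentrate on the top-scoring expert (sparse routing)."""
    scores = x @ B / tau
    scores = scores - scores.max()   # subtract max for numerical stability
    w = np.exp(scores)
    return w / w.sum()

def activation_dense_to_sparse_gate(x, B, tau, act=np.tanh):
    """Activation dense-to-sparse gate: the linear layer output is routed through
    an activation (tanh here is an illustrative choice) before the tempered softmax."""
    scores = act(x @ B) / tau
    scores = scores - scores.max()
    w = np.exp(scores)
    return w / w.sum()

rng = np.random.default_rng(0)
x = rng.normal(size=3)           # input feature vector
B = rng.normal(size=(3, 4))      # gating parameters for 4 experts

w_dense = dense_to_sparse_gate(x, B, tau=1.0)     # high temperature: dense mixing
w_sparse = dense_to_sparse_gate(x, B, tau=0.01)   # low temperature: near one-hot routing
w_act = activation_dense_to_sparse_gate(x, B, tau=0.5)
```

Lowering the temperature `tau` interpolates from dense mixing toward sparse top-1 routing; the activation gate keeps the same mechanism but changes how the gating parameters enter the score, which is what breaks the parameter interactions the paper identifies.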