On Least Square Estimation in Softmax Gating Mixture of Experts


2024 | Huy Nguyen, Nhat Ho, Alessandro Rinaldo
This paper investigates the performance of least squares estimators (LSE) in a deterministic softmax gating mixture of experts (MoE) model, where data are generated from a regression model. The study introduces the concept of "strong identifiability" to characterize the convergence behavior of expert functions. It demonstrates that strongly identifiable experts, such as feed-forward networks with sigmoid or tanh activation functions, enjoy faster estimation rates than polynomial experts, whose rates are provably slower. The paper also analyzes ridge experts, showing that when the activation functions satisfy a strong independence condition the estimation rates remain fast, but if some parameters are zero, the rates can become as slow as $ \mathcal{O}_{P}(1/\log(n)) $. These results highlight the importance of expert selection in practical applications and indicate that polynomial experts are suboptimal for MoE models because of their slow estimation rates. Overall, the study provides theoretical insight into the convergence properties of the LSE under deterministic MoE models and emphasizes the role of the strong identifiability and strong independence conditions in achieving efficient expert estimation.
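
To make the setup concrete, the sketch below fits a small softmax-gated MoE by least squares. It is a minimal illustration assuming a standard formulation, $ Y_i = f_{G}(X_i) + \varepsilon_i $ with $ f_{G}(x) = \sum_{j} \mathrm{softmax}_j(\beta_{1j}^{\top} x + \beta_{0j}) \, \tanh(a_j^{\top} x + b_j) $; the tanh expert form, the parameter names, and the optimizer are illustrative choices, not the authors' exact specification.

```python
# A minimal sketch of least squares estimation in a softmax-gated
# mixture of experts. The tanh expert form and all parameter names
# below are illustrative assumptions, not the paper's exact notation.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d, k = 500, 2, 3  # samples, input dimension, number of experts

def moe_regression(params, X):
    """f_G(x) = sum_j softmax_j(beta1_j^T x + beta0_j) * tanh(a_j^T x + b_j)."""
    p = params.reshape(k, 2 * d + 2)
    beta1, beta0 = p[:, :d], p[:, d]           # gating parameters
    a, b = p[:, d + 1:2 * d + 1], p[:, -1]     # expert parameters
    logits = X @ beta1.T + beta0               # (n, k) gating scores
    gates = np.exp(logits - logits.max(axis=1, keepdims=True))
    gates /= gates.sum(axis=1, keepdims=True)  # softmax over experts
    experts = np.tanh(X @ a.T + b)             # (n, k) expert outputs
    return (gates * experts).sum(axis=1)

# Generate data from a ground-truth MoE plus Gaussian noise.
true_params = rng.normal(size=k * (2 * d + 2))
X = rng.uniform(-1, 1, size=(n, d))
Y = moe_regression(true_params, X) + 0.1 * rng.normal(size=n)

# Least squares estimator: minimize (1/n) * sum_i (Y_i - f_G(X_i))^2.
def lse_objective(params):
    return np.mean((Y - moe_regression(params, X)) ** 2)

fit = minimize(lse_objective, rng.normal(size=true_params.size), method="L-BFGS-B")
print("training MSE of fitted LSE:", fit.fun)
```

In this toy setup the training error approaches the noise level; the paper's contribution is to quantify how fast the fitted experts converge to the true ones as $ n $ grows, and how that rate depends on the expert class (e.g., tanh versus polynomial experts).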