This paper investigates the performance of least squares estimators (LSE) in a deterministic softmax gating mixture of experts (MoE) model, where data are sampled from a regression model. The authors introduce a condition called strong identifiability to characterize the convergence behavior of different classes of expert functions. They show that strongly identifiable experts, such as feed-forward networks with sigmoid or tanh activation functions, can be estimated at rates significantly faster than polynomial experts, which exhibit a surprisingly slow estimation rate. These findings carry practical implications for expert selection: widely used expert functions such as feed-forward networks with sigmoid or tanh activations enjoy faster estimation rates, whereas polynomial experts, including linear experts, are ill-suited to MoE models because of their slow rates. The analysis further highlights the importance of keeping all expert parameters non-zero, since the presence of zero-valued parameters degrades the estimation rate. The paper concludes with a discussion of limitations and future directions, including the need to analyze misspecified settings and the impact of network depth on expert estimation.
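For concreteness, here is a minimal sketch of the regression setup the summary refers to, written in standard softmax-gating MoE notation; the specific symbols ($k$, $\beta_i$, $b_i$, $\eta_i$, $h$) are illustrative assumptions and need not match the paper's own notation. The response is generated as

\[
Y = f_{G_*}(X) + \varepsilon, \qquad
f_G(x) = \sum_{i=1}^{k} \frac{\exp(\beta_i^\top x + b_i)}{\sum_{j=1}^{k} \exp(\beta_j^\top x + b_j)} \, h(x, \eta_i),
\]

where the softmax weights gate the input among $k$ experts $h(\cdot, \eta_i)$, and the LSE fits $f_G$ to the observed pairs $(X, Y)$. Under this schematic formulation, choosing $h(x, \eta) = \eta_1^\top x + \eta_0$ yields the linear (polynomial) experts whose slow rates the paper documents, while $h(x, \eta) = \tanh(\eta_1^\top x + \eta_0)$ gives a one-layer expert of the strongly identifiable type the authors recommend.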