Speeding Up Distributed Machine Learning Using Codes


29 Jan 2018 | Kangwook Lee, Maximilian Lam, Ramtin Pedarsani, Dimitris Papailiopoulos, and Kannan Ramchandran, Fellow, IEEE
This paper explores how coding theory can be used to accelerate distributed machine learning by addressing key system-level challenges: straggler nodes, communication bottlenecks, and system failures. The authors focus on two fundamental primitives of distributed learning, matrix multiplication and data shuffling. For matrix multiplication, they apply erasure codes so that the result can be recovered from a subset of the workers, mitigating stragglers and yielding a $\Theta(\log n)$ speedup over the uncoded scheme, where $n$ is the number of workers. For data shuffling, they exploit excess storage at the workers to reduce communication cost by a factor of $(\alpha + \frac{1}{n})\gamma(n)$, where $\alpha$ is the fraction of the data each worker can cache and $\gamma(n)$ is the cost ratio between unicasting $n$ separate messages and multicasting a single message to $n$ workers.

The two main contributions are (1) a coded matrix multiplication algorithm that uses erasure codes to provide robustness against stragglers, and (2) a coded shuffling algorithm that leverages storage redundancy to reduce communication overhead; both ideas are sketched below. The authors support both with theoretical analysis and experimental results showing that the coded algorithms outperform their uncoded counterparts in speed and efficiency. The paper also discusses the broader implications of coding for distributed systems: coding can be integrated into algorithm design to improve performance without significant changes to the system architecture, and, compared with replication-based approaches, it offers advantages in both speed and resource utilization. The authors conclude that coding theory is a powerful tool for optimizing distributed machine learning, particularly when communication and computation bottlenecks dominate.
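The straggler-mitigation idea can be illustrated with a minimal sketch (not the authors' implementation; the block counts, the Vandermonde generator matrix, and the choice of which workers finish first are all illustrative assumptions): the data matrix is split into $k$ blocks, encoded into $n > k$ coded blocks with an MDS code, and the product $A x$ is recovered from the results of any $k$ workers, so the slowest $n-k$ workers can simply be ignored.

```python
# Sketch of MDS-coded matrix multiplication for straggler tolerance.
import numpy as np

n, k = 5, 3                      # n workers; any k finished results suffice
rows, cols = 6, 4                # rows divisible by k in this toy example

rng = np.random.default_rng(0)
A = rng.standard_normal((rows, cols))
x = rng.standard_normal(cols)

# Split A into k row blocks and encode them with a k x n Vandermonde generator
# (distinct evaluation points give the MDS property over the reals).
blocks = np.stack(np.split(A, k))                          # (k, rows//k, cols)
G = np.vander(np.arange(1, n + 1), k, increasing=True).T   # (k, n)
coded_blocks = np.tensordot(G, blocks, axes=([0], [0]))    # n coded row blocks

# Worker i computes coded_blocks[i] @ x; suppose only these k finish first.
finished = [0, 2, 4]
partial = np.array([coded_blocks[i] @ x for i in finished])  # (k, rows//k)

# Decode at the master: invert the k x k submatrix of G for the finished workers.
decoded = np.linalg.solve(G[:, finished].T, partial)
assert np.allclose(decoded.reshape(-1), A @ x)
```

Replication would need to duplicate every block to tolerate even one slow worker, whereas the code above tolerates any $n - k$ stragglers with only $n/k$ times the storage and computation.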
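The coded-shuffling gain comes from combining packets destined for different workers into a single multicast. The following toy example (an assumption-laden simplification of the paper's scheme, with just two workers and one cached block each) shows the core cancellation step: each worker XORs its cached block against the multicast packet to recover the block it is missing.

```python
# Sketch of the coded-shuffling cancellation step with two workers.
import numpy as np

rng = np.random.default_rng(1)
A = rng.integers(0, 256, size=8, dtype=np.uint8)   # block needed by worker 1, cached at worker 2
B = rng.integers(0, 256, size=8, dtype=np.uint8)   # block needed by worker 2, cached at worker 1

coded = A ^ B                    # one multicast packet instead of two unicasts

recovered_by_w1 = coded ^ B      # worker 1 cancels its cached copy of B
recovered_by_w2 = coded ^ A      # worker 2 cancels its cached copy of A

assert np.array_equal(recovered_by_w1, A)
assert np.array_equal(recovered_by_w2, B)
```

With more workers and larger caches, the same principle lets one coded multicast serve many workers at once, which is the source of the $(\alpha + \frac{1}{n})\gamma(n)$ reduction factor.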