SLURM: Simple Linux Utility for Resource Management

2003 | Andy B. Yoo, Morris A. Jette, and Mark Grondona
This paper describes the Simple Linux Utility for Resource Management (SLURM), a new cluster resource management system developed at the Lawrence Livermore National Laboratory. SLURM is a simple, flexible, and fault-tolerant cluster manager that can scale to thousands of processors. It is designed to be easy to port to different cluster sizes and architectures with minimal effort. The authors believe that SLURM will benefit both users and system architects by providing a simple, robust, and highly scalable parallel job execution environment for their cluster systems.

Linux clusters, built from commodity off-the-shelf (COTS) components, have become popular for parallel computing due to their high performance-cost ratio. As the cost of COTS components decreases and cluster architectures become more scalable, it has become economically feasible to build large-scale clusters with thousands of processors. An essential component for harnessing such a computer is a resource management system. This system performs tasks such as scheduling user jobs, monitoring machine and job status, launching user applications, and managing machine configuration. An ideal resource manager should be simple, efficient, scalable, fault-tolerant, and portable. Unfortunately, there are no open-source resource management systems that meet these requirements. Many existing resource managers have poor scalability and fault-tolerance, making them unsuitable for large clusters. Proprietary systems are often expensive, not available in source-code form, and are typically designed for specific computer systems or interconnects.

The authors designed SLURM with the following goals: simplicity, open-source availability, portability, interconnect independence, and scalability.
SLURM is written in C, uses GNU autoconf, and supports various interconnects and plug-in mechanisms, allowing it to be easily adapted to different infrastructures.
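In practice, the job-scheduling and launch workflow described above is exposed to users through commands such as `sbatch`, `srun`, and `squeue`. As an illustration (not taken from the paper itself, and with resource values that are purely illustrative), a minimal SLURM batch script might look like this; it requires an actual SLURM installation to submit:

```shell
#!/bin/bash
# Minimal SLURM batch script sketch. Job name, node counts, and time
# limit below are illustrative assumptions, not values from the paper.
#SBATCH --job-name=demo-job
#SBATCH --nodes=2            # number of nodes to allocate
#SBATCH --ntasks=4           # total number of parallel tasks
#SBATCH --time=00:05:00      # wall-clock time limit
#SBATCH --output=%x-%j.out   # stdout file: <jobname>-<jobid>.out

# srun launches the tasks in parallel across the allocated nodes.
srun hostname
```

Such a script would be submitted with `sbatch script.sh`, after which `squeue -u $USER` shows its position in the queue, reflecting the scheduling, launching, and monitoring roles of the resource manager described above.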