SLURM: A Highly Scalable Resource Manager

SLURM is an open-source resource manager designed for Linux clusters of all sizes. It provides three key functions. First it allocates exclusive and/or non-exclusive access to resources (computer nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (typically a parallel job) on a set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.

While other resource managers do exist, SLURM is unique in several respects:

  • It is designed to operate in a heterogeneous cluster with up to 65,536 nodes and hundreds of thousands of processors.
  • It can sustain a throughput rate of over 120,000 jobs per hour with bursts of job submissions at several times that rate.
  • Its source code is freely available under the GNU General Public License.
  • It is portable; written in C with a GNU autoconf configuration engine. While initially written for Linux, other UNIX-like operating systems should be easy porting targets.
  • It is highly tolerant of system failures, including failure of the node executing its control functions.
  • A plugin mechanism exists to support various interconnects, authentication mechanisms, schedulers, etc. These plugins are documented and simple enough for the motivated end user to understand the source and add functionality.

 

sudo  apt-get install slurm

 

https://computing.llnl.gov/linux/slurm/