
(v2.3.0.0) Customizing the Slurm Application Environment

Edit the Slurm Environment JSON for Your Purposes:

  1. Copy default-slurm-env.json to something convenient like env0.json.

    1. cp default-slurm-env.json env0.json
  2. Note: The line numbers listed below refer to the example file above. Once you start making changes on the system, the line numbers may shift.

  3. Line 2: "EnvName" is set to slurm by default, but you can specify something unique if needed.

    1. NOTE: Currently, hyphen (-) characters are not supported in values for EnvName.
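The EnvName restriction above can be checked before the JSON is edited. The following is a minimal sketch, not part of the product: is_valid_env_name is a hypothetical helper, and treating only empty names and names containing - as invalid is an assumption based on the note above (other restrictions may apply).

```shell
#!/bin/bash
# Hypothetical helper (not shipped with the product):
# succeeds only if the name is non-empty and contains no "-" character.
is_valid_env_name() {
  local name="$1"
  [[ -n "$name" && "$name" != *-* ]]
}

# Try a few candidate values for EnvName.
for candidate in slurm slurm_env0 slurm-env0; do
  if is_valid_env_name "$candidate"; then
    echo "$candidate: ok"
  else
    echo "$candidate: rejected (empty or contains '-')"
  fi
done
```

An underscore is used in the second candidate only to illustrate one way of avoiding the hyphen.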

  4. Lines 5-20 can be modified for a single pool of identical compute resources, or they can be duplicated and then modified for each “hardware” configuration or “pool” you choose. When duplicating, add a comma after the brace on line 17 of every pool declaration except the final one.

    1. PoolName: This will be the apparent hostnames of the compute resources provided for slurm.

      1. It is recommended that the PoolName values of all pools share a common trunk or base.

    2. PoolSize: This is the maximum number of these compute resources.

    3. ProfileName: The default profile name is az1. If you change it, you will need to carry the change forward.

    4. CPUs: This is the targeted CPU-core limit for this "hardware" configuration or pool.

    5. ImageName: This is tied to the AMI that will be used for your compute resources. This name will be used in subsequent steps.

    6. MaxMemory: This is the targeted memory limit for this "hardware" configuration or pool.

    7. MinMemory: reserved for future use; can be ignored currently.

    8. UserData: This string is a base64-encoded version of user_data.sh.

      1. To generate it:

        1. cat user_data.sh | base64 -w 0

      2. To decode it:

        1. echo "<LongBase64EncodedString>" | base64 -d

      3. It’s not required to be perfectly fine-tuned at this stage; it will be refined and corrected later.

      4. You may format user_data.sh in the usual ways:

        1. Simple slurm example:

          #!/bin/bash
          set -x
          #export SLURM_BIN_DIR=/opt/slurm/bin
          export SLURM_BIN_DIR=/usr/bin
          hostname XSPOT_NODENAME
          ${SLURM_BIN_DIR}/scontrol update nodename=XSPOT_NODENAME nodeaddr=`hostname -I | cut -d" " -f1`
          systemctl start slurmd

        2. APC Example:

        3. #!/bin/bash set -x APCHEAD=XXX.XX.X.XXX #enter APC Head Node IP Address ###### hostname XSPOT_NODENAME #For trouble shooting #echo root:TroubleShooting |chpasswd #sed -i 's/PasswordAuthentication no/PasswordAuthentication yes/g' /etc/sshd/sshd_config #sed -i 's/UsePAM yes/UsePAM no/g' /etc/sshd/sshd_config sed -i 's/#PermitRootLogin yes/PermitRootLogin yes/g' /etc/ssh/sshd_config echo 'ssh-rsa 0101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010 root@APCHEAD' >> /root/.ssh/authorized_keys systemctl restart sshd mkdir -p /home /opt/parallelcluster/shared /opt/intel /opt/slurm for i in /home /opt/parallelcluster/shared /opt/intel /opt/slurm ; do echo Mounting ${APCHEAD}:${i} ${i} mount -t nfs ${APCHEAD}:${i} ${i} echo Mounting ${APCHEAD}:${i} ${i} : SUCCESS. done mkdir /exoniv echo 'fs-0553a8e956ccff4da.efs.us-east-1.amazonaws.com:/ /exoniv nfs4 nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=30,retrans=2,noresvport,_netdev 0 0' >> /etc/fstab mount -a #add local users, real users, and/or testing users groupadd -g 899 exo useradd -u 1001 -g 899 krs groupadd -g 401 slurm groupadd -g 402 munge useradd -g 401 -u 401 slurm useradd -g 402 -u 402 munge rpm -ivh /opt/parallelcluster/shared/munge/x86_64/munge-0.5.14-1.el7.x86_64.rpm cp -p /opt/parallelcluster/shared/munge/munge.key /etc/munge/ chown munge.munge /etc/munge /var/log/munge mkdir -p /var/spool/slurmd chown slurm.slurm /var/spool/slurmd sleep 5 systemctl start munge if [[ $? 
-ne 0 ]]; then sleep 10 systemctl start munge fi SLURM_BIN_PATH=/opt/slurm/bin SLURM_SBIN_PATH=/opt/slurm/sbin SLURM_CONF_DIR=/opt/slurm/etc ${SLURM_BIN_PATH}/scontrol update nodename=XSPOT_NODENAME nodeaddr=`hostname -I | cut -d" " -f1` #systemctl start slurmd ${SLURM_SBIN_PATH}/slurmd -f ${SLURM_CONF_DIR}/slurm.conf -N XSPOT_NODENAME
    9. VolumeSize: reserved for future use; can be ignored currently.
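Before pasting a UserData string into the JSON, it can help to confirm that the encoded value decodes back to the original script. The following is a minimal sketch of that round trip using the same base64 -w 0 and base64 -d commands shown above. The temporary directory and the three-line stand-in script are illustrative assumptions, and -w 0 assumes GNU coreutils base64.

```shell
#!/bin/bash
set -euo pipefail

# Work in a throwaway directory so no real files are touched.
workdir="$(mktemp -d)"

# Stand-in for your real user_data.sh (illustrative content only).
cat > "$workdir/user_data.sh" <<'EOF'
#!/bin/bash
set -x
hostname XSPOT_NODENAME
EOF

# Encode on a single line (-w 0 disables line wrapping in GNU base64).
encoded="$(base64 -w 0 < "$workdir/user_data.sh")"

# Decode again and compare byte-for-byte with the original.
printf '%s' "$encoded" | base64 -d > "$workdir/decoded.sh"

if cmp -s "$workdir/user_data.sh" "$workdir/decoded.sh"; then
  roundtrip=ok
else
  roundtrip=mismatch
fi
echo "round trip: $roundtrip"

rm -rf "$workdir"
```

The value held in encoded is what would go into the UserData field; because it is produced with -w 0, it contains no line breaks.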

  5. Lines 24, 25, and 26 should be modified to match your slurm environment and your preferred partition name.

    1. BinPath: This is where scontrol, squeue, and other slurm binaries exist.

    2. ConfPath: This is where slurm.conf resides.

    3. PartitionName: This is the name of the new partition.

  6. All other fields/lines in this asset can be ignored.
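The BinPath and ConfPath descriptions above suggest a quick sanity check before the file is used: scontrol should be executable under BinPath and slurm.conf should exist under ConfPath. The following is a minimal sketch under those assumptions. check_slurm_paths is a hypothetical helper, not part of the product, and the example paths are the ones used in the APC example, which may differ on your system.

```shell
#!/bin/bash
# Hypothetical helper: returns 0 only if scontrol is executable under
# bin_path and slurm.conf exists under conf_path.
check_slurm_paths() {
  local bin_path="$1" conf_path="$2" status=0
  if [[ ! -x "$bin_path/scontrol" ]]; then
    echo "BinPath: no executable scontrol in $bin_path" >&2
    status=1
  fi
  if [[ ! -f "$conf_path/slurm.conf" ]]; then
    echo "ConfPath: no slurm.conf in $conf_path" >&2
    status=1
  fi
  return "$status"
}

# Example values from the APC example; adjust them to your site.
if check_slurm_paths /opt/slurm/bin /opt/slurm/etc; then
  echo "paths look consistent"
else
  echo "review BinPath/ConfPath before continuing"
fi
```

The check only confirms that the files exist; it does not validate the contents of slurm.conf.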