Edit the Slurm Environment JSON for Your Purposes:

  1. Copy default-slurm-env.json to something convenient like env0.json.

    1. cp default-slurm-env.json env0.json
  2. Note: The line numbers listed below refer to the example file above. Once you start making changes, the line numbers may shift.

  3. Line 2: "EnvName" is set to slurm by default, but you can specify a unique name if needed.

    1. NOTE: Currently, - characters are not supported in values for EnvName.
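
The restriction on EnvName can be checked before the file is used. The following is a minimal sketch only: the function name check_env_name is hypothetical and not part of any tooling described on this page, and it tests solely for the - character mentioned in the note.

```shell
#!/bin/bash
# Hypothetical helper (not part of the product): succeeds only when the
# proposed EnvName value contains no '-' character.
check_env_name() {
    case "$1" in
        *-*)
            echo "EnvName '$1' contains '-', which is currently not supported" >&2
            return 1
            ;;
        *)
            return 0
            ;;
    esac
}
```

For example, check_env_name slurm succeeds, while check_env_name my-env prints a message to stderr and returns a non-zero status.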

  4. Lines 5-20 describe one pool. Modify them in place for a single pool of identical compute resources, or duplicate them and modify each copy for every “hardware” configuration or “pool” you choose. When duplicating, add a comma after the closing brace on line 17 of each pool declaration except the final one.

    1. PoolName: This will be the apparent hostnames of the compute resources provided for slurm.

      1. It is recommended that the PoolName values of all pools share a common trunk, or base name.

    2. PoolSize: This is the maximum number of these compute resources.

    3. ProfileName: This is the default profile name, az1. If you change it, you will need to carry the change forward.

    4. CPUs: This is the targeted CPU-core limit for this "hardware" configuration or pool.

    5. ImageName: This is tied to the AMI that will be used for your compute resources. This name will be used in subsequent steps.

    6. MaxMemory: This is the targeted memory limit for this "hardware" configuration or pool.

    7. MinMemory: reserved for future use; can be ignored currently.

    8. UserData: This string is a base64-encoded version of user_data.sh.

      1. To generate it:

        1. cat user_data.sh | base64 -w 0

      2. To decode it:

        1. echo "<LongBase64EncodedString>" | base64 -d

      3. The script does not need to be perfectly fine-tuned at this stage; it will be refined and corrected later.

      4. You may write user_data.sh like any ordinary shell script:

        1. Simple slurm example:

          #!/bin/bash
          set -x
          #export SLURM_BIN_DIR=/opt/slurm/bin
          export SLURM_BIN_DIR=/usr/bin
          hostname XSPOT_NODENAME
          ${SLURM_BIN_DIR}/scontrol update nodename=XSPOT_NODENAME nodeaddr=`hostname -I | cut -d" " -f1`
          systemctl start slurmd
        2. APC Example:

          #!/bin/bash
          set -x
          APCHEAD=XXX.XX.X.XXX     #enter APC Head Node IP Address
          ######
          hostname XSPOT_NODENAME
          #For troubleshooting
          #echo root:TroubleShooting |chpasswd
          #sed -i 's/PasswordAuthentication no/PasswordAuthentication yes/g' /etc/ssh/sshd_config
          #sed -i 's/UsePAM yes/UsePAM no/g' /etc/ssh/sshd_config
          sed -i 's/#PermitRootLogin yes/PermitRootLogin yes/g' /etc/ssh/sshd_config
          echo 'ssh-rsa 0101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010 root@APCHEAD' >> /root/.ssh/authorized_keys
          systemctl restart sshd
          mkdir -p /home /opt/parallelcluster/shared /opt/intel /opt/slurm
          for i in /home /opt/parallelcluster/shared /opt/intel /opt/slurm ; do
          	echo Mounting ${APCHEAD}:${i} ${i}
          	# Report success only if the mount actually succeeded
          	mount -t nfs ${APCHEAD}:${i} ${i} && echo Mounting ${APCHEAD}:${i} ${i} : SUCCESS.
          done
          mkdir /exoniv
          echo 'fs-0553a8e956ccff4da.efs.us-east-1.amazonaws.com:/ /exoniv nfs4 nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=30,retrans=2,noresvport,_netdev 0 0' >> /etc/fstab
          mount -a
          #add local users, real users, and/or testing users
          groupadd -g 899 exo
          useradd -u 1001 -g 899 krs
          groupadd -g 401 slurm
          groupadd -g 402 munge
          useradd -g 401 -u 401 slurm
          useradd -g 402 -u 402 munge
          rpm -ivh /opt/parallelcluster/shared/munge/x86_64/munge-0.5.14-1.el7.x86_64.rpm
          cp -p /opt/parallelcluster/shared/munge/munge.key /etc/munge/
          chown munge:munge /etc/munge /var/log/munge
          mkdir -p /var/spool/slurmd
          chown slurm:slurm /var/spool/slurmd
          sleep 5
          systemctl start munge
          if [[ $? -ne 0 ]]; then 
          	sleep 10
          	systemctl start munge
          fi
          SLURM_BIN_PATH=/opt/slurm/bin
          SLURM_SBIN_PATH=/opt/slurm/sbin
          SLURM_CONF_DIR=/opt/slurm/etc
          ${SLURM_BIN_PATH}/scontrol update nodename=XSPOT_NODENAME nodeaddr=`hostname -I | cut -d" " -f1`
          #systemctl start slurmd
          ${SLURM_SBIN_PATH}/slurmd -f ${SLURM_CONF_DIR}/slurm.conf -N XSPOT_NODENAME
          
    9. VolumeSize: reserved for future use; can be ignored currently.
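
The encode and decode commands given for UserData can be combined into a round-trip check, so that the string pasted into the JSON is known to decode back to the script. This is a sketch under stated assumptions: the function names encode_user_data and verify_user_data are hypothetical, and the -w 0 option assumes GNU coreutils base64, as in the commands above.

```shell
#!/bin/bash
# Hypothetical helpers (names are illustrative, not part of the product).

# Print the file given as $1 as a single-line base64 string.
encode_user_data() {
    base64 -w 0 < "$1"
}

# Decode the base64 string in $2 and compare it with the file in $1.
# Returns 0 only if the decoded bytes are identical to the file.
verify_user_data() {
    printf '%s' "$2" | base64 -d | cmp -s - "$1"
}
```

For example, run encoded=$(encode_user_data user_data.sh), then verify_user_data user_data.sh "$encoded" before pasting the value of $encoded into the UserData field.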

  5. Lines 24, 25, and 26 should be modified to match your slurm environment and your preferred partition name.

    1. BinPath: This is where scontrol, squeue, and other slurm binaries exist.

    2. ConfPath: This is where slurm.conf resides.

    3. PartitionName: This is for naming the new partition.
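
Before entering BinPath and ConfPath, it may help to confirm that the directories contain what the descriptions above expect. The sketch below is illustrative only: check_slurm_paths is a hypothetical helper, and it assumes ConfPath is the directory that holds slurm.conf rather than the full path to the file.

```shell
#!/bin/bash
# Hypothetical pre-flight check (not part of the product).
# $1 = candidate BinPath, $2 = candidate ConfPath (assumed to be a directory)
check_slurm_paths() {
    bin_path="$1"
    conf_path="$2"
    status=0
    # scontrol and squeue are the binaries named in the BinPath description
    for tool in scontrol squeue; do
        if [ ! -x "${bin_path}/${tool}" ]; then
            echo "missing or not executable: ${bin_path}/${tool}" >&2
            status=1
        fi
    done
    if [ ! -f "${conf_path}/slurm.conf" ]; then
        echo "missing: ${conf_path}/slurm.conf" >&2
        status=1
    fi
    return "$status"
}
```

With the paths used in the APC example, the call would be check_slurm_paths /opt/slurm/bin /opt/slurm/etc.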

  6. All other fields/lines in this asset can be ignored.
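
After duplicating pool declarations, a missing or extra comma is the most likely mistake, and it makes the file invalid JSON. A syntax check along the following lines can catch that. It is a sketch only: validate_env_json is a hypothetical name, and it assumes python3 is installed (any JSON parser, such as jq, would serve equally well). It checks syntax only, not the meaning of any field.

```shell
#!/bin/bash
# Hypothetical helper: returns 0 if the file in $1 parses as JSON.
# Assumes python3 is available; json.tool is part of its standard library.
validate_env_json() {
    python3 -m json.tool "$1" > /dev/null
}
```

For example, validate_env_json env0.json && echo "env0.json is well-formed" prints the message only when the file parses; otherwise the parser reports the line and column of the problem.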
