(v2.2.0.1) Integrating with Slurm Cluster

The following manual steps for EAR will be replaced with a simplified workflow for command-line users. Alternatively, the Management Console (Web UI) will be able to replace most of these steps as well.

Connect to Your Slurm Head Node

During Early Access, integration requires a handful of commands and root or sudo access on the Slurm controller, where slurmctld runs.

  1. Get a shell on the head node and navigate to the slurm configuration directory, where slurm.conf resides.

    1. [ root@HEAD ~ ]# cd $SLURM_CONF_DIR
  2. Make subdirectories here:

    1. [ root@HEAD /etc/slurm ]# mkdir -p assets/json
       [ root@HEAD /etc/slurm ]# cd assets/json
       [ root@HEAD /etc/slurm/assets/json ]#
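
If $SLURM_CONF_DIR is not already set in your shell, one way to derive it is from the SLURM_CONF line printed by `scontrol show config`. The sketch below is only an illustration and not part of the product tooling: the function name conf_dir_from_scontrol is hypothetical, and the sample text stands in for real scontrol output so the snippet runs without a cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `scontrol show config` output on stdin and print
# the directory that holds slurm.conf. Returns non-zero if no SLURM_CONF line
# is found.
conf_dir_from_scontrol() {
    local conf_file
    # The relevant line looks like: "SLURM_CONF              = /etc/slurm/slurm.conf"
    conf_file=$(awk -F' *= *' '$1 == "SLURM_CONF" { print $2; exit }')
    [ -n "$conf_file" ] && dirname "$conf_file"
}

# Sample input (an assumption standing in for real scontrol output).
sample_output='SLURM_CONF              = /etc/slurm/slurm.conf
SLURM_VERSION           = 23.02.7'

SLURM_CONF_DIR=$(printf '%s\n' "$sample_output" | conf_dir_from_scontrol)
echo "$SLURM_CONF_DIR"
```

On the head node the same function would be fed live output, for example `SLURM_CONF_DIR=$(scontrol show config | conf_dir_from_scontrol)`.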

Pull Down the Default Slurm Environment Assets as a JSON Payload:

  1. We will need the private MGMT_SERVER_IP, and the packages for jq and curl:

    1. yum install jq curl
  2. Pull down the default assets from the MGMT_SERVER for customization:

  3. The asset will look like:

Edit the Slurm Environment JSON for Your Purposes:

  1. Copy default-slurm-env.json to something convenient like env0.json.

  2. Lines 5-17 can be modified for a single pool of identical compute resources, or they can be duplicated and then modified for each “hardware” configuration or “pool” you choose. When duplicating, add a comma after the closing brace of each pool declaration (line 17 in the default file), except after the final pool declaration.

    1. PoolName: This sets the apparent hostnames of the compute resources provided to Slurm.

      1. It is recommended that all pools share a common trunk or base in each PoolName.

    2. PoolSize: This is the maximum number of these compute resources.

    3. ProfileName: This is the default profile name, az1. If this is changed, you will need to carry the change forward.

    4. CPUs: This is the targeted CPU-core limit for this "hardware" configuration or pool.

    5. ImageName: This is tied to the AMI that will be used for your compute resources. This name will be used in subsequent steps.

    6. MaxMemory: This is the targeted memory limit for this "hardware" configuration or pool.

    7. MinMemory: reserved for future use; can be ignored currently.

    8. UserData: This string is a base64 encoded version of user_data.

      1. To generate it:

        1. cat user_data.sh | base64 -w 0

      2. To decode it:

        1. echo "<LongBase64EncodedString>" | base64 -d

      3. It’s not required to be perfectly fine-tuned at this stage; it will be refined and corrected later.

    9. VolumeSize: reserved for future use; can be ignored currently.

  3. Lines 24, 25, 26 should be modified for your slurm environment and according to your preference for the partition name.

    1. BinPath: This is where scontrol, squeue, and other slurm binaries exist.

    2. ConfPath: This is where slurm.conf resides.

    3. PartitionName: This is for naming the new partition.

  4. All other fields/lines in this asset can be ignored.
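
The UserData encode and decode commands above can be checked with a quick round trip before the string is pasted into the JSON. This is a minimal sketch: the user_data.sh written here is a stand-in, not the real script, and `-w 0` assumes the GNU coreutils version of base64.

```shell
#!/usr/bin/env bash
# Work in a scratch directory so nothing under the Slurm config is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Stand-in user_data.sh; the real script comes from your site configuration.
cat > user_data.sh <<'EOF'
#!/bin/bash
echo "hello from cloud-init"
EOF

# Encode on a single line (-w 0 disables line wrapping).
encoded=$(base64 -w 0 user_data.sh)

# Decode and compare with the original contents.
decoded=$(printf '%s' "$encoded" | base64 -d)
if [ "$decoded" = "$(cat user_data.sh)" ]; then
    echo "round trip OK"
else
    echo "round trip FAILED" >&2
fi
```

The value of `$encoded` is what goes into the UserData field. Keeping it on a single line matters because a JSON string cannot contain raw line breaks.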

Validate and Push the Customized Environment to the MGMT_SERVER

  1. Validate the JSON asset with jq:

    1. You will see well-formatted JSON if jq can read the file, indicating no errors. If you see an error message, that means the JSON is not valid.

  2. When the JSON is valid, the file can be pushed to the MGMT_SERVER:
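
For the validation in step 1, a minimal sketch of a pass/fail check is shown here. The helper name validate_json and the two sample files are hypothetical, and the sample objects reuse only field names from this page, not the real asset layout. The python3 fallback is an assumption for hosts where jq is missing and is not part of the documented workflow.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the file parses as JSON.
validate_json() {
    local file=$1
    if command -v jq >/dev/null 2>&1; then
        # `jq empty` parses the input and prints nothing; a parse error
        # gives a non-zero exit status.
        jq empty "$file" >/dev/null 2>&1
    else
        # Fallback (assumption): Python's bundled JSON checker.
        python3 -m json.tool "$file" >/dev/null 2>&1
    fi
}

tmp=$(mktemp -d)
printf '{"PoolName": "xvm", "PoolSize": 10}\n'  > "$tmp/good.json"
printf '{"PoolName": "xvm", "PoolSize": 10,}\n' > "$tmp/bad.json"   # stray trailing comma

validate_json "$tmp/good.json" && echo "good.json: valid"
validate_json "$tmp/bad.json"  || echo "bad.json: invalid"
```

The trailing-comma case is the most likely mistake when pool declarations are duplicated, which is why it is used as the failing sample.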

Pull Down the Default Profile Assets as a JSON Payload:

  1. Pull down the default profile from the MGMT_SERVER.

  2. Back up the default profile before editing it.

  3. The asset will look like this:

Edit the Profile JSON for Your Purposes:

  1. Lines 5-9, 25-29: tagging

    1. Controllers

    2. Workers

    3. JSON reminder

  2. Line 11: InstanceType

    1. Controllers are always on-demand.

  3. Line 20: MaxControllers

    1. x80

  4. Line 21: ProfileName

  5. Lines 31-34: on-demand-worker explanation

  6. Lines 38-43: spot_fleet worker explanation

  7. Line 48: Hyperthreading: reserved for future use; can be ignored currently.

  8. Line 52: NodeGroupName: this string appears in the controller Name tagging.

  9. All other fields/lines in this asset can be ignored.

Validate and Push the Customized Profile to the MGMT_SERVER

  1. Validate the JSON asset with jq.

  2. When the JSON is valid, push the file to the MGMT_SERVER.

Download Scheduler Assets from the Management Server

  1. We will need the private MGMT_SERVER_IP:

Compute AMI Preparation

An AMI is required that is capable of running your workloads. Ideally, this AMI is capable of booting quickly.

Validation of Migratable VM Joined to Your Slurm Cluster

The script test_createVm.sh exists for a quick validation.

  1. Check the user_data.sh script. This script will be executed in the VM on boot by cloud-init.

    1. Verify paths are correct to slurm binaries by:

      1. Adding SLURM_BIN_DIR to the file:

      2. Searching and replacing all scontrol with ${SLURM_BIN_DIR}/scontrol.

  2. Using a host from the xspot.slurm.conf file (default values are xspot-vm0[1-9]) and the IMAGE_NAME set previously on line 10 in parse_helper.sh, run test_createVM.sh with the flags and arguments as below:

  3. The script will continuously output updates until the VM is created. When the VM is ready and has joined your Slurm Cluster, the script will exit and you’ll see all the fields in the output are now filled with values:

  4. (Optional) The user_data.sh referenced in step 2 above may need some tuning for your environment:

    1. This step may require several iterations as tuning and troubleshooting are critical to integration.

    2. You can ssh to the VM using the IP reported on the last line of output and validate the environment.

      1. Special consideration and possible temporary tuning in the user_data.sh may be required to allow access to the VM, to troubleshoot issues with services, authentication, and other vital requirements for your environment.

  5. When satisfied that the migratable VM has joined the scheduler, this temporary asset can be cleaned up by killing the job with the following command:

    1. Replace MGMT_SERVER_IP with the IP address of the management server

    2. Replace VM_NAME with the name of the VM (option -h) in step 2
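
The scontrol search-and-replace from step 1 can be sketched with sed. This is one possible approach under stated assumptions, not the documented procedure: the helper name prefix_scontrol is hypothetical, the sample file stands in for user_data.sh, and SLURM_BIN_DIR=/usr/bin is a placeholder value. Unlike a plain "replace all", it skips occurrences that already follow a "/", so a second run does not double the prefix.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prefix bare `scontrol` calls with ${SLURM_BIN_DIR}/.
# Occurrences already preceded by "/" (an explicit path) are left untouched.
prefix_scontrol() {
    file=$1
    # The replacement text is single-quoted so that ${SLURM_BIN_DIR} is
    # written literally into the script instead of being expanded here.
    sed -E 's#(^|[^/[:alnum:]_])scontrol#\1${SLURM_BIN_DIR}/scontrol#g' \
        "$file" > "$file.new" && mv "$file.new" "$file"
}

# Stand-in for user_data.sh so the sketch runs anywhere.
script=$(mktemp)
cat > "$script" <<'EOF'
SLURM_BIN_DIR=/usr/bin
scontrol update nodename="$host" state=resume
state=$(scontrol show node "$host")
/usr/bin/scontrol ping
EOF

prefix_scontrol "$script"
prefix_scontrol "$script"   # second run changes nothing
cat "$script"
```

The same helper could be pointed at the resume and suspend scripts later in this guide, after reviewing the result with a diff against a backup copy.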

Finalize Integration with Slurm

  1. Create a reference file as a reminder with the prerequisite information:

    1. SLURM_CONF_DIR, SLURM_BIN_DIR, and MGMT_SERVER_IP:

  2. Site-specific customization is required for resume_xspot.sh:

    1. Add SLURM_BIN_DIR and USER_DATA_FILE to the top of the file with the correct paths:

    2. Search and replace all scontrol with ${SLURM_BIN_DIR}/scontrol.

    3. Specify a log location for autoscaling:

    4. Comment out “Example #1: inline” command below. Uncomment the command below “Example #2: from a file…”

    5. Replace "ImageName": "ubuntu", with "ImageName": "IMAGE_NAME", based on your <IMAGE_NAME> in the parse_helper.sh and test_createVM.sh commands above.

  3. Site-specific customization is required for suspend_xspot.sh:

    1. Specify the same log location for autoscaling:

    2. Add SLURM_BIN_DIR to the top of the file with the correct path:

    3. Search and replace all scontrol with ${SLURM_BIN_DIR}/scontrol.

  4. Edit your slurm.conf:

    1. Add an include statement to pick up xspot.slurm.conf:

  5. Edit xspot.slurm.conf:

    1. ResumeProgram: autoscaling of migratable VMs will be handled by ${SLURM_CONF_DIR}/assets/xspot_resume.sh

      1. For APC users or others with a preexisting autoscaling solution, we’ll replace the ResumeProgram in slurm.conf with a wrapper to handle both the existing autoscaling and Infrastructure Optimizer’s.

    2. SuspendProgram: autoscaling of migratable VMs will be cleaned up by ${SLURM_CONF_DIR}/assets/xspot_suspend.sh

      1. For APC users, we’ll replace the SuspendProgram in slurm.conf with a wrapper to handle both the APC autoscaling and Infrastructure Optimizer’s.

    3. ResumeRate, Partition’s “Name”, and Partition’s “OverSubscribe” settings may be reset as best suits your environment. Recommendations are as follows:

      1. Set ResumeRate=100

      2. Set Partition Name to something memorable and distinct, e.g. “exostellar.”

      3. Delete OverSubscribe= and everything after it.

    4. Increase the node-count for the NodeName= declaration by increasing the range of node names. By default, the config ships with 9 compute nodes. Multiples of 80 make the most sense with default configs, so change 9 nodes to 800 by replacing:

      1. with

    5. Introducing new nodes into a Slurm cluster requires a restart of the Slurm control daemon:

  6. Integration steps are complete and a job submission to the new partition is the last validation:

    1. As a user, navigate to a valid job submission directory and launch a job as normal, but be sure to specify the new partition:

      1. sbatch -p NewPartitionName < job-script.sh
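
If no job script is at hand for this final check, a throwaway one can be generated as sketched below. Every value in it (job name, time limit, output file name) is a placeholder chosen for illustration, not something the product requires.

```shell
#!/usr/bin/env bash
# Write a small smoke-test job script into the current directory.
cat > job-script.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=xspot-smoke-test
#SBATCH --ntasks=1
#SBATCH --time=00:05:00
#SBATCH --output=smoke-test-%j.out

# Report which node ran the job.
echo "Running on $(uname -n)"
EOF

# Syntax-check locally before submitting.
bash -n job-script.sh && echo "job-script.sh parses cleanly"
```

After submitting it with the sbatch command above, `squeue -u "$USER"` shows the job's state, and the node name in the output file should match one of the node names defined in xspot.slurm.conf.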