
(v2.3.0.0) Compute Cluster Updates Slurm


Compute Cluster Updates

From time to time, the cluster may require updates. Similarly, the base images leveraged in Exostellar’s stack may need those same updates.

If the Updates are Minimal

For certain CVEs where a single sed command is all that’s required, the user_data.sh scripts may suffice. Similarly, instantaneous or relatively trivial changes to the image are good candidates for updating via user_data.sh. If the additional commands amount to more than a few seconds of per-boot work, it is recommended to use an updated version of the AMI as described below.
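As an illustration, a minimal per-boot remediation delivered through user_data.sh might look like the following sketch; the file path, setting, and service name are hypothetical placeholders rather than a specific CVE fix.

  #!/bin/bash
  # Hypothetical one-line remediation applied at each instance boot via user_data.sh.
  # Replace the path and pattern with whatever the advisory actually requires.
  sed -i 's/^ExampleSetting yes/ExampleSetting no/' /etc/example/example.conf

  # Restart the affected service so the change takes effect (service name is a placeholder).
  systemctl restart example.service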

If the Updates are More Substantial

At times, new software requirements in the cluster may introduce significant changes due to new dependencies. For updates that take more than roughly 10 seconds to realize at boot, it is recommended to leverage an updated version of the AMI and stop using the outdated image asset going forward. The steps below outline the required process.

Fast Option: Replace the Contents of the Existing Image by Leveraging the Updated AMI’s Snapshot-ID
  1. With the AMI-ID, we can query AWS for the Snapshot-ID or we can find it in the AWS Console:

    1. AWS CLI query for CentOS7 and Rocky9 AMIs:

      1. aws ec2 describe-images --image-ids <AMI_ID> --query 'Images[0].BlockDeviceMappings[?DeviceName==`/dev/sda1`].Ebs.SnapshotId' --output text
    2. AWS Console: Navigate to EC2 > AMIs > <AMI_ID> and look for the Block Devices entry on the page, which contains the Snapshot-ID embedded in a string such as: “/dev/sda1=snap-007d78a7a00b2dc84:30:true:gp2”.

  2. To swap out the previous Snapshot-ID with the new one, first pull down the image information as a JSON object:

    1. curl -v -X GET http://${MGMT_SERVER_IP}:5000/v1/image/<image_name> | jq . > in.json
  3. Update the in.json file with the new Snapshot-ID (a consolidated sketch of this procedure appears after these steps).

  4. Send the in.json file back to the Management Server:

    1. curl -v -d "@in.json" -H 'Content-Type: application/json' -X PUT http://${MGMT_SERVER_IP}:5000/v1/image
  5. While running jobs will continue to use the previous image, jobs submitted after this point will pick up the new image.
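For reference, the fast option can be scripted end to end roughly as follows. This is a sketch, not a supported tool: the AMI ID and image name are placeholders, and the sed-based swap assumes the previous Snapshot-ID appears verbatim in in.json (verify the file before pushing it back).

  #!/bin/bash
  # Sketch of the fast option; <values> are placeholders for your environment.
  MGMT_SERVER_IP=<MGMT_SERVER_IP>
  IMAGE_NAME=<image_name>
  NEW_AMI_ID=<AMI_ID>

  # 1. Look up the updated AMI's root-volume Snapshot-ID.
  NEW_SNAP=$(aws ec2 describe-images --image-ids "$NEW_AMI_ID" \
    --query 'Images[0].BlockDeviceMappings[?DeviceName==`/dev/sda1`].Ebs.SnapshotId' --output text)

  # 2. Pull down the current image record as JSON.
  curl -s -X GET "http://${MGMT_SERVER_IP}:5000/v1/image/${IMAGE_NAME}" | jq . > in.json

  # 3. Swap the previous Snapshot-ID for the new one (assumes the old snap-... ID appears verbatim).
  OLD_SNAP=$(grep -o 'snap-[0-9a-f]*' in.json | head -n 1)
  sed -i "s/${OLD_SNAP}/${NEW_SNAP}/" in.json

  # 4. Push the updated record back to the Management Server.
  curl -v -d "@in.json" -H 'Content-Type: application/json' -X PUT "http://${MGMT_SERVER_IP}:5000/v1/image"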

Longer Option: Add the New Image as a Subsequent Iteration and Reconfigure

To leave the previous image intact and unchanged, repeat the process of parsing an AMI and then update the environment; a consolidated sketch of these steps follows the list below.

  1. When performing the install, we retrieved some assets from the Management Server and stored them in the Exostellar Slurm configuration directory.

    1. cd $SLURM_CONF_DIR/exostellar
    2. The parse_helper.sh script is used to take an AMI and import it into Infrastructure Optimizer.

      ./parse_helper.sh -a <AMI-ID> -i <IMAGE_NAME>
    3. The AMI-ID should:

      1. be based on a slurm compute node from your cluster, capable of running your workloads.

      2. be created by this account.

      3. not have product codes.

    4. The Image Name should be unique in Infrastructure Optimizer.

    5. Additionally, we can pass -s script.sh if troubleshooting is required.

  2. When the job completes, update env0.json with the new image name.

    1. Backup the original.

      cp env0.json env0.json.orig
    2. Edit the env0.json file to add the new image name, replacing the previous image name.

  3. Push the updated env0.json to the MGMT_SERVER:

    curl -d "@env0.json" -H 'Content-Type: application/json' -X PUT http://${MGMT_SERVER_IP}:5000/v1/env
  4. If any profile is currently using the image that was just replaced, set that profile to drain so that active jobs run through to completion but new jobs will be routed to fresh assets leveraging the new image.

    1. For any profiles using the old IMAGE_NAME:

      curl -d "@profile0.json" -H 'Content-Type: application/json' -X PUT "http://${MGMT_SERVER_IP}:5000/v1/profile?force=true&drain=true"

      NOTE: The initial profile was named profile0.json in the ${SLURM_CONF_DIR}/exostellar/json folder according to previous steps in this documentation.
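For reference, the longer option can also be scripted roughly as follows. This is a sketch under assumptions: the AMI ID and image names are placeholders, env0.json and profile0.json are assumed to live in ${SLURM_CONF_DIR}/exostellar/json as described above, and the sed substitution assumes the old image name appears verbatim in env0.json.

  #!/bin/bash
  # Sketch of the longer option; <values> are placeholders for your environment.
  MGMT_SERVER_IP=<MGMT_SERVER_IP>
  NEW_AMI_ID=<AMI-ID>
  OLD_IMAGE_NAME=<old_image_name>
  NEW_IMAGE_NAME=<new_image_name>

  # Pre-flight check: the AMI should be owned by this account and carry no product codes.
  aws ec2 describe-images --image-ids "$NEW_AMI_ID" \
    --query 'Images[0].{Owner:OwnerId,ProductCodes:ProductCodes}' --output json

  # Import the AMI into Infrastructure Optimizer (wait for the parse job to complete before continuing).
  cd "$SLURM_CONF_DIR/exostellar"
  ./parse_helper.sh -a "$NEW_AMI_ID" -i "$NEW_IMAGE_NAME"

  # Update env0.json with the new image name (assumes the old name appears verbatim), then push it.
  cd "$SLURM_CONF_DIR/exostellar/json"
  cp env0.json env0.json.orig
  sed -i "s/${OLD_IMAGE_NAME}/${NEW_IMAGE_NAME}/" env0.json
  curl -d "@env0.json" -H 'Content-Type: application/json' -X PUT "http://${MGMT_SERVER_IP}:5000/v1/env"

  # Drain any profile that referenced the old image so running jobs finish while new jobs use the new image.
  curl -d "@profile0.json" -H 'Content-Type: application/json' \
    -X PUT "http://${MGMT_SERVER_IP}:5000/v1/profile?force=true&drain=true"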