This troubleshooting guide is designed to help you identify and resolve common issues in your Infrastructure Optimizer environment. Using this guide, you can quickly pinpoint problems and implement solutions to ensure smooth operations.

General

How do you handle Spot disruptions?

Exostellar uses a prediction algorithm to anticipate Amazon EC2 Spot Instance terminations, which enables proactive migration before the termination actually occurs. In the worst case, where AWS provides only a two-minute warning, Exostellar quickly and automatically initiates the migration process. Smaller instances typically migrate faster and often have better EC2 Spot availability. Exostellar recommends using these selected instance types and ensures on-demand reliability with them.

[Image: Screenshot 2024-10-10 at 15.20.42.png]
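
For reference, AWS exposes the two-minute warning mentioned above through the EC2 instance metadata service. The following is a minimal sketch of checking for that notice from inside an instance using IMDSv2. It illustrates the AWS notice only and is not Exostellar's prediction or migration logic; the function names `parse_action` and `spot_action` are hypothetical.

```shell
# Sketch only: reads the standard EC2 Spot interruption notice.
# Function names are hypothetical and not part of any Exostellar tooling.

IMDS="http://169.254.169.254/latest"

# Extract the "action" field from an instance-action JSON document.
# Uses sed so that no JSON tooling is required.
parse_action() {
  printf '%s\n' "$1" |
    sed -n 's/.*"action"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Print the pending action (for example "terminate"), or "none" when no
# notice is present or the metadata service cannot be reached.
spot_action() {
  token=$(curl -sf --max-time 2 -X PUT "$IMDS/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 60") || { echo none; return 0; }
  body=$(curl -sf --max-time 2 "$IMDS/meta-data/spot/instance-action" \
    -H "X-aws-ec2-metadata-token: $token") || { echo none; return 0; }
  action=$(parse_action "$body")
  echo "${action:-none}"
}
```

On an instance with no pending interruption the endpoint returns HTTP 404, so `spot_action` prints `none`.
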
What happens if there are no available Spot instances?

Exostellar’s scheduler searches for EC2 Spot availability among the related instance types you have identified in the same region. If no EC2 Spot capacity is available for any of those types, the system automatically and safely migrates your workloads back to On-Demand instances, so your operations are not disrupted.
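
The fallback order described above can be pictured with a toy sketch. This is an illustration of the decision only, under the assumption that Spot availability is already known as a simple list; it does not reflect how Exostellar's scheduler is implemented, and `pick_capacity` is a hypothetical name.

```shell
# pick_capacity AVAILABLE_TYPES CANDIDATE...
#   AVAILABLE_TYPES: space-separated instance types that currently have
#                    Spot capacity (hypothetical input for illustration).
#   CANDIDATE...:    the related instance types you identified, in order.
# Prints "spot:<type>" for the first candidate with Spot capacity,
# otherwise "on-demand:<first candidate>".
pick_capacity() {
  available=" $1 "
  shift
  for t in "$@"; do
    case "$available" in
      *" $t "*) echo "spot:$t"; return 0 ;;
    esac
  done
  # No candidate has Spot capacity: fall back to On-Demand.
  echo "on-demand:$1"
}
```

For example, `pick_capacity "c5.large m5.large" r5.large m5.large` selects `m5.large` on Spot, while an empty availability list falls back to On-Demand for the first candidate.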

What is your latency?

Based on existing tests with HPC workloads, the added latency is minimal compared with native AWS EC2 instances. Performance can vary with the specific application; on average the overhead is very small, and in some cases workloads run even faster.

How much can I save using Infrastructure Optimizer?

The exact savings depend on several factors, including your actual cloud usage, chosen region, selected instance types, and EC2 Spot availability. Exostellar provides a savings estimation tool on its website's homepage to help you calculate potential savings for your specific setup. You can try it out here: https://exostellar.io/

[Image: Screenshot 2024-10-10 at 17.59.27.png]

Troubleshooting

Where can I find all logs for debugging?

To diagnose issues with the Exostellar Management Server, start with the following log locations:

  • /var/log/messages

  • /var/log/munge/munged.log

  • /var/log/slurm/aws.log

  • /var/log/slurm/slurmctld.log

  • /home/slurm

Logs for debugging the Exostellar Controller and the Worker can be found at:

  • /xcompute/slurm/bin/xcompute-daemon/data/

  • /xcompute/logs/

    • This directory contains a messages directory and an xspot directory. The xspot directory holds X-Spot specific logs, as well as logs for the jobs and workers spawned by that controller.
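
When looking through the locations above, a quick first pass is to see which files mention errors at all. The sketch below assumes a grep that supports `-r` (GNU and BSD grep both do); `scan_logs` is a hypothetical helper name, and the default pattern `error` is only a starting point.

```shell
# scan_logs DIR [PATTERN]
# Recursively list files under DIR that contain PATTERN (case-insensitive,
# extended regex), printed as "file:matching-line-count".
# Hypothetical helper for illustration; not part of Exostellar tooling.
scan_logs() {
  dir=$1
  pattern=${2:-error}
  # -c prints a count per file; drop files with zero matches.
  grep -rciE -- "$pattern" "$dir" 2>/dev/null | grep -v ':0$' || true
}
```

Typical calls would be `scan_logs /xcompute/logs` or `scan_logs /var/log/slurm 'error|fail'`; reading some of these files may require elevated permissions.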

If you need help troubleshooting, use the following command to pack all logs on the Exostellar Management Server, then upload the resulting file with your support request:

tar -czvf exostellar-logs-$(date +%Y-%m-%d).tar.gz /var/log/messages /var/log/munge/munged.log /var/log/slurm/aws.log /var/log/slurm/slurmctld.log /home/slurm /xcompute/logs /xcompute/slurm/bin/xcompute-daemon/data
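
GNU tar reports an error and exits with a non-zero status when a listed path does not exist, although it still archives the remaining paths. If some of the locations are absent on your system, a wrapper along the following lines packs only what is present. This is a sketch under the assumption that the paths contain no whitespace; `existing_paths` and `pack_logs` are hypothetical names.

```shell
# Print each given path that exists, one per line; note the rest on stderr.
existing_paths() {
  for p in "$@"; do
    if [ -e "$p" ]; then
      printf '%s\n' "$p"
    else
      printf 'skipping missing path: %s\n' "$p" >&2
    fi
  done
}

# Pack the existing paths into a dated archive in the current directory,
# using the same archive naming as the command above, and print its name.
pack_logs() {
  archive="exostellar-logs-$(date +%Y-%m-%d).tar.gz"
  paths=$(existing_paths "$@")
  [ -n "$paths" ] || { echo "no log paths found" >&2; return 1; }
  # Assumption: paths contain no whitespace, so word splitting is safe here.
  # shellcheck disable=SC2086
  tar -czf "$archive" $paths && echo "$archive"
}
```

It would be called with the same path list as the command above, for example `pack_logs /var/log/messages /var/log/munge/munged.log /var/log/slurm/aws.log /var/log/slurm/slurmctld.log /home/slurm /xcompute/logs /xcompute/slurm/bin/xcompute-daemon/data`.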