Add or Remove Slurm Nodes in the Cluster

Omnia supports adding Slurm compute nodes to an existing cluster and removing them from it.

Add Slurm Node to the Cluster

To add a new Slurm node to the cluster, follow these steps:

For OME-based BMC discovery:

  1. Ensure the new node is added and configured in OpenManage Enterprise (OME).

  2. Run the discovery.yml playbook with the OME discovery mechanism to discover the new nodes:

    ssh omnia_core
    cd /omnia/discovery
    ansible-playbook discovery.yml -e "discovery_mechanism=ome"
    
  3. Update the PXE mapping file path in provision_config.yml.

  4. Run the provision.yml playbook:

    cd /omnia/provision
    ansible-playbook provision.yml
    
  5. PXE boot the newly added nodes.

  6. To enable telemetry collection using the iDRAC telemetry service, run the telemetry.yml playbook. For steps to initiate telemetry collection, see Step 15: Initialize and Verify Telemetry.

For mapping file discovery:

  1. Update the PXE mapping file with entries for the new Slurm nodes. Assign each new node the appropriate functional group, such as slurm_node_x86_64.

Note

When updating the mapping file, ensure that existing nodes are not removed from it.

Note

Addition of new slurm_control_node is not supported.

  2. Run the provision.yml playbook to discover the new nodes. For more information, see Provision the Cluster Nodes.

  3. PXE boot the newly added nodes.

  4. To enable telemetry collection using the iDRAC telemetry service, run the telemetry.yml playbook. For steps to initiate telemetry collection, see Step 15: Initialize and Verify Telemetry.

Note

You do not need to run the telemetry.yml playbook if the service Kubernetes cluster nodes are configured to collect telemetry data using only LDMS. By default, LDMS begins collecting data after the provision.yml playbook is executed.
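
For mapping file discovery, the earlier note about not removing existing nodes from the mapping file can be checked before you run provision.yml. The following is a minimal sketch and not part of Omnia. It assumes that you saved a copy of the mapping file before editing it. The function names and the file names in the usage comment (pxe_mapping_file.csv.bak, pxe_mapping_file.csv) are placeholders. The check compares whole lines, so it makes no assumption about the column layout of the mapping file.

```shell
# Hypothetical helpers (not Omnia commands) for checking an edited mapping file.

# Print every line of the saved copy ($1) that is missing from the edited file ($2).
removed_entries() {
    # -F: fixed strings, -x: match whole lines, -v: invert, -f: read patterns from file
    grep -Fxv -f "$2" "$1"
}

# Return 0 if every line of the saved copy is still present in the edited file,
# 1 if at least one line was dropped, and 2 if either file cannot be read.
mapping_keeps_existing_nodes() {
    [ -r "$1" ] && [ -r "$2" ] || return 2
    removed=$(removed_entries "$1" "$2" || true)
    [ -z "$removed" ]
}

# Example usage (file names are placeholders):
#   if ! mapping_keeps_existing_nodes pxe_mapping_file.csv.bak pxe_mapping_file.csv; then
#       echo "Existing entries were removed or changed:"
#       removed_entries pxe_mapping_file.csv.bak pxe_mapping_file.csv
#   fi
```

Because the comparison is exact, an existing line that was changed in any way, including whitespace or line endings, is reported as missing.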

Remove Slurm Node from the Cluster

To remove a Slurm node from the cluster, follow these steps:

For OME-based BMC discovery:

  1. Remove or reassign the node in OpenManage Enterprise (OME).

  2. Run the discovery.yml playbook with the OME discovery mechanism:

    ssh omnia_core
    cd /omnia/discovery
    ansible-playbook discovery.yml -e "discovery_mechanism=ome"
    
  3. Update the PXE mapping file path in provision_config.yml.

  4. Run the provision.yml playbook:

    cd /omnia/provision
    ansible-playbook provision.yml
    
  5. To stop telemetry collection from the removed nodes using the iDRAC telemetry service, run the telemetry.yml playbook.

For mapping file discovery:

  1. Update the PXE mapping file. Remove or reassign nodes that should no longer be part of the Slurm cluster.

  2. Run the provision.yml playbook.

  3. To stop telemetry collection from the removed nodes using the iDRAC telemetry service, run the telemetry.yml playbook.

Note

You do not need to run the telemetry.yml playbook to stop LDMS telemetry collection from the removed nodes. By default, LDMS stops collecting data from them after the provision.yml playbook is executed.
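
For mapping file discovery, it can help to confirm that only the intended nodes were dropped from the mapping file before you run provision.yml. The following is a minimal sketch and not part of Omnia. It assumes that you kept a copy of the mapping file from before the edit. The function name and the file names in the usage comment are placeholders.

```shell
# Hypothetical helper (not an Omnia command).
# List the lines that are present in the saved copy ($1) but absent from the edited file ($2).
entries_to_remove() {
    [ -r "$1" ] && [ -r "$2" ] || return 2
    old_sorted=$(mktemp) || return 1
    new_sorted=$(mktemp) || { rm -f "$old_sorted"; return 1; }
    # comm requires sorted input, so sort both files into temporary copies first
    sort "$1" > "$old_sorted"
    sort "$2" > "$new_sorted"
    # -23 suppresses lines unique to the second file and lines common to both,
    # leaving only the lines that exist solely in the saved copy
    comm -23 "$old_sorted" "$new_sorted"
    rm -f "$old_sorted" "$new_sorted"
}

# Example usage (file names are placeholders):
#   entries_to_remove pxe_mapping_file.csv.bak pxe_mapping_file.csv
```

Review the printed lines and continue with provision.yml only if they match the nodes you meant to remove or reassign. A reassigned node also appears in the output, because its line changed.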
