Upgrade Omnia

Omnia supports in-place upgrades from version 2.1.0.0 to 2.2.0.0. The upgrade is a three-phase workflow: core container upgrade (Phase 0), prepare (Phase 1), and execute (Phase 2). Components are upgraded in a fixed order, with lock files guarding against concurrent runs and a manifest tracking progress so that reruns are idempotent.

Important

  • Upgrades must be initiated from the OIM host using omnia.sh --upgrade before entering the omnia_core container.

  • The upgrade orchestrator must be invoked from the parent directory containing upgrade/ folders.

  • Ensure a full backup of the OIM node is taken before starting the upgrade.

Supported Upgrade Paths

Source Version    Target Version
Omnia 2.1.0.0     Omnia 2.2.0.0

Note

Direct upgrades that skip a release (e.g., 2.0 → 2.2) are not supported. Upgrade one version at a time.

Prerequisites

Before starting the upgrade, ensure the following prerequisites are met:

  1. The OIM node is running and accessible.

  2. The omnia_core container is running and the cluster is currently on Omnia 2.1.0.0.

  3. All compute nodes are in a healthy state.

  4. A full backup of the OIM node and critical data has been taken (NFS shares, credentials, and configuration files).

  5. No other upgrade or rollback is currently in progress.

  6. oim_metadata.yml at /opt/omnia/.data/oim_metadata.yml contains the correct current version information.

  7. The target Omnia 2.2.0.0 core container image (omnia_core:2.2) is available locally on the OIM host. If it is not already available, build it as described in Build the Omnia 2.2.0.0 Core Container Image.

  8. aarch64 clusters only: If the PXE mapping file contains aarch64 functional groups, an inventory file with an [admin_aarch64] group is required. This group must contain exactly one ARM admin node. See the example inventory under Phase 2: Execute Upgrade for details.

Build the Omnia 2.2.0.0 Core Container Image

The upgrade swaps the running omnia_core container to the 2.2.0.0 image. This image must be present on the OIM host before you run omnia.sh --upgrade. To build it:

  1. On the OIM host, clone the Omnia artifactory repository on the omnia-container-v2.2.0.0 branch:

    git clone -b omnia-container-v2.2.0.0 https://github.com/dell/omnia-artifactory.git
    
  2. Build the core container image using the build script provided in the repository:

    cd omnia-artifactory
    ./build_images.sh core core_tag=2.2 omnia_branch=v2.2.0.0
    
  3. Confirm the image is available locally before proceeding:

    podman images | grep omnia_core
    

Note

Ensure the OIM host has stable internet connectivity and sufficient disk space while building the container image.
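
The image check in step 3 can be scripted so that a missing image is caught before running omnia.sh --upgrade. The following is a minimal sketch, assuming the image is listed as omnia_core with tag 2.2, optionally prefixed by a registry name such as localhost/. The function name has_core_image is hypothetical and is not part of Omnia.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Omnia): succeeds if an image listing read
# from stdin contains omnia_core with the given tag, with or without a
# registry prefix such as "localhost/".
has_core_image() {
    # Escape dots so that "2.2" only matches a literal "2.2".
    local tag="${1//./\\.}"
    grep -Eq "^([^[:space:]]+/)?omnia_core:${tag}\$"
}

# Usage on the OIM host (requires podman):
#   podman images --format '{{.Repository}}:{{.Tag}}' | has_core_image 2.2 \
#       && echo "omnia_core:2.2 is available" \
#       || echo "omnia_core:2.2 not found; build it first" >&2
```

Reading the listing from stdin keeps the check independent of how the image list is produced.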

Upgrade Workflow

Phase 0: Core Container Upgrade (OIM Host)

The upgrade begins on the OIM host outside the omnia_core container:

  1. Run the core container upgrade command:

    sudo ./omnia.sh --upgrade
    
  2. The script performs the following:

    • Detects current version from oim_metadata.yml

    • Shows available upgrade targets

    • Validates version and image availability

    • Requests user approval

    • Creates backup of configs, metadata, and input files

    • Swaps or restarts the omnia_core container to the 2.2 image

    • Creates upgrade guard lock at /opt/omnia/.data/upgrade_in_progress.lock

    • Seeds new input defaults

    • Displays post-upgrade instructions

  3. After the container swap completes, SSH into the new omnia_core container to proceed with input preparation and component upgrades.

Upgrade Component Order

The upgrade orchestrator processes components in the following fixed order:

Order  Component     Description
1      oim           Omnia Infrastructure Manager (includes OpenCHAMI)
2      build_stream  BuildStreaM enablement / upgrade (terminal gate)
3      local_repo    Local repository staging
4      build_image   Compute image rebuild
5      provision     Cloud-Init and BSS configuration generation
6      k8s           Kubernetes cluster upgrade
7      telemetry     Telemetry component upgrade
8      slurm         Slurm cluster upgrade

Safety Mechanisms

The upgrade is designed to be safe to rerun and to fail cleanly:

  • Validation before changes — All validation checks (version, locks, existing upgrade state) run before any change is made to the cluster. If a check fails, the upgrade stops without leaving the system in a locked state.

  • Automatic lock cleanup — If a component fails partway, the upgrade lock is still released at the end of the run so you can investigate and rerun without manually clearing locks.

  • Idempotent reruns — Already-completed components are skipped automatically when you rerun the upgrade, so only pending or failed components are processed.
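
The skip-on-rerun behavior can be pictured with a short sketch. This is an illustration only, not Omnia's orchestrator code: the function name plan_rerun and the way statuses are passed in are assumptions. It walks the fixed component order and marks a component as skipped only when its recorded status is completed.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; not Omnia's orchestrator code.
# Fixed component order used by the upgrade.
COMPONENTS=(oim build_stream local_repo build_image provision k8s telemetry slurm)

# plan_rerun STATUS...
# Takes one recorded status per component, in the order above, and prints
# "<component>: skip" for completed components and "<component>: run" otherwise.
plan_rerun() {
    local statuses=("$@") i status
    for i in "${!COMPONENTS[@]}"; do
        status="${statuses[i]:-pending}"   # missing entries are treated as pending
        case "$status" in
            completed) echo "${COMPONENTS[i]}: skip" ;;
            *)         echo "${COMPONENTS[i]}: run"  ;;
        esac
    done
}

# Example: oim and build_stream finished, local_repo failed, the rest never started.
#   plan_rerun completed completed failed
```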

Phase 1: Prepare Upgrade

The prepare_upgrade.yml playbook transforms input files from the source version format to the target version format, restores credentials from backup, and presents a summary for user review.

  1. SSH into the OIM node and enter the omnia_core container:

    ssh omnia_core
    
  2. Run the prepare upgrade playbook:

    cd /omnia/upgrade
    ansible-playbook prepare_upgrade.yml
    
  3. Review the output summary. The playbook identifies:

    • Automatically migrated files — copied as-is (e.g., provision_config.yml, omnia_config.yml).

    • Files requiring review — new parameters added in the target version (e.g., network_spec.yml, telemetry_config.yml).

  4. Update any new or changed parameters in /opt/omnia/input/project_default/ as needed.

Phase 2: Execute Upgrade

Run the full upgrade:

cd /omnia/upgrade
ansible-playbook upgrade.yml

aarch64 clusters: If your PXE mapping file contains aarch64 functional groups (e.g., slurm_node_aarch64), you must pass an inventory file with the [admin_aarch64] group:

cd /omnia/upgrade
ansible-playbook upgrade.yml -i <inventory_file>

The inventory file must define exactly one ARM admin node under the [admin_aarch64] group. Example inventory:

[admin_aarch64]
<arm_admin_node_ip_or_hostname>

Note

  • The [admin_aarch64] group must contain exactly one host. Multiple hosts or an empty group will cause the upgrade to fail.

  • The ARM admin node must be accessible via SSH from the OIM host.

  • NFS must be configured on the OIM for aarch64 image building to work.

  • If your cluster has only x86_64 nodes (no aarch64 entries in the PXE mapping file), the -i option is not required.
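
Because an empty or multi-host [admin_aarch64] group causes the upgrade to fail, it can help to check the inventory file before passing it with -i. The following is a minimal sketch, assuming an INI-style inventory with one host per line. The function names count_admin_aarch64 and check_admin_aarch64 are hypothetical and are not part of Omnia.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of Omnia).

# Counts host entries under the [admin_aarch64] group of an INI-style
# inventory file. Blank lines and lines starting with # or ; are ignored.
count_admin_aarch64() {
    awk '
        /^[ \t]*\[/ { in_group = ($0 ~ /^[ \t]*\[admin_aarch64\][ \t]*$/); next }
        in_group && !/^[ \t]*([#;]|$)/ { n++ }
        END { print n + 0 }
    ' "$1"
}

# Succeeds only if the group contains exactly one host.
check_admin_aarch64() {
    local n
    n="$(count_admin_aarch64 "$1")"
    if [ "$n" -ne 1 ]; then
        echo "expected exactly one host under [admin_aarch64], found $n" >&2
        return 1
    fi
}

# Usage (the inventory path is a placeholder):
#   check_admin_aarch64 <inventory_file> && ansible-playbook upgrade.yml -i <inventory_file>
```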

Lock Management

The upgrade orchestrator uses lock files to prevent concurrent operations:

  • /opt/omnia/.data/upgrade_in_progress.lock — Created at the start of the upgrade. Removed only on successful completion.

  • /opt/omnia/.data/rollback_in_progress.lock — If this lock exists, the upgrade aborts with an error. A rollback must complete (or the lock must be manually removed) before an upgrade can proceed.

Note

The omnia.sh --upgrade wrapper may pre-create the upgrade lock. The playbook detects this and proceeds normally without failing.
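
To see which locks are present before starting or rerunning an upgrade, the two lock files can be inspected directly. The following is a minimal sketch that uses the lock file names documented above. The function name lock_state is hypothetical, and the sketch only reports state; it never removes a lock.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Omnia): reports which lock file exists in
# the given data directory (default: /opt/omnia/.data).
# Prints "rollback", "upgrade", or "none". Returns non-zero only for
# "rollback", because a rollback lock causes the upgrade to abort.
lock_state() {
    local data_dir="${1:-/opt/omnia/.data}"
    if [ -e "$data_dir/rollback_in_progress.lock" ]; then
        echo "rollback"
        return 1
    elif [ -e "$data_dir/upgrade_in_progress.lock" ]; then
        echo "upgrade"
    else
        echo "none"
    fi
}

# Usage inside the omnia_core container:
#   lock_state || echo "A rollback is in progress; resolve it before upgrading" >&2
```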

Manifest Tracking

The upgrade state is tracked in /opt/omnia/.data/upgrade_manifest.yml. This manifest records:

  • upgrade_id — Unique identifier for this upgrade run.

  • source_version — The version being upgraded from (derived from oim_metadata.yml).

  • target_version — The version being upgraded to.

  • upgrade_status — Overall status: in-progress, completed, or partial.

  • component_status — Per-component status: pending, in-progress, completed, skipped, or failed.

On rerun, already-completed components are automatically skipped. This ensures idempotent execution — you can safely rerun the upgrade after fixing a failed component.
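
After a failed run it is useful to list which components still need work. The following is a minimal sketch that assumes the manifest stores component_status as a top-level key followed by indented "name: status" lines. That layout is an assumption, since only the field names are documented here, so check it against your own manifest before relying on the output. The function name unfinished_components is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Omnia). ASSUMED manifest layout:
#
#   component_status:
#     oim: completed
#     k8s: failed
#
# Prints "<component> <status>" for every component whose status is neither
# completed nor skipped.
unfinished_components() {
    awk '
        /^component_status:/ { in_block = 1; next }
        in_block && /^[^ \t#]/ { in_block = 0 }   # next top-level key ends the block
        in_block && /^[ \t]+[A-Za-z0-9_]+:[ \t]*[A-Za-z-]+/ {
            name = $1
            sub(/:$/, "", name)
            if ($2 != "completed" && $2 != "skipped") print name, $2
        }
    ' "$1"
}

# Usage:
#   unfinished_components /opt/omnia/.data/upgrade_manifest.yml
```

An empty result means every component is either completed or skipped.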

BuildStreaM Terminal Gate

When enable_build_stream=true in build_stream_config.yml, the BuildStreaM terminal gate activates. The upgrade playbook determines the BuildStreaM path based on the state in 2.1:

PATH A: BuildStreaM was ENABLED in 2.1 (upgrade path)

  • Upgrade BuildStreaM container image (quadlet update)

  • PostgreSQL data migration (pg_dump → restore to new schema)

  • Update GitLab configuration (URLs, runner tokens, registry)

  • Validate BuildStreaM container + GitLab healthy

PATH B: BuildStreaM was DISABLED in 2.1, ENABLED in 2.2 (fresh install)

  • NFS share cleanup: remove stale K8s and Slurm NFS share data

  • Fresh install: PostgreSQL container (new instance)

  • Fresh install: BuildStreaM container

  • Fresh install: GitLab container + runner registration

  • Validate all three containers healthy

Note

NFS share cleanup is not automatic — the playbook displays guidance and prompts the operator to confirm manual cleanup before proceeding. The playbook verifies that NFS share directories are empty or absent after operator confirmation.

After the build_stream component completes, the following downstream components are automatically skipped:

  • local_repo

  • build_image

  • provision

  • k8s

  • telemetry

  • slurm

These components are managed by the GitLab CI/CD pipeline instead. You must trigger the GitLab pipeline manually after the upgrade completes. The GitLab pipeline always performs a fresh install; incremental (delta) builds are not supported.

Note

When enable_build_stream=false, the build_stream component is marked skipped in the manifest instead of being left as pending.
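
The gate decision described above reduces to two inputs: whether BuildStreaM is enabled in 2.2 and whether it was enabled in 2.1. The following sketch captures that decision table only. The function name build_stream_path and its string results are hypothetical, and how the playbook actually detects the 2.1 state is not covered here.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; not the playbook's implementation.
# build_stream_path ENABLED_IN_2_2 ENABLED_IN_2_1   (each "true" or "false")
build_stream_path() {
    local enabled_now="$1" enabled_in_21="$2"
    if [ "$enabled_now" != "true" ]; then
        echo "skipped"                # component is marked skipped in the manifest
    elif [ "$enabled_in_21" = "true" ]; then
        echo "path_a_upgrade"         # PATH A: upgrade the existing BuildStreaM
    else
        echo "path_b_fresh_install"   # PATH B: fresh install after NFS cleanup
    fi
}

# Example:
#   build_stream_path true false
```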

Slurm Upgrade

The Slurm upgrade updates cloud-init and BSS configurations and reboots all Slurm and login nodes to apply them.

Warning

  • All Slurm and login nodes reboot simultaneously. Ensure no critical jobs are running.

  • Do not modify Slurm node entries in the PXE mapping file until the upgrade completes.

  • Omnia 2.1 NFS mount points are preserved. Do not modify them during the upgrade.

The upgrade performs the following steps:

  1. Updates cloud-init and BSS configurations for each Slurm functional group (slurm_control_node, slurm_node, login).

  2. Reboots all Slurm and login nodes simultaneously with a 600-second timeout per node.

  3. Waits for SSH connectivity to be restored on each node (up to 60 seconds).

  4. Validates Slurm services using sinfo with retries on each node.

  5. Generates a node status report with the following categories:

    • Successful — Reboot complete, SSH active, sinfo responding

    • Unreachable — Node was not reachable before reboot

    • Reboot Failed — Reboot command failed

    • SSH Failure — Node did not reconnect after reboot

    • Sinfo Failure — Slurm services did not respond after reboot
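
The report categories follow the order of the steps: a node is placed in the first category whose check it fails. The following sketch expresses that ordering, assuming each check result is available as "yes" or "no". The function name classify_node and the argument convention are hypothetical and do not reflect how the playbook stores results.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; not the playbook's implementation.
# classify_node REACHABLE REBOOT_OK SSH_OK SINFO_OK   (each "yes" or "no")
# Prints the report category for one node, using the first failed check.
classify_node() {
    local reachable="$1" reboot_ok="$2" ssh_ok="$3" sinfo_ok="$4"
    if   [ "$reachable" != "yes" ]; then echo "Unreachable"
    elif [ "$reboot_ok" != "yes" ]; then echo "Reboot Failed"
    elif [ "$ssh_ok"    != "yes" ]; then echo "SSH Failure"
    elif [ "$sinfo_ok"  != "yes" ]; then echo "Sinfo Failure"
    else                                 echo "Successful"
    fi
}

# Example: the node rebooted and reconnected, but sinfo did not respond.
#   classify_node yes yes yes no
```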

Post-Upgrade Verification

After the upgrade completes, verify the following:

  1. Check the upgrade summary displayed at the end of the playbook run.

  2. Verify the upgrade manifest:

    cat /opt/omnia/.data/upgrade_manifest.yml
    
  3. Confirm all component statuses show completed or skipped.

  4. Validate cluster health:

    • Kubernetes cluster: kubectl get nodes

    • Slurm cluster: sinfo

    • Telemetry: Verify metrics are being collected

  5. If BuildStreaM is enabled, trigger the GitLab pipeline for downstream components.

For troubleshooting upgrade issues, see Upgrade and Rollback Troubleshooting.
