Upgrade Omnia
Omnia supports in-place upgrades from version 2.1.0.0 to 2.2.0.0. The upgrade process is a three-phase workflow: core container upgrade, prepare, and execute. Each component is upgraded in a defined order with lock-based safety and manifest tracking for idempotent reruns.
Important
Upgrades must be initiated from the OIM host using omnia.sh --upgrade before entering the omnia_core container.
The upgrade orchestrator must be invoked from the parent directory containing upgrade/ folders.
Ensure a full backup of the OIM node is taken before starting the upgrade.
Supported Upgrade Paths
| Source Version | Target Version |
|---|---|
| Omnia 2.1.0.0 | Omnia 2.2.0.0 |
Note
Direct upgrades that skip an intermediate version (e.g., 2.0 → 2.2) are not supported. Upgrade one version at a time.
Prerequisites
Before starting the upgrade, ensure the following prerequisites are met:
The OIM node is running and accessible.
The omnia_core container is running and the cluster is currently on Omnia 2.1.0.0.
All compute nodes are in a healthy state.
A full backup of the OIM node and critical data has been taken (NFS shares, credentials, and configuration files).
No other upgrade or rollback is currently in progress.
The oim_metadata.yml file at /opt/omnia/.data/oim_metadata.yml contains the correct current version information.
The target Omnia 2.2.0.0 core container image (omnia_core:2.2) is available locally on the OIM host. If it is not already available, build it as described in Build the Omnia 2.2.0.0 Core Container Image.
aarch64 clusters only: If the PXE mapping file contains aarch64 functional groups, an inventory file with an [admin_aarch64] group is required. This group must contain exactly one ARM admin node. See aarch64-inventory for details.
Build the Omnia 2.2.0.0 Core Container Image
The upgrade swaps the running omnia_core container to the 2.2.0.0 image. This image must be present on the OIM host before you run omnia.sh --upgrade. To build it:
On the OIM host, clone the Omnia artifactory repository on the omnia-container-v2.2.0.0 branch:
git clone -b omnia-container-v2.2.0.0 https://github.com/dell/omnia-artifactory.git
Build the core container image using the build script provided in the repository:
cd omnia-artifactory
./build_images.sh core core_tag=2.2 omnia_branch=v2.2.0.0
Confirm the image is available locally before proceeding:
podman images | grep omnia_core
Note
Ensure the OIM host has stable internet connectivity and sufficient disk space while building the container image.
Upgrade Workflow
Phase 0: Core Container Upgrade (OIM Host)
The upgrade begins on the OIM host outside the omnia_core container:
Run the core container upgrade command:
sudo ./omnia.sh --upgrade
The script performs the following:
Detects current version from oim_metadata.yml
Shows available upgrade targets
Validates version and image availability
Requests user approval
Creates backup of configs, metadata, and input files
Swaps or restarts the omnia_core container to the 2.2 image
Creates upgrade guard lock at /opt/omnia/.data/upgrade_in_progress.lock
Seeds new input defaults
Displays post-upgrade instructions
After the container swap completes, SSH into the new omnia_core container to proceed with input preparation and component upgrades.
Upgrade Component Order
The upgrade orchestrator processes components in the following fixed order:
| Order | Component | Description |
|---|---|---|
| 1 | | Omnia Infrastructure Manager (includes OpenCHAMI) |
| 2 | build_stream | BuildStreaM enablement / upgrade (terminal gate) |
| 3 | local_repo | Local repository staging |
| 4 | build_image | Compute image rebuild |
| 5 | provision | Cloud-Init and BSS configuration generation |
| 6 | k8s | Kubernetes cluster upgrade |
| 7 | telemetry | Telemetry component upgrade |
| 8 | slurm | Slurm cluster upgrade |
Safety Mechanisms
The upgrade is designed to be safe to rerun and to fail cleanly:
Validation before changes — All validation checks (version, locks, existing upgrade state) run before any change is made to the cluster. If a check fails, the upgrade stops without leaving the system in a locked state.
Automatic lock cleanup — If a component fails partway, the upgrade lock is still released at the end of the run so you can investigate and rerun without manually clearing locks.
Idempotent reruns — Already-completed components are skipped automatically when you rerun the upgrade, so only pending or failed components are processed.
Phase 1: Prepare Upgrade
The prepare_upgrade.yml playbook transforms input files from the source version format to the target version format, restores credentials from backup, and presents a summary for user review.
SSH into the OIM node and enter the omnia_core container:
ssh omnia_core
Run the prepare upgrade playbook:
cd /omnia/upgrade
ansible-playbook prepare_upgrade.yml
Review the output summary. The playbook identifies:
Automatically migrated files: copied as-is (e.g., provision_config.yml, omnia_config.yml).
Files requiring review: new parameters added in the target version (e.g., network_spec.yml, telemetry_config.yml).
Update any new or changed parameters in /opt/omnia/input/project_default/ as needed.
Phase 2: Execute Upgrade
Run the full upgrade:
cd /omnia/upgrade
ansible-playbook upgrade.yml
aarch64 clusters: If your PXE mapping file contains aarch64 functional groups (e.g., slurm_node_aarch64), you must pass an inventory file with the [admin_aarch64] group:
cd /omnia/upgrade
ansible-playbook upgrade.yml -i <inventory_file>
The inventory file must define exactly one ARM admin node under the [admin_aarch64] group. Example inventory:
[admin_aarch64]
<arm_admin_node_ip_or_hostname>
Note
The [admin_aarch64] group must contain exactly one host. Multiple hosts or an empty group will cause the upgrade to fail.
The ARM admin node must be accessible via SSH from the OIM host.
NFS must be configured on the OIM for aarch64 image building to work.
If your cluster has only x86_64 nodes (no aarch64 entries in the PXE mapping file), the -i option is not required.
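Because an empty or multi-host [admin_aarch64] group causes the upgrade to fail, it can help to check the inventory before running the playbook. The following is a minimal sketch, not part of Omnia: the function names count_group_hosts and check_admin_aarch64 are hypothetical, and it assumes a plain INI-style inventory with one host per line.

```shell
# Hypothetical helpers (not shipped with Omnia) for a pre-run inventory check.

# Count the hosts listed under one group of an INI-style Ansible inventory.
# Blank lines and comment lines (starting with # or ;) are ignored.
count_group_hosts() {
  local inventory="$1" group="$2"
  awk -v group="[$group]" '
    /^[ \t]*\[/ { in_group = ($1 == group); next }   # any [section] header
    in_group && NF && $1 !~ /^[#;]/ { count++ }      # a host line in our group
    END { print count + 0 }
  ' "$inventory"
}

# Return 0 only when [admin_aarch64] holds exactly one host.
check_admin_aarch64() {
  local n
  n=$(count_group_hosts "$1" admin_aarch64)
  if [ "$n" -ne 1 ]; then
    echo "expected exactly 1 host in [admin_aarch64], found $n" >&2
    return 1
  fi
}
```

One possible use is to gate the playbook on the check: check_admin_aarch64 <inventory_file> && ansible-playbook upgrade.yml -i <inventory_file>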
Lock Management
The upgrade orchestrator uses lock files to prevent concurrent operations:
/opt/omnia/.data/upgrade_in_progress.lock: Created at the start of the upgrade. Removed only on successful completion.
/opt/omnia/.data/rollback_in_progress.lock: If this lock exists, the upgrade aborts with an error. A rollback must complete (or the lock must be manually removed) before an upgrade can proceed.
Note
The omnia.sh --upgrade wrapper may pre-create the upgrade lock. The playbook detects this and proceeds normally without failing.
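To see which of the two locks is present before starting or rerunning an upgrade, a small check like the one below may be useful. It is a sketch only: lock_state is a hypothetical helper name, and the only facts it relies on are the two lock file paths listed above.

```shell
# Hypothetical helper (not shipped with Omnia): report which lock, if any,
# is present in the Omnia data directory. Prints "rollback", "upgrade", or "clear".
lock_state() {
  local data_dir="${1:-/opt/omnia/.data}"
  if [ -e "$data_dir/rollback_in_progress.lock" ]; then
    echo rollback   # an upgrade would abort while this lock exists
  elif [ -e "$data_dir/upgrade_in_progress.lock" ]; then
    echo upgrade    # an upgrade has started, or the wrapper pre-created the lock
  else
    echo clear
  fi
}
```

Calling lock_state with no argument inspects /opt/omnia/.data; passing a directory is mainly useful for testing the function against a scratch location.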
Manifest Tracking
The upgrade state is tracked in /opt/omnia/.data/upgrade_manifest.yml. This manifest records:
upgrade_id: Unique identifier for this upgrade run.
source_version: The version being upgraded from (derived from oim_metadata.yml).
target_version: The version being upgraded to.
upgrade_status: Overall status (in-progress, completed, or partial).
component_status: Per-component status (pending, in-progress, completed, skipped, or failed).
On rerun, already-completed components are automatically skipped. This ensures idempotent execution — you can safely rerun the upgrade after fixing a failed component.
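Before a rerun, it can be handy to list which manifest entries are not yet done. The sketch below greps for the unfinished status values named above. It assumes, without confirmation from this guide, that each status is written as a "key: value" pair on a single line; unfinished_entries is a hypothetical helper name.

```shell
# Hypothetical helper (not shipped with Omnia): print manifest lines, with line
# numbers, whose value is one of the unfinished statuses.
# Assumption: each status is stored as "key: value" on one line.
unfinished_entries() {
  local manifest="${1:-/opt/omnia/.data/upgrade_manifest.yml}"
  grep -nE ':[[:space:]]*"?(pending|in-progress|failed|partial)"?[[:space:]]*$' "$manifest"
}
```

The function exits non-zero when nothing matches, so under the stated layout assumption "unfinished_entries || echo 'nothing pending'" reports a fully completed manifest.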
BuildStreaM Terminal Gate
When enable_build_stream=true in build_stream_config.yml, the BuildStreaM terminal gate activates. The upgrade playbook determines the BuildStreaM path based on the state in 2.1:
PATH A: BuildStreaM was ENABLED in 2.1 (upgrade path)
Upgrade BuildStreaM container image (quadlet update)
PostgreSQL data migration (pg_dump → restore to new schema)
Update GitLab configuration (URLs, runner tokens, registry)
Validate BuildStreaM container + GitLab healthy
PATH B: BuildStreaM was DISABLED in 2.1, ENABLED in 2.2 (fresh install)
NFS share cleanup: remove stale K8s and Slurm NFS share data
Fresh install: PostgreSQL container (new instance)
Fresh install: BuildStreaM container
Fresh install: GitLab container + runner registration
Validate all three containers healthy
Note
NFS share cleanup is not automatic — the playbook displays guidance and prompts the operator to confirm manual cleanup before proceeding. The playbook verifies that NFS share directories are empty or absent after operator confirmation.
After the build_stream component completes, the following downstream components are automatically skipped:
local_repo
build_image
provision
k8s
telemetry
slurm
These components are managed by the GitLab CI/CD pipeline instead. The user must trigger the GitLab pipeline manually after upgrade completes. The GitLab pipeline always performs a fresh install — no incremental/delta builds are supported.
Note
When enable_build_stream=false, the build_stream component is marked skipped in the manifest instead of being left as pending.
Slurm Upgrade
The Slurm upgrade updates cloud-init and BSS configurations and reboots all Slurm and login nodes to apply them.
Warning
All Slurm and login nodes reboot simultaneously. Ensure no critical jobs are running.
Do not modify Slurm node entries in the PXE mapping file until the upgrade completes.
Omnia 2.1 NFS mount points are preserved. Do not modify them during the upgrade.
The upgrade performs the following steps:
Updates cloud-init and BSS configurations for each Slurm functional group (slurm_control_node, slurm_node, login).
Reboots all Slurm and login nodes simultaneously with a 600-second timeout per node.
Waits for SSH connectivity to restore on each node (up to 60 seconds).
Validates Slurm services using sinfo with retries on each node.
Generates a node status report with the following categories:
Successful: Reboot complete, SSH active, sinfo responding
Unreachable: Node was not reachable before reboot
Reboot Failed: Reboot command failed
SSH Failure: Node did not reconnect after reboot
Sinfo Failure: Slurm services did not respond after reboot
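If a node lands in the Sinfo Failure category, the same "check with retries" idea can be repeated by hand. The helper below is a generic sketch of that pattern, not the playbook's own logic: retry is a hypothetical function, and this guide does not state the attempt count or delay the upgrade itself uses.

```shell
# Hypothetical helper (not shipped with Omnia): run a command up to <attempts>
# times, sleeping <delay> seconds between tries. Returns 0 on the first success
# and 1 if every attempt fails.
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    i=$((i + 1))
    # Sleep only when another attempt remains.
    [ "$i" -le "$attempts" ] && sleep "$delay"
  done
  return 1
}
```

For example, "retry 5 10 ssh <node> sinfo" would try sinfo on a node five times, ten seconds apart; both numbers are illustrative values.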
Post-Upgrade Verification
After the upgrade completes, verify the following:
Check the upgrade summary displayed at the end of the playbook run.
Verify the upgrade manifest:
cat /opt/omnia/.data/upgrade_manifest.yml
Confirm all component statuses show completed or skipped.
Validate cluster health:
Kubernetes cluster: kubectl get nodes
Slurm cluster: sinfo
Telemetry: Verify metrics are being collected
If BuildStreaM is enabled, trigger the GitLab pipeline for downstream components.
For troubleshooting upgrade issues, see Upgrade and Rollback Troubleshooting.