New Features
The following sections describe the new features and enhancements introduced in Omnia 2.2 releases.
BuildStreaM Pipeline Architecture and API Enhancements
Omnia BuildStreaM now supports enhanced pipeline architecture and API capabilities for improved scalability, reliability, and operational flexibility.
The key enhancements include:
Resume & Retry Capability: Retry failed stages with smart resume (artifact reuse), re-run deploy stages after success, per-attempt log segregation, and integration with GitLab native retry mechanisms
Pipeline Decomposition: Split the monolithic pipeline into Build and Deploy pipelines with a parent-child architecture, enabling independent execution and better scalability
Dynamic Child Pipeline Generation: Automatic generation of child pipelines with actual image_group names for the image selection workflow
Image Group Lifecycle Tracking: Automated tracking through BUILT → DEPLOYING → DEPLOYED → VALIDATING → PASSED/FAILED → CLEANED states
Cleanup Capability: Manual cleanup operations via GitLab pipeline for removing old images when the build image count exceeds the configured limit
PowerScale Support: Dell PowerScale as optional S3 backend alongside MinIO/NFS
For detailed information, see BuildStreaM Documentation.
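As a rough illustration of the image group lifecycle listed above, the sketch below models the states as a small state machine. The state names come from the list; the transition table is an assumption read off the arrow order, and `ImageGroupState`, `TRANSITIONS`, and `advance` are hypothetical names, not BuildStreaM's actual implementation. Real pipelines may allow additional moves (for example, re-running a deploy stage after success).

```python
from enum import Enum


class ImageGroupState(Enum):
    BUILT = "BUILT"
    DEPLOYING = "DEPLOYING"
    DEPLOYED = "DEPLOYED"
    VALIDATING = "VALIDATING"
    PASSED = "PASSED"
    FAILED = "FAILED"
    CLEANED = "CLEANED"


# Assumed forward transitions, taken from the arrow order in the feature list.
# Retry or re-deploy paths are deliberately left out of this sketch.
TRANSITIONS = {
    ImageGroupState.BUILT: {ImageGroupState.DEPLOYING},
    ImageGroupState.DEPLOYING: {ImageGroupState.DEPLOYED},
    ImageGroupState.DEPLOYED: {ImageGroupState.VALIDATING},
    ImageGroupState.VALIDATING: {ImageGroupState.PASSED, ImageGroupState.FAILED},
    ImageGroupState.PASSED: {ImageGroupState.CLEANED},
    ImageGroupState.FAILED: {ImageGroupState.CLEANED},
    ImageGroupState.CLEANED: set(),
}


def advance(current, target):
    """Return target if the move is allowed, otherwise raise ValueError."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Keeping the allowed moves in one table makes it easy to reject out-of-order updates, such as marking an image group PASSED before it was ever deployed.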
Vector Telemetry Pipeline for Data Routing
Omnia now supports Vector as a high-performance data pipeline tool for collecting, transforming, and routing telemetry data from LDMS and OpenManage Enterprise (OME) sources to VictoriaMetrics and VictoriaLogs. This deployment provides enhanced telemetry data flow management with dedicated write-buffer components.
For detailed configuration instructions, see Vector Telemetry Pipeline Configuration.
PowerScale Telemetry for Storage Monitoring
Omnia now supports PowerScale Telemetry for collecting storage performance metrics and logs from Dell PowerScale storage nodes. This deployment provides comprehensive storage observability with CSM Metrics for PowerScale, OpenTelemetry Collector, and integration with CSI Driver for Dell PowerScale.
For detailed configuration instructions, see PowerScale Telemetry Configuration.
Vast Repo and Vast Client Installation
Omnia now supports installing the Vast NFS client on cluster nodes. This feature allows you to:
Build the Vast repository from source using the provided download script
Host the Vast RPMs on an HTTP server (such as Apache) as a user repository
Configure the Vast repository in local_repo_config.yml
Automatically install the Vast client on cluster nodes during the provisioning process when an InfiniBand NIC is present
The Vast repository can be built and hosted following the steps documented in Vast Repo and Vast Client Installation.
Minimal OS Functional Groups
Omnia now supports Minimal OS functional groups (os_x86_64 and os_aarch64) that provide a clean operating system baseline designed specifically for downstream platform software installation.
For detailed information on functional groups and additional packages configuration, see Step 2: Create Mapping File with Node Information.
NVIDIA DCGM and CUDA Toolkit Provisioning for Slurm GPU Nodes
Omnia now delivers end-to-end automated GPU readiness for Slurm clusters. This feature covers NVIDIA driver installation, CUDA toolkit distribution to shared cluster storage, and NVIDIA Data Center GPU Manager (DCGM) setup — all performed during stateless node provisioning, without any user intervention on individual nodes.
NVIDIA driver installation on all GPU-capable Slurm compute nodes
CUDA toolkit made available cluster-wide via a shared NFS location accessible to all nodes simultaneously
DCGM installation with automatic CUDA version detection and appropriate package selection
Configurable DCGM enablement using dcgm.metrics_enabled under telemetry_sources in telemetry_config.yml (default: true)
nvidia-dcgm service enablement and validated startup on each GPU node
GPU enumeration and discovery validation using dcgmi
nvidia-peermem kernel module installation for GPUDirect RDMA-capable environments
Persistent CUDA environment configuration across login shells, non-login shells, and Slurm job environments
Nodes without NVIDIA GPU hardware are automatically skipped — no manual exclusion required
NVIDIA HPC SDK Provisioning for Slurm Clusters
Omnia now supports cluster-wide deployment of the NVIDIA HPC SDK (nvhpc) for Slurm
compiler and compute nodes. The SDK is installed once on the compiler node via DNF,
copied to shared NFS storage, and made available to all compute nodes through a
bind mount — eliminating repeated downloads or per-node installations.
NVIDIA HPC SDK installed on the compiler node via DNF using pre-configured NVIDIA repositories
SDK binaries and libraries copied to shared NFS at /hpc_tools/nvidia_sdk/nvhpc
All compute nodes mount the NFS copy via a local bind mount at /opt/nvidia/nvhpc
Persistent environment configuration written to /etc/profile.d/nvhpc.sh on every node, covering compilers (nvc, nvc++, nvfortran), MPI binaries, manual pages, and module files
Architecture-aware: supports both x86_64 and aarch64 without separate configuration
Nodes without a completed compiler-node installation are blocked with a clear error message rather than silently failing
Setup script (/usr/local/bin/setup_nvhpc_sdk.sh) is pre-deployed to all nodes during provisioning; the user invokes it post-provisioning at their discretion
For detailed setup instructions, see NVIDIA HPC SDK Setup.
One-Shot Combined Log Extraction for Debugging
Omnia provides a one-shot log collection playbook for gathering cluster logs from Kubernetes and Slurm nodes for debugging and support handoff.
Usage
cd omnia/log_collector
ansible-playbook collect.yml
Collection modes
Full mode (default): Collects all logs from target nodes
ansible-playbook collect.yml
Curated support mode: Excludes temporary and stale log files
ansible-playbook collect.yml --tags curated_support
What is collected
Kubernetes master nodes: Container logs, pod logs, CNI logs, runtime logs, system logs
Kubernetes worker nodes: System logs, bootstrap logs
Slurm controller nodes: Scheduler logs, service logs, database logs, system logs
Slurm compute nodes: Job logs, system logs
Login nodes: System logs, authentication logs
Login compiler nodes: System logs, authentication logs
Output artifacts
Workspace: /opt/omnia/collector_logs
Bundle: omnia_logs_<YYYYMMDD-HHMMSS>.tar.gz
Metadata: metadata.json (included in bundle)
Checksum: .sha256 file for integrity verification
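Before handing a bundle to support, it can help to check it against its checksum file. The sketch below is one way to do that in Python. It assumes the .sha256 file follows the common `sha256sum` layout, where the first whitespace-separated token is the hex digest; the source does not state the file's format. The function names `sha256_of` and `verify_bundle` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large bundles are never loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_bundle(bundle_path, checksum_path):
    """Return True when the bundle's SHA-256 matches the recorded digest."""
    # Assumption: the first token in the checksum file is the hex digest,
    # as written by the sha256sum utility.
    tokens = Path(checksum_path).read_text().split()
    if not tokens:
        raise ValueError(f"checksum file is empty: {checksum_path}")
    return sha256_of(bundle_path) == tokens[0].lower()
```

A call would look like `verify_bundle(bundle, checksum_file)`, with both paths pointing at the files produced in the workspace directory.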
Prerequisites
PXE mapping file must exist at /opt/omnia/input/project_default/pxe_mapping_file.csv
Nodes must be reachable from OIM
BMC Discovery via Dell OpenManage Enterprise
Omnia introduces automated BMC (Baseboard Management Controller) discovery via Dell OpenManage Enterprise (OME). This feature enables large-scale server discovery and automatic PXE mapping file generation, which is particularly useful for deployments with thousands of nodes.
Key Features
Automated server inventory collection from the OME REST API
Support for paginated API queries to handle large-scale deployments (100 servers per page)
Automatic extraction of service tags, iDRAC details, NIC MACs, and group membership
Scalable Unit (SU) extraction from iDRAC hostnames for logical grouping
Timestamped PXE mapping file generation for version control and audit trails
IP address derivation from BMC IPs using configured subnets
OME group mapping to functional groups for role-based provisioning
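The paginated collection mentioned in the list above can be pictured as a simple loop that requests fixed-size pages until a short page signals the end. The sketch below is a generic illustration, not Omnia's discovery code. `fetch_page` is a hypothetical callable standing in for whatever performs the actual OME request, and only the page size of 100 comes from the feature description.

```python
def fetch_all(fetch_page, page_size=100):
    """Collect every record by requesting pages until one comes back short.

    fetch_page(skip, top) is a hypothetical callable that returns a list of
    at most `top` records, starting at offset `skip`.
    """
    records = []
    skip = 0
    while True:
        page = fetch_page(skip, page_size)
        records.extend(page)
        if len(page) < page_size:
            # A short (or empty) page means there is nothing left to fetch.
            return records
        skip += page_size


# Usage with an in-memory stand-in for the remote inventory.
inventory = [f"server-{i}" for i in range(250)]


def fake_page(skip, top):
    return inventory[skip:skip + top]


all_servers = fetch_all(fake_page)
```

With 250 records and a page size of 100, the loop issues three requests (100, 100, and 50 records). When the total is an exact multiple of the page size, it issues one extra request that returns an empty page.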
Configuration Requirements
Dell OpenManage Enterprise (OME) appliance must be operational and have discovered target servers
input/discovery_config.yml - OME IP configuration
input/network_spec.yml - Network configuration for admin and InfiniBand subnets
OME credentials stored in Ansible Vault (omnia_config_credentials.yml)
Run prepare_oim to set up OME credentials
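The IP address derivation listed under Key Features depends on the subnets configured in network_spec.yml. The source does not describe the exact rule, so the sketch below shows one plausible scheme, stated purely as an assumption: keep the host's offset within the BMC subnet and apply it to the target subnet. The function name `derive_ip` and the example subnets are hypothetical.

```python
import ipaddress


def derive_ip(bmc_ip, bmc_subnet, target_subnet):
    """Map a BMC address onto another subnet by preserving its host offset.

    Assumption: this offset-preserving rule is illustrative only and may
    differ from what Omnia actually does.
    """
    bmc_net = ipaddress.ip_network(bmc_subnet)
    target_net = ipaddress.ip_network(target_subnet)
    address = ipaddress.ip_address(bmc_ip)
    if address not in bmc_net:
        raise ValueError(f"{bmc_ip} is not inside {bmc_subnet}")
    offset = int(address) - int(bmc_net.network_address)
    if offset >= target_net.num_addresses:
        raise ValueError(f"offset {offset} does not fit in {target_subnet}")
    return str(target_net.network_address + offset)


# Example with made-up subnets: the host part (.0.25) carries over.
admin_ip = derive_ip("10.3.0.25", "10.3.0.0/16", "10.5.0.0/16")  # "10.5.0.25"
```

Validating that the BMC address actually belongs to the configured subnet catches mismatched configuration early, before any mapping file is written.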
Usage
To perform BMC discovery using OME:
ansible-playbook discovery/discovery.yml -e "discovery_mechanism=ome"
This generates a timestamped PXE mapping file (bmc_pxe_mapping_file_<timestamp>.csv) and a BMC Discovery Report (bmc_discovery_report_<timestamp>.csv), which provides NIC link status information for all discovered servers.
Post-Discovery Workflow
Review the generated timestamped CSV file
Review the BMC Discovery Report for NIC link statuses (BMC, Ethernet, InfiniBand)
Adjust functional groups, group names, and hostnames as needed
Copy or rename the desired timestamped file to pxe_mapping_file.csv
Proceed with provisioning
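The copy step in the workflow above can be scripted once a reviewed file has been chosen. The sketch below is a minimal helper under stated assumptions: `promote_mapping_file` is a hypothetical name, the caller supplies both the reviewed file and the destination directory, and keeping a `.bak` copy of any existing mapping file is a design choice of this sketch rather than documented Omnia behavior.

```python
import shutil
from pathlib import Path


def promote_mapping_file(candidate, target_dir):
    """Copy a reviewed timestamped mapping file to pxe_mapping_file.csv.

    Any existing pxe_mapping_file.csv is first saved alongside it with a
    .bak suffix (an assumption of this sketch, to make the step reversible).
    """
    source = Path(candidate)
    if not source.is_file():
        raise FileNotFoundError(f"mapping file not found: {source}")
    target = Path(target_dir) / "pxe_mapping_file.csv"
    if target.exists():
        shutil.copy2(target, target.with_name(target.name + ".bak"))
    shutil.copy2(source, target)
    return target
```

Copying rather than renaming leaves the timestamped original in place, which preserves the audit trail that the timestamped file names are meant to provide.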
For more details, see BMC Discovery Configuration, BMC Discovery Rollback Considerations, and BMC Discovery Report Documentation.
Note
Magellan-based discovery is planned for a future release. Currently, only OME-based discovery is supported.
If you have any feedback about Omnia documentation, please reach out at omnia.readme@dell.com.