Release Information
Version 2.1.0.0
BuildStream: Catalog-Driven Build Automation
Omnia BuildStreaM provides a comprehensive automation solution for managing infrastructure build workflows. It uses a catalog-driven approach where you define your build requirements in a structured catalog file, and BuildStreaM executes automated pipelines to create and deploy images according to your specifications.
BuildStreaM addresses the key challenges in HPC cluster image management:
Automation: Eliminates manual build and deployment processes
Integration: Works seamlessly with existing Omnia deployments
Traceability: Provides complete audit trails for all build operations
Support for Installation of Additional Packages
Enables the installation of additional packages on the cluster nodes, allowing to extend cluster functionality with custom software and tools.
Add and Remove Slurm Compute Nodes
Provides the ability to add and remove Slurm compute nodes from the cluster, allowing for dynamic scaling of the cluster.
Support for Apptainer
Run apptainer pull to store the SIF container image on the cluster’s NFS-mounted shared storage. This ensures uniform access across all compute nodes, enabling them to run jobs from the same SIF file.
Telemetry Collection from OME and SFM
Enables collection of telemetry data from OpenManage Enterprise (OME) and Smart Fabric Manager (SFM), providing insights into cluster health, performance, and resource utilization.
PowerVault Storage Integration
The PowerVault integration, with a preconfigured volume, provides persistent storage for critical Slurm controller components using iSCSI block storage with multipath support. This ensures data persistence for Slurm’s StateSaveLocation and the MariaDB/MySQL database.
InfiniBand Networking Support
Provides comprehensive support for InfiniBand (IB) networking in HPC clusters, including automatic DOCA-OFED driver installation for Mellanox InfiniBand cards, and static IP assignment for IB interfaces. InfiniBand Networking requires an InfiniBand subnet manager (SM) configured and running to manage the IB fabric.
Version 2.0.0.0
Support for Podman Containers
Enables deployment of the following Omnia core services as Podman containers, ensuring secure, lightweight, and OCI-compliant environments for HPC clusters. This simplifies lifecycle management, accelerates updates, and improves isolation for critical services:
Omnia Core — Orchestrates infrastructure management operations.
Omnia Auth — Provides LDAP-based authentication.
OpenCHAMI — Powers diskless provisioning workflows.
Pulp Repository Service — Hosts local repositories for air-gapped deployments.
Repository Management
Provides a Pulp-based local repository service deployed as a Podman container, enabling secure and efficient package distribution in air-gapped HPC environments. This reduces dependency on external networks and accelerates provisioning workflows.
Authentication Service
Integrates an LDAP server within the Omnia Auth Podman container for centralized authentication and directory services. This enhances security and simplifies identity management across HPC clusters.
Telemetry Collection and Monitoring
Automates the configuration of Kubernetes Service Clusters to host essential monitoring components for telemetry collection and monitoring. The following telemetry capabilities are supported:
iDRAC Telemetry — Collects out-of-band system metrics, including power, thermal, and hardware health data, from Dell servers. Telemetry data is streamed as time-series data to Kafka or VictoriaMetrics, depending on deployment requirements. VictoriaPump is included for storing telemetry metrics in the VictoriaMetrics database.
LDMS Telemetry — Captures in-band performance metrics such as CPU, memory, network, and I/O usage from Slurm cluster nodes. Metrics are streamed as time-series data to Kafka for scalable ingestion and analysis.
Air-gapped telemetry support — Supports telemetry collection in air-gapped or offline environments to meet security and compliance requirements.
Kubernetes Cluster High Availability
Delivers built-in high-availability (HA) failover for Service Kubernetes Cluster control plane nodes, ensuring uninterrupted cluster management and improved resilience for HPC workloads.
Provisioning and Deployment Based on Functional Groups
Enables role-based provisioning for HPC clusters using mapping files. Automatically assigns functional roles (for example, Slurm Control Node and Login Node) and deploys customized operating system images tailored to workload-specific configurations.
The following functional roles are supported:
Login Node
Login Compiler Node
Slurm Node
Slurm Control Node
Service Kubernetes Node
Service Kubernetes Control Plane
Stateless Boot
Introduces stateless provisioning for RHEL 10 using OpenCHAMI, reducing deployment time and storage overhead for HPC clusters.
Automatic CUDA Installation for GPU Workloads
Automatically installs CUDA during node provisioning, ensuring GPU-enabled nodes are ready for HPC workloads immediately after deployment. This reduces manual setup time and accelerates readiness for GPU-intensive applications.
Security Enhancements
Credentials are now encrypted using industry-standard algorithms (for example, AES-256), improving compliance with security best practices and reducing the risk of data exposure.
Platform Support
Supports x86_64 and aarch64 architectures, enabling deployment on both traditional and ARM-based HPC nodes for improved flexibility and energy efficiency.
Input Template and Validator
Provides predefined configuration templates and early input validation to reduce configuration errors and accelerate HPC cluster provisioning. This improves deployment reliability and overall user experience.
If you have any feedback about Omnia documentation, please reach out at omnia.readme@dell.com.