Configure Deployment Required for LDMS Telemetry
Using Omnia, you can deploy Lightweight Distributed Metric Service (LDMS) to collect in-band telemetry from Slurm clusters. The deployment includes installing LDMS producers on Slurm nodes, deploying LDMS aggregator and store components on Service Kubernetes nodes, and integrating LDMS with Kafka for downstream telemetry processing.
LDMS collects system metrics such as CPU, memory, network, I/O, and Slurm job statistics. LDMS includes these components:
LDMS producer (collector): Collects local system metrics and runs on Slurm controller, compute, and login nodes.
LDMS aggregator: Receives and aggregates metrics from producers. Runs as a Kubernetes pod.
LDMS store: Buffers and stores metric batches reliably. Runs as a Kubernetes pod.
Kafka broker: Handles telemetry streaming for consumption by downstream systems.
For more details on LDMS, see Lightweight Distributed Metric Service.
Note
To consume LDMS metrics from the Kafka ‘ldms’ topic, transform to Prometheus format, and write to VictoriaMetrics, see Configure Vector Telemetry Pipeline to Route Data to Victoria from Kafka.
During deployment, Omnia attaches LDMS aggregator and store pods to the admin network. This configuration improves throughput between Slurm nodes and the Kubernetes cluster.
Supported LDMS Plugins
The following LDMS plugins are supported in Omnia:
meminfo: Memory usage statistics
procstat2: Process statistics
vmstat: Virtual memory statistics
loadavg: System load average
procnetdev2: Network interface statistics
Note
The LDMS Slurm sampler metrics are not supported in the current telemetry deployment.
Prerequisites
Ensure that the provision.yml playbook has been executed successfully with service_kube_control_plane and service_kube_node in the mapping file.
Steps
Specify the following entries in software_config.json. If any entry is missing, Omnia skips LDMS deployment and logs an informational message. For more information, see Input Parameters for Local Repositories.
{"name": "slurm_custom", "arch": ["x86_64","aarch64"]},
{"name": "service_k8s", "version": "1.34.1", "arch": ["x86_64"]},
{"name": "ldms", "arch": ["x86_64", "aarch64"]}
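Because a missing entry causes Omnia to skip LDMS deployment with only an informational message, it can help to check the file before running the playbooks. The following is a minimal sketch, not part of Omnia. It assumes the entries sit in a top-level list named softwares (the list_key argument); adjust that name if your software_config.json is laid out differently. The sample content is hypothetical.

```python
import json

# Names taken from the three entries listed above.
REQUIRED_SOFTWARE = {"slurm_custom", "service_k8s", "ldms"}


def missing_ldms_entries(config_text, list_key="softwares"):
    """Return the required software names that are absent from the config.

    list_key is an assumption about where the entries live in
    software_config.json; change it to match your file.
    """
    config = json.loads(config_text)
    present = {entry.get("name") for entry in config.get(list_key, [])}
    return REQUIRED_SOFTWARE - present


# Hypothetical file content with the ldms entry left out.
sample = """
{
  "softwares": [
    {"name": "slurm_custom", "arch": ["x86_64", "aarch64"]},
    {"name": "service_k8s", "version": "1.34.1", "arch": ["x86_64"]}
  ]
}
"""

print(sorted(missing_ldms_entries(sample)))  # prints ['ldms']
```

An empty result means all three entries are present.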
Ensure that the ldms.json file contains the following entries.
Note
If the LDMS RPM is not available, refer to Building LDMS Producer RPM Package for instructions on building LDMS RPMs.
The following ldms.json sample is for x86_64. For aarch64 architecture, update the repo name accordingly in the ldms.json file.
{
  "ldms": {
    "cluster": [
      {"package": "python3-devel", "type": "rpm", "repo_name": "x86_64_appstream"},
      {"package": "python3-cython", "type": "rpm", "repo_name": "x86_64_appstream"},
      {"package": "openssl-libs", "type": "rpm", "repo_name": "x86_64_baseos"},
      {"package": "ovis-ldms", "type": "rpm", "repo_name": "x86_64_ldms"}
    ]
  }
}
In local_repo_config.yml, specify the paths to the ovis-ldms RPMs in user_repo_url_x86_64 and user_repo_url_aarch64 accordingly.
Configure the omnia_config.yml:
omnia_config.yml Variables
Mandatory/Optional
Details
cluster_name
Mandatory
Type: String
Name of the cluster on which you want to deploy Kubernetes.
This input is case-sensitive. Do not add any special characters except _ (underscore) in the cluster name.
deployment
Mandatory
Type: Boolean
Indicates whether Kubernetes will be deployed.
Accepted values: true or false
k8s_cni
Mandatory
Type: String
Kubernetes SDN network.
Accepted values: calico
Default value: calico
pod_external_ip_range
Mandatory
Type: String
These addresses will be used by the load balancer for assigning external IPs to Kubernetes services.
Ensure that the IP range provided is not assigned to any node in the cluster.
Ensure that the pod_external_ip_range defined in the omnia_config.yml file is reachable from the OpenManage Enterprise appliance and the SFM network.
Sample values: 172.16.107.170-172.16.107.200
k8s_service_addresses
Optional
Type: String
Kubernetes internal network for services.
This network must be unused in your network infrastructure.
Default value: "10.233.0.0/18"
k8s_pod_network_cidr
Optional
Type: String
Kubernetes pod network CIDR for internal network. When used, it will assign IP addresses from this range to individual pods.
This network must be unused in your network infrastructure.
Default value: "10.233.64.0/18"
csi_powerscale_driver_secret_file_path
Optional
Type: File path
If you want to deploy the CSI driver for PowerScale on your service cluster, add the file path of the secrets.yaml file to this variable.
csi_powerscale_driver_values_file_path
Optional
Type: File path
If you want to deploy the CSI driver for PowerScale on your service cluster, add the file path of the values.yaml file to this variable.
nfs_storage_name
Mandatory
Type: String
Use the same name as mentioned in the name entries available in storage_config.yml.
k8s_crio_storage_size
Mandatory
Type: String
Specifies the disk size allocated for CRI-O container storage.
etcd_on_local_disk
Optional
Type: Boolean
Determines whether ETCD is deployed on local disk or NFS storage.
Accepted values: true or false
Default value: false
When set to true, ETCD is deployed on local disk on all master nodes. The system prioritizes the BOSS card if available, and falls back to SSD/SATA disks if BOSS is not present. The /var/lib/etcd directory is mounted on the selected local disk.
When set to false or omitted, ETCD storage is provisioned using NFS, and no local disk configuration is performed for ETCD.
Important: Migration from NFS to local disk is not supported during upgrades. This configuration is only applicable for fresh installations.
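Two of the omnia_config.yml constraints above lend themselves to a quick pre-flight check: the cluster_name character rule and the requirement that pod_external_ip_range not contain any node address. The sketch below is illustrative only and is not part of Omnia. It assumes that "special characters" means anything other than ASCII letters, digits, and underscore, and that the range is written as start-end as in the sample value. The node addresses in the usage lines are hypothetical.

```python
import ipaddress
import re


def valid_cluster_name(name):
    """True if the name uses only ASCII letters, digits, and underscores.

    Assumption: this is what "no special characters except _" means.
    """
    return re.fullmatch(r"[A-Za-z0-9_]+", name) is not None


def nodes_in_external_range(ip_range, node_ips):
    """Return the node IPs that fall inside a 'start-end' IP range."""
    start_text, end_text = ip_range.split("-")
    start = ipaddress.ip_address(start_text.strip())
    end = ipaddress.ip_address(end_text.strip())
    return [ip for ip in node_ips if start <= ipaddress.ip_address(ip) <= end]


# Hypothetical node addresses checked against the sample range.
conflicts = nodes_in_external_range(
    "172.16.107.170-172.16.107.200",
    ["172.16.107.10", "172.16.107.180"],
)
print(conflicts)  # prints ['172.16.107.180']
print(valid_cluster_name("slurm_cluster_1"))  # prints True
```

Any address returned by nodes_in_external_range is a node that overlaps the load balancer range and should be moved out of it.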
Ensure that the telemetry_config.yml has the entries specific for LDMS and Kafka deployment.
Note
For LDMS telemetry configuration, at least one sampler plugin is mandatory to collect system metrics.
telemetry_config.yml Parameter
Mandatory/Optional
Details
telemetry_sources > idrac > metrics_enabled
Mandatory
Type: Boolean
Enable or disable iDRAC metrics collection from Dell PowerEdge servers
Collected metrics: temperature, power, fan speed, storage health, CPU/memory errors
Data path:
iDRAC Receiver -> ActiveMQ -> KafkaPump -> Kafka ‘idrac’ topic
iDRAC Receiver -> ActiveMQ -> VictoriaPump -> vmagent -> victoria_metrics
Accepted values: true or false
Default value: true
Note
If iDRAC telemetry is enabled, the mysqldb_user, mysqldb_password, and mysqldb_root_password parameters in the omnia_config_credentials.yml file become mandatory.
Note
If you want to deploy only Slurm clusters (slurm_custom), set metrics_enabled to false.
telemetry_sources > idrac > collection_targets
Mandatory
Collection targets define where iDRAC data is sent before Vector processing
Supported values: victoria_metrics, kafka
Multiple targets: Can specify both [victoria_metrics, kafka]
Default: [victoria_metrics, kafka]
idrac_telemetry_configurations > mysqldb_storage
Conditional Mandatory
MySQL database storage for iDRAC telemetry
Purpose: Storage configuration for iDRAC telemetry MySQL database
Accepted values: Must be specified in the form of X[Ki|Mi|Gi|Ti|Pi|Ei]
Default value: 1Gi
Required when: telemetry_sources > idrac > metrics_enabled is true
telemetry_sources > ldms > metrics_enabled
Mandatory
Type: Boolean
Enable or disable LDMS metrics collection from compute nodes
Collected metrics: CPU, memory, network, disk metrics
Data path: LDMS samplers → LDMS aggregator → store_avro_kafka → Kafka ‘ldms’ topic
Accepted values: true or false
Default value: true
telemetry_sources > ldms > collection_targets
Mandatory
LDMS only supports Kafka collection (no direct victoria_metrics path)
Vector-LDMS bridge consumes from Kafka and routes to victoria_metrics
Supported values: kafka
Default: [kafka]
telemetry_sources > dcgm > metrics_enabled
Optional
Type: Boolean
Enable or disable DCGM (NVIDIA Data Center GPU Manager) metrics collection
Collected metrics: GPU temperature, utilization, memory, ECC errors, power
Requires: NVIDIA GPU driver installed on compute nodes
Accepted values: true or false
Default value: true
telemetry_sources > powerscale > metrics_enabled
Optional
Type: Boolean
Enable or disable PowerScale metrics collection from Dell PowerScale (OneFS) storage
Collected metrics: Storage metrics from Dell PowerScale clusters
Requires: CSM Observability (Karavi) values file configured
Data path: CSM Metrics PowerScale → OTEL Collector → vmagent(shared) → victoria_metrics
Accepted values: true or false
Default value: true
telemetry_sources > powerscale > logs_enabled
Optional
Type: Boolean
Enable or disable PowerScale logs collection
Accepted values: true or false
Default value: true
telemetry_sources > powerscale > collection_targets
Conditional Mandatory
PowerScale uses dedicated vmagent(shared) (no Kafka, no Vector)
Supported values: victoria_metrics, victoria_logs
Default: [victoria_metrics, victoria_logs]
telemetry_sources > ufm > metrics_enabled
Optional
Type: Boolean
Enable or disable UFM (NVIDIA UFM InfiniBand Fabric Manager) metrics collection
Collected metrics: IB port state, transmit/receive data, error counters, fabric topology
Requires: NVIDIA UFM appliance with Prometheus exporter enabled
Data path: UFM Prometheus Exporter → vmagent(shared) → victoria_metrics
Accepted values: true or false
Default value: false
telemetry_sources > ufm > logs_enabled
Optional
Type: Boolean
Enable or disable UFM syslog logs collection
Accepted values: true or false
Default value: false
telemetry_sources > ufm > collection_targets
Conditional Mandatory
UFM uses vmagent(shared) for metrics and VLAgent for logs
Supported values: victoria_metrics, victoria_logs
Default: [victoria_metrics, victoria_logs]
telemetry_sources > vast > metrics_enabled
Optional
Type: Boolean
Enable or disable VAST (Data Storage) metrics collection
Collected metrics: IB port state, transmit/receive data, error counters, fabric topology
Requires: VAST appliance with Prometheus exporter enabled
Data path: Prometheus Exporter → vmagent(shared) → victoria_metrics
Accepted values: true or false
Default value: false
telemetry_sources > vast > logs_enabled
Optional
Type: Boolean
Enable or disable VAST syslog logs collection
Accepted values: true or false
Default value: false
telemetry_sources > vast > collection_targets
Conditional Mandatory
VAST uses vmagent(shared) for metrics and VLAgent for logs
Supported values: victoria_metrics, victoria_logs
Default: [victoria_metrics, victoria_logs]
telemetry_bridges > vector_ldms > metrics_enabled
Optional
Type: Boolean
Enable or disable Vector-LDMS bridge (Kafka-to-victoria_metrics bridge for LDMS metrics)
Purpose: Consume LDMS metrics from Kafka ‘ldms’ topic, transform NERSC schema to Prometheus format, and write to victoria_metrics
Data flow: Kafka ‘ldms’ topic → Vector-LDMS → vmagent-vector → victoria_metrics
Requires: telemetry_sources > ldms > metrics_enabled = true
Accepted values: true or false
Default value: true
telemetry_bridges > vector_ome > metrics_enabled
Optional
Type: Boolean
Enable or disable Vector-OME metrics routing (Kafka-to-Victoria bridge for OME metrics)
Data flow: Kafka ‘ome.*’ topics → Vector-OME → vmagent-vector (metrics)
Requires: OME to be configured with Kafka
Accepted values: true or false
Default value: true
telemetry_bridges > vector_ome > logs_enabled
Optional
Type: Boolean
Enable or disable Vector-OME logs routing
Data flow: Kafka ‘ome.*’ topics → Vector-OME → vlagent-vector (logs)
Accepted values: true or false
Default value: true
telemetry_bridges > vector_ome > ome_identifier
Optional
Identifier used by Vector-OME for topic identification and routing
Internally used to match topics with the prefix (e.g., ^ome\\..*$)
Type: String
minLength: 1
Default value: ome
Note: Change only if your OME Kafka topics use a different prefix
telemetry_sinks > victoria_metrics > persistence_size
Conditional Mandatory
Storage per vmstorage pod PVC
Important: Total VictoriaMetrics storage depends on deployment mode:
Single-node mode: Total storage = persistence_size * 1 pod
Cluster mode: Total storage = persistence_size * 3 vmstorage pods
Example (cluster): 8Gi * 3 = 24Gi total VictoriaMetrics storage
Accepted values: Must be specified in the form of X[Ki|Mi|Gi|Ti|Pi|Ei]
Default value: 8Gi (results in 24Gi total storage for cluster mode)
telemetry_sinks > victoria_metrics > retention_period
Conditional Mandatory
Metric retention period in hours
Default: 168 (7 days)
telemetry_sinks > victoria_metrics > additional_metric_remote_write_endpoints
Optional
Additional remote write endpoints for metrics (optional)
Purpose: Send metrics to external VictoriaMetrics instances in addition to Omnia-managed VictoriaMetrics
Format: List of endpoint objects with ‘url’ field (must start with http:// or https://)
TLS: Set ‘tls_insecure_skip_verify: true’ to skip TLS certificate verification
Default: [] (empty; only Omnia VictoriaMetrics receives metrics)
Example:
- url: https://external-metrics-server:8480/insert/0/prometheus/api/v1/write
  tls_insecure_skip_verify: false
telemetry_sinks > victoria_logs > storage_size
Conditional Mandatory
Storage per vlstorage pod PVC
Total storage = storage_size × 3 vlstorage pods
Accepted values: Must be specified in the form of X[Ki|Mi|Gi|Ti|Pi|Ei]
Default value: 8Gi (results in 24Gi total storage)
Sizing formula: (140 MB/day × retention_days × node_count) / 3 replicas
Warning
Storage under-provisioning can lead to data loss before the retention period is reached. Calculate storage requirements based on expected log volume and retention needs.
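The sizing formula above can be worked through in a few lines. This is a rough sketch rather than an official sizing tool. It assumes that 1 MB is 10^6 bytes, that retention_period (given in hours) is rounded up to whole days, and that the result is rounded up to a whole Gi value. The function name and the 100-node example are hypothetical.

```python
def vlstorage_size_gi(retention_hours, node_count, mb_per_node_day=140, replicas=3):
    """Estimate storage_size per vlstorage pod, rounded up to whole Gi.

    Applies (140 MB/day x retention_days x node_count) / 3 replicas.
    Assumptions: 1 MB = 10**6 bytes; partial days count as full days.
    """
    retention_days = -(-retention_hours // 24)  # ceiling division
    total_bytes = mb_per_node_day * 10**6 * retention_days * node_count
    per_pod_bytes = -(-total_bytes // replicas)
    size_gi = -(-per_pod_bytes // 2**30)
    return f"{size_gi}Gi"


# 7 days of retention on a hypothetical 100-node cluster.
print(vlstorage_size_gi(168, 100))  # prints 31Gi
```

With these inputs the estimate is well above the 8Gi default, which illustrates why the default suits only small clusters or short retention windows.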
telemetry_sinks > victoria_logs > retention_period
Conditional Mandatory
Log retention period in hours
Type: Integer (hours)
Accepted values: 24-8760 (1 day to 1 year)
Default: 168 (7 days)
Note: Retention is global and applies to all log streams uniformly. Deletion occurs asynchronously during background merge operations.
Note
VictoriaLogs does not return an error when log entries with timestamps outside the configured retention window are submitted. Log entries will be automatically removed from VictoriaLogs after the retention period.
telemetry_sinks > victoria_logs > additional_log_write_endpoints
Optional
Additional remote write endpoints for logs (optional)
Purpose: Send logs to external VictoriaLogs instances in addition to Omnia-managed VictoriaLogs
Format: List of endpoint objects with ‘url’ field (must start with http:// or https://)
TLS: Set ‘tls_insecure_skip_verify: true’ to skip TLS certificate verification
Default: [] (empty; only Omnia VictoriaLogs receives logs)
Example:
- url: https://external-logs-server:9481/internal/insert
  tls_insecure_skip_verify: false
telemetry_sinks > kafka > persistence_size
Conditional Mandatory
Storage per Kafka pod PVC
Total = persistence_size × 6 pods (3 brokers + 3 controllers)
Accepted values: Must be specified in the form of X[Ki|Mi|Gi|Ti|Pi|Ei]
Default value: 8Gi (results in 48Gi total storage)
The default 8Gi size is suitable for small clusters (typically fewer than 5 nodes). For larger clusters, increase the persistence_size and adjust log_retention_hours and log_retention_bytes based on expected data volume and cluster size.
Caution
Ensure that the Kafka broker settings persistence_size, log_retention_hours, and log_retention_bytes are configured based on your data retention requirements. If the persistent volume reaches its capacity before logs are deleted according to the configured log retention period, Kafka brokers may run out of disk space. For more details on managing Kafka log retention and cleanup policies, see Managing Kafka logs with delete and compact policies.
telemetry_sinks > kafka > log_retention_hours
Conditional Mandatory
Log retention period in hours
Default: 168 (7 days)
telemetry_sinks > kafka > log_retention_bytes
Conditional Mandatory
Maximum size of Kafka logs (in bytes) before deletion
Default: -1 (unlimited)
telemetry_sinks > kafka > log_segment_bytes
Conditional Mandatory
Maximum size of Kafka log segments (in bytes)
Default: 1073741824 (1 GB)
telemetry_sinks > kafka > topic_partitions
Conditional Mandatory
Topic partitions per source (object format, not array)
Format: {topic_name: partition_count}
Required when: Source has kafka in collection_targets
Allowed topics: idrac, ldms only
Default partition counts: idrac=1, ldms=2
Example: {idrac: 1, ldms: 2}
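The per-pod sizes for the telemetry sinks multiply out to the totals quoted above (for example, 8Gi across 6 Kafka pods, or across 3 vmstorage pods in cluster mode). A small sketch of that arithmetic follows. It assumes the X in X[Ki|Mi|Gi|Ti|Pi|Ei] is a whole number, and the helper name total_storage is hypothetical.

```python
import re

# Matches the documented form X[Ki|Mi|Gi|Ti|Pi|Ei]; X is assumed to be an integer.
QUANTITY = re.compile(r"(\d+)(Ki|Mi|Gi|Ti|Pi|Ei)")


def total_storage(per_pod, pod_count):
    """Multiply a per-pod PVC size by the pod count, keeping the unit."""
    match = QUANTITY.fullmatch(per_pod)
    if match is None:
        raise ValueError(f"expected X[Ki|Mi|Gi|Ti|Pi|Ei], got {per_pod!r}")
    value, unit = int(match.group(1)), match.group(2)
    return f"{value * pod_count}{unit}"


print(total_storage("8Gi", 6))  # Kafka: 3 brokers + 3 controllers, prints 48Gi
print(total_storage("8Gi", 3))  # VictoriaMetrics cluster mode, prints 24Gi
```

Running the same calculation with a candidate persistence_size shows how much capacity the storage backend must provide in total before the value is committed to telemetry_config.yml.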
ldms_configurations > agg_port
Conditional Mandatory
Aggregator port on service K8s cluster
Valid range: 6001-6100
Default: 6001
ldms_configurations > store_port
Conditional Mandatory
Store daemon port on service K8s cluster
The port can be the same as the LDMS aggregator port
Valid range: 6001-6100
Default: 6001
ldms_configurations > sampler_port
Conditional Mandatory
Sampler port on compute nodes
Valid range: 10001-10100
Default: 10001
ldms_configurations > sampler_plugins
Mandatory
Sampler plugins define which metrics to collect from compute nodes
Parameters:
plugin_name: Name of the LDMS sampler plugin
config_parameters: Plugin-specific configuration (as a single string)
activation_parameters: Collection schedule in microseconds
Format: interval=<microseconds> offset=<microseconds>
Example: interval=30000000 (30 seconds)
Default plugins:
meminfo: Memory usage statistics (free, used, buffers, cached)
procstat2: Process statistics (CPU, memory, I/O per process)
vmstat: Virtual memory statistics (paging, swapping, memory pressure)
loadavg: System load average (1, 5, and 15 minute averages)
procnetdev2: Network interface statistics (bytes, packets, errors, drops per interface)
Default activation_parameters: interval=30000000 (30 seconds for all plugins except procnetdev2, which includes offset=0)
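Because activation_parameters takes microseconds, it is easy to be off by a factor of a thousand when converting from seconds. The helper below is a hypothetical convenience, not part of Omnia or LDMS, that builds the string in the documented interval=<microseconds> offset=<microseconds> format.

```python
MICROSECONDS_PER_SECOND = 1_000_000


def activation_parameters(interval_seconds, offset_seconds=None):
    """Build an activation_parameters string from values given in seconds."""
    parts = [f"interval={round(interval_seconds * MICROSECONDS_PER_SECOND)}"]
    if offset_seconds is not None:
        parts.append(f"offset={round(offset_seconds * MICROSECONDS_PER_SECOND)}")
    return " ".join(parts)


print(activation_parameters(30))     # prints interval=30000000
print(activation_parameters(30, 0))  # prints interval=30000000 offset=0
```

The two printed strings correspond to the default 30-second interval without and with an explicit zero offset.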
powerscale_configurations > otel_collector_storage_size
Conditional Mandatory
PVC size for OTEL Collector metric batching and buffering
Accepted values: Must be specified in the form of X[Ki|Mi|Gi|Ti|Pi|Ei]
Default value: 5Gi
powerscale_configurations > csm_observability_values_file_path
Conditional Mandatory
Path to the CSM Observability (Karavi Observability) values.yaml file
Required when: telemetry_sources > powerscale > metrics_enabled: true
Default value: ""
Note
In the values.yaml file, only set karaviMetricsPowerscale -> enabled: true. Set the following parameters to false: karaviMetricsPowerflex -> enabled=false, karaviMetricsPowerstore -> enabled=false, karaviMetricsPowerscale authorization -> enabled=false, karaviMetricsPowermax -> enabled=false.
Note
Update the isiAuthType in the values.yaml file based on the current auth type setting. To check the current auth type setting, use the command isi http settings view.
Note
For CSI PowerScale health metrics, enable controller > healthMonitor > enabled: true and node > healthMonitor > enabled: true in the CSI PowerScale values.yaml (https://raw.githubusercontent.com/dell/helm-charts/csi-isilon-2.15.0/charts/csi-isilon/values.yaml).
ufm_configuration > ufm_endpoint
Conditional Mandatory
UFM appliance IP address or hostname
Required when: telemetry_sources > ufm > metrics_enabled is true
Example: 172.20.44.180 or ufm.example.com
Default value: ""
ufm_configuration > ufm_metrics_port
Optional
UFM Prometheus exporter port
Default value: 9001 (UFM default Prometheus port)
ufm_configuration > scrape_interval
Optional
Prometheus scrape interval for UFM metrics
Accepted values: Prometheus duration format (e.g., 15s, 30s, 1m)
Default value: 30s
ufm_configuration > scrape_timeout
Optional
Prometheus scrape timeout (must be <= scrape_interval)
Accepted values: Prometheus duration format (e.g., 10s, 15s)
Default value: 15s
ufm_configuration > tls_mode
Optional
TLS mode for connecting to UFM Prometheus endpoint
Accepted values: self_signed, ca_signed
self_signed: Skip TLS verification (insecure_skip_verify=true)
ca_signed: Use CA certificate for TLS verification
Default value: self_signed
ufm_configuration > ufm_ca_cert_path
Optional
Path to CA certificate file for UFM TLS verification
Required when: tls_mode is ca_signed
Must be a valid PEM-format certificate file
Default value: "" (empty; not used when tls_mode is self_signed)
ufm_configuration > auth_mode
Optional
Authentication mode for UFM Prometheus endpoint
Accepted values: basic, none
basic: Use ufm_username/ufm_password from omnia_config_credentials.yml
none: No authentication (UFM endpoint is open)
Default value: basic
vast_configuration > vast_endpoint
Conditional Mandatory
VAST appliance IP address or hostname
Required when: telemetry_sources > vast > metrics_enabled is true
Example: 172.20.44.180 or vast.example.com
Default value: ""
vast_configuration > vast_metrics_port
Optional
VAST Prometheus exporter port
Default value: 9001 (VAST default Prometheus port)
vast_configuration > scrape_interval
Optional
Prometheus scrape interval for VAST metrics
Accepted values: Prometheus duration format (e.g., 15s, 30s, 1m)
Default value: 30s
vast_configuration > scrape_timeout
Optional
Prometheus scrape timeout (must be <= scrape_interval)
Accepted values: Prometheus duration format (e.g., 10s, 15s)
Default value: 15s
vast_configuration > tls_mode
Optional
TLS mode for connecting to VAST Prometheus endpoint
Accepted values: self_signed, ca_signed
self_signed: Skip TLS verification (insecure_skip_verify=true)
ca_signed: Use CA certificate for TLS verification
Default value: self_signed
vast_configuration > vast_ca_cert_path
Optional
Path to CA certificate file for VAST TLS verification
Required when: tls_mode is ca_signed
Must be a valid PEM-format certificate file
Default value: "" (empty; not used when tls_mode is self_signed)
vast_configuration > auth_mode
Optional
Authentication mode for VAST Prometheus endpoint
Accepted values: basic, none
basic: Use vast_username/vast_password from omnia_config_credentials.yml
none: No authentication (VAST endpoint is open)
Default value: basic
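Both ufm_configuration and vast_configuration require scrape_timeout to be no larger than scrape_interval. The sketch below checks that relationship for a pair of duration strings. It is a simplified illustration: it handles only single-unit durations such as 15s or 1m, not compound forms such as 1h30m, and the function names are hypothetical.

```python
import re

# Seconds per unit for the single-unit durations this sketch understands.
UNIT_SECONDS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600}


def duration_seconds(text):
    """Convert a single-unit duration such as '15s' or '1m' to seconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h)", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


def scrape_settings_ok(scrape_interval, scrape_timeout):
    """True if scrape_timeout does not exceed scrape_interval."""
    return duration_seconds(scrape_timeout) <= duration_seconds(scrape_interval)


print(scrape_settings_ok("30s", "15s"))  # defaults, prints True
print(scrape_settings_ok("15s", "1m"))   # timeout too long, prints False
```

The first call uses the documented defaults (30s interval, 15s timeout), which satisfy the constraint.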
If you have any feedback about Omnia documentation, please reach out at omnia.readme@dell.com.