Configure Deployment Required for LDMS Telemetry

Using Omnia, you can deploy Lightweight Distributed Metric Service (LDMS) to collect in-band telemetry from Slurm clusters. The deployment includes installing LDMS producers on Slurm nodes, deploying LDMS aggregator and store components on Service Kubernetes nodes, and integrating LDMS with Kafka for downstream telemetry processing.

LDMS collects system metrics such as CPU, memory, network, I/O, and Slurm job statistics. The deployment includes these components:

  • LDMS producer (collector): Collects local system metrics and runs on Slurm controller, compute, and login nodes.

  • LDMS aggregator: Receives and aggregates metrics from producers. Runs as a Kubernetes pod.

  • LDMS store: Buffers and stores metric batches reliably. Runs as a Kubernetes pod.

  • Kafka broker: Handles telemetry streaming for consumption by downstream systems.

For more details on LDMS, see Lightweight Distributed Metric Service.

During deployment, Omnia attaches LDMS aggregator and store pods to the admin network. This configuration improves throughput between Slurm nodes and the Kubernetes cluster.

Prerequisites

  • Ensure that the discovery.yml playbook has been executed successfully with service_kube_control_plane and service_kube_node in the mapping file.

Steps

  1. Specify the following entries in the software_config.json. If any entry is missing, Omnia skips LDMS deployment and logs an informational message. For more information, see Input Parameters for Local Repositories.

{"name": "slurm_custom", "arch": ["x86_64","aarch64"]},
{"name": "service_k8s", "version": "1.34.1", "arch": ["x86_64"]},
{"name": "ldms", "arch": ["x86_64", "aarch64"]}
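As a quick pre-flight check, the sketch below reports which of the three required entries are absent from a parsed software_config.json. It assumes the entries sit in a top-level list named `softwares`; that key name and the helper `missing_ldms_prereqs` are illustrative assumptions, not part of Omnia.

```python
import json

# Names that must all be present for Omnia to deploy LDMS (from the step above).
REQUIRED = ("slurm_custom", "service_k8s", "ldms")


def missing_ldms_prereqs(config: dict) -> list[str]:
    """Return the required software names absent from the config.

    Assumption: entries are dicts with a "name" key under a top-level
    "softwares" list. Adjust the key if your file is laid out differently.
    """
    present = {entry.get("name") for entry in config.get("softwares", [])}
    return [name for name in REQUIRED if name not in present]


sample = json.loads("""
{"softwares": [
    {"name": "slurm_custom", "arch": ["x86_64", "aarch64"]},
    {"name": "service_k8s", "version": "1.34.1", "arch": ["x86_64"]}
]}
""")
print(missing_ldms_prereqs(sample))  # ['ldms']
```

An empty result means none of the three entries is missing, so Omnia would not skip LDMS deployment for that reason.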
  2. Ensure that the ldms.json file contains the following entries.

Note

If the LDMS RPM is not available, refer to Building LDMS Producer RPM Package for instructions on building LDMS RPMs.

The following ldms.json sample is for x86_64. For aarch64 architecture, update the repo name accordingly in the ldms.json file.

{
    "ldms": {
        "cluster": [
            {"package": "python3-devel", "type": "rpm", "repo_name": "x86_64_appstream"},
            {"package": "python3-cython", "type": "rpm", "repo_name": "x86_64_appstream"},
            {"package": "openssl-libs", "type": "rpm", "repo_name": "x86_64_baseos"},
            {"package": "ovis-ldms", "type": "rpm", "repo_name": "x86_64_ldms"}
        ]
    }
}
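One way to derive the aarch64 variant is to rewrite the repo names programmatically. The sketch below assumes that aarch64 repositories use the same names with only the architecture prefix changed (for example, `x86_64_appstream` becoming `aarch64_appstream`); confirm the actual repo names in your environment before relying on it. The helper `to_aarch64` is hypothetical.

```python
import copy


def to_aarch64(ldms_config: dict) -> dict:
    """Return a copy of an ldms.json structure with repo names retargeted.

    Assumption: aarch64 repositories mirror the x86_64 names with only the
    architecture prefix changed (x86_64_appstream -> aarch64_appstream).
    """
    result = copy.deepcopy(ldms_config)  # leave the caller's data untouched
    for pkg in result["ldms"]["cluster"]:
        repo = pkg["repo_name"]
        if repo.startswith("x86_64_"):
            pkg["repo_name"] = "aarch64_" + repo[len("x86_64_"):]
    return result


x86 = {"ldms": {"cluster": [
    {"package": "openssl-libs", "type": "rpm", "repo_name": "x86_64_baseos"},
    {"package": "ovis-ldms", "type": "rpm", "repo_name": "x86_64_ldms"},
]}}
print([p["repo_name"] for p in to_aarch64(x86)["ldms"]["cluster"]])
# ['aarch64_baseos', 'aarch64_ldms']
```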
  3. In local_repo_config.yml, specify the paths to the ovis-ldms RPMs under user_repo_url_x86_64 and user_repo_url_aarch64, as applicable.

  4. Configure omnia_config.yml:

    Table: omnia_config.yml. Each entry lists the variable, whether it is mandatory or optional, and its details.

    cluster_name

    Mandatory

    • Type: String

    • Name of the cluster on which you want to deploy Kubernetes.

    • This input is case-sensitive. Do not add any special characters except _ (underscore) in the cluster name.

    deployment

    Mandatory

    • Type: Boolean

    • Indicates whether Kubernetes is deployed.

    • Accepted values: true or false

    k8s_cni

    Mandatory

    • Type: String

    • Kubernetes SDN network.

    • Accepted values: calico

    • Default value: calico

    pod_external_ip_range

    Mandatory

    • Type: String

    • The load balancer uses these addresses to assign external IPs to Kubernetes services.

    • Ensure that the IP range provided is not assigned to any node in the cluster.

    • Sample values: 172.16.107.170-172.16.107.200
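The requirement that no node address falls inside pod_external_ip_range can be checked with Python's standard `ipaddress` module. This is a minimal sketch; the node addresses are hypothetical placeholders and the helper name `nodes_inside_range` is illustrative.

```python
import ipaddress


def nodes_inside_range(ip_range: str, node_ips: list[str]) -> list[str]:
    """Return the node IPs that fall inside a 'start-end' IPv4 range."""
    start_text, end_text = ip_range.split("-")
    start = ipaddress.ip_address(start_text.strip())
    end = ipaddress.ip_address(end_text.strip())
    if start > end:
        raise ValueError(f"range start {start} is after range end {end}")
    return [ip for ip in node_ips if start <= ipaddress.ip_address(ip) <= end]


# Hypothetical node addresses; replace with the IPs from your mapping file.
nodes = ["172.16.107.10", "172.16.107.180"]
print(nodes_inside_range("172.16.107.170-172.16.107.200", nodes))
# ['172.16.107.180']
```

Any address in the returned list conflicts with the range and should be moved, or the range should be adjusted.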

    k8s_service_addresses

    Optional

    • Type: String

    • Kubernetes internal network for services.

    • This network must be unused in your network infrastructure.

    • Default value: "10.233.0.0/18"

    k8s_pod_network_cidr

    Optional

    • Type: String

    • Kubernetes pod network CIDR for internal network. When used, it will assign IP addresses from this range to individual pods.

    • This network must be unused in your network infrastructure.

    • Default value: "10.233.64.0/18"
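To confirm that the service and pod networks are unused, each CIDR can be compared against the networks already present at the site. The sketch below uses `ipaddress.ip_network(...).overlaps(...)`; the list of existing networks is a made-up example and `overlapping_networks` is a hypothetical helper.

```python
import ipaddress


def overlapping_networks(candidate: str, in_use: list[str]) -> list[str]:
    """Return the networks from in_use that overlap the candidate CIDR."""
    net = ipaddress.ip_network(candidate)
    return [other for other in in_use if net.overlaps(ipaddress.ip_network(other))]


# Hypothetical networks already used in the site infrastructure.
existing = ["172.16.0.0/16", "10.233.32.0/19"]
print(overlapping_networks("10.233.0.0/18", existing))   # ['10.233.32.0/19']
print(overlapping_networks("10.233.64.0/18", existing))  # []
```

A non-empty result indicates a conflict, in which case a different range should be chosen for that variable.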

    csi_powerscale_driver_secret_file_path

    Optional

    • Type: File path

    • If you want to deploy the CSI driver for PowerScale on your service cluster, add the file path of the secrets.yaml file to this variable.

    csi_powerscale_driver_values_file_path

    Optional

    • Type: File path

    • If you want to deploy the CSI driver for PowerScale on your service cluster, add the file path of the values.yaml file to this variable.

    nfs_storage_name

    Mandatory

    • Type: String

    • Use the same name as one of the nfs_name values defined in storage_config.yml.

    k8s_crio_storage_size

    Mandatory

    • Type: String

    • Specifies the disk size allocated for CRI-O container storage.

  5. Ensure that telemetry_config.yml contains the entries required for LDMS and Kafka deployment.

    Note

    For LDMS telemetry configuration, at least one sampler plugin is mandatory to collect system metrics.

    Table: telemetry_config.yml. Each entry lists the parameter, whether it is mandatory or optional, and its details.

    idrac_telemetry_support

    Mandatory

    • Type: Boolean

    • If you want iDRAC telemetry support on your service cluster, set this variable to true before executing telemetry.yml and discovery.yml playbooks.

    • Accepted values: true or false

    • Default value: true

    Note

    If idrac_telemetry_support is set to true, the mysqldb_user, mysqldb_password, and mysqldb_root_password parameters in the omnia_config_credentials.yml file become mandatory.

    Note

    If you want to deploy only Slurm clusters (slurm_custom), idrac_telemetry_support must be set to false.

    idrac_telemetry_collection_type

    Mandatory

    • Specify where to store iDRAC telemetry data.

    • Supported values:
      • victoria: Store in VictoriaMetrics only

      • kafka: Store in Kafka only

      • victoria,kafka: Store in both (recommended)

    • Default: victoria,kafka

    victoria_configurations > deployment_mode

    Mandatory

    • Supported values:
      • single-node: Simple deployment (1 pod, suitable for dev/test)

      • cluster: High-availability deployment (7 pods, recommended for production)

    • Default: cluster

    • Cluster Mode Benefits:
      • High availability (no single point of failure)

      • Horizontal scalability (scale components independently)

      • Better performance (4x ingestion, 2x query speed)

      • Production-ready architecture

    • Single-Node Benefits:
      • Simple setup (fewer resources)

      • Suitable for small deployments (<10 nodes)

      • Lower resource usage (~4Gi memory vs ~10Gi for cluster)

    victoria_configurations > persistence_size

    Conditional Mandatory

    • The amount of storage allocated for each VictoriaMetrics persistent volume.

    • Important: Total VictoriaMetrics storage depends on deployment mode:
      • Single-node mode: Total storage = persistence_size * 1 pod

      • Cluster mode: Total storage = persistence_size * 3 vmstorage pods

    • Example (cluster): 8Gi * 3 = 24Gi total VictoriaMetrics storage

    • Accepted values: Must be specified in the form of X[Ki|Mi|Gi|Ti|Pi|Ei]

    • Default value: 8Gi (results in 24Gi total storage for cluster mode)
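The storage arithmetic above can be reproduced with a short calculation. This sketch assumes that X in `X[Ki|Mi|Gi|Ti|Pi|Ei]` is a whole number; `parse_size` and `victoria_total_bytes` are hypothetical helper names.

```python
# Binary suffixes accepted by persistence_size, mapped to a power of 1024.
UNITS = {"Ki": 1, "Mi": 2, "Gi": 3, "Ti": 4, "Pi": 5, "Ei": 6}

# Number of persistent volumes per deployment mode, as stated above.
VOLUMES = {"single-node": 1, "cluster": 3}


def parse_size(size: str) -> int:
    """Convert a string such as '8Gi' to bytes (whole-number X assumed)."""
    number, suffix = size[:-2], size[-2:]
    if suffix not in UNITS or not number.isdigit():
        raise ValueError(f"expected X[Ki|Mi|Gi|Ti|Pi|Ei], got {size!r}")
    return int(number) * 1024 ** UNITS[suffix]


def victoria_total_bytes(persistence_size: str, deployment_mode: str) -> int:
    """Total VictoriaMetrics storage = persistence_size * volume count."""
    return parse_size(persistence_size) * VOLUMES[deployment_mode]


total = victoria_total_bytes("8Gi", "cluster")
print(total // 1024 ** 3, "Gi")  # 24 Gi
```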

    victoria_configurations > retention_period

    Conditional Mandatory

    • Specify the number of hours to retain VictoriaMetrics data before it is deleted.

    • Default: 168 (7 days)

    kafka_configurations > persistence_size

    Conditional Mandatory

    • The amount of storage allocated for each Kafka persistent volume.

    • Important: Total Kafka storage = persistence_size * 6 pods (3 Kafka brokers and 3 Kafka controllers, each with its own persistence_size volume)

    • Example: 8Gi * 6 = 48Gi total Kafka storage

    • Accepted values: Must be specified in the form of X[Ki|Mi|Gi|Ti|Pi|Ei]

    • Default value: 8Gi (results in 48Gi total storage)

    • The default 8Gi size is suitable for small clusters (typically fewer than 5 nodes). For larger clusters, increase persistence_size and adjust log_retention_hours and log_retention_bytes based on the expected data volume and cluster size.

    Caution

    Ensure that the Kafka broker settings persistence_size, log_retention_hours, and log_retention_bytes are configured based on your data retention requirements. If the persistent volume reaches its capacity before logs are deleted according to the log retention period configured, Kafka brokers may run out of disk space. For more details on managing Kafka log retention and cleanup policies, see Managing Kafka logs with delete and compact policies.
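A rough sizing check can show whether one retention window of data is likely to fit on a volume. The sketch below is a deliberately simplified estimate: it treats a single broker volume in isolation and ignores replication, index files, and segment granularity. The ingest rates are invented example figures, and both helper names are hypothetical.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2
KAFKA_PODS = 6  # 3 brokers + 3 controllers, each with its own volume


def kafka_total_bytes(persistence_bytes: int) -> int:
    """Total Kafka storage across all six persistent volumes."""
    return persistence_bytes * KAFKA_PODS


def retention_fits(persistence_bytes: int, ingest_bytes_per_hour: int,
                   log_retention_hours: int) -> bool:
    """Rough check: does one retention window of data fit on one volume?"""
    return ingest_bytes_per_hour * log_retention_hours <= persistence_bytes


print(kafka_total_bytes(8 * GIB) // GIB)        # 48
print(retention_fits(8 * GIB, 100 * MIB, 168))  # False: about 16.4 GiB needed
print(retention_fits(8 * GIB, 40 * MIB, 168))   # True: about 6.6 GiB needed
```

If the check fails, either a larger persistence_size or a shorter log_retention_hours (or a finite log_retention_bytes) would be needed.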

    kafka_configurations > log_retention_hours

    Conditional Mandatory

    • Specify the number of hours to retain Kafka logs before they are deleted.

    • Default: 168 (7 days)

    kafka_configurations > log_retention_bytes

    Conditional Mandatory

    • Specify the maximum size of Kafka logs (in bytes) before they are deleted.

    • Default: -1 (unlimited)

    kafka_configurations > log_segment_bytes

    Conditional Mandatory

    • Specify the maximum size of a Kafka log segment file (in bytes). When a segment reaches this size, Kafka rolls over to a new segment.

    • Default: 1073741824 (1 GiB)

    kafka_configurations > topic_partitions

    Conditional Mandatory

    • Specify the partition counts for the following topics: idrac, ldms, and ome.

    • Default partition counts: idrac=1, ldms=2, ome=1

    • Example:

      topic_partitions:
        - name: "idrac"
          partitions: 1
        - name: "ldms"
          partitions: 2

    LDMS port configurations > ldms_agg_port

    Conditional Mandatory

    • Specify the aggregator port to be used on the service k8s cluster.

    • Valid range: 6001-6100

    • Default: 6001

    LDMS port configurations > ldms_store_port

    Conditional Mandatory

    • Specify the store daemon port to be used on the service k8s cluster.

    • This port can be the same as the LDMS aggregator port specified for ldms_agg_port.

    • Valid range: 6001-6100

    • Default: 6001

    LDMS port configurations > ldms_sampler_port

    Conditional Mandatory

    • Specify the sampler port to be used on the compute nodes.

    • Valid range: 10001-10100

    • Default: 10001
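The three port ranges can be validated before deployment. This is a minimal sketch; `invalid_ports` is a hypothetical helper, and the dictionary passed to it is an illustrative stand-in for the values read from telemetry_config.yml.

```python
# Valid ranges taken from the port descriptions above (upper bound inclusive).
PORT_RANGES = {
    "ldms_agg_port": range(6001, 6101),
    "ldms_store_port": range(6001, 6101),
    "ldms_sampler_port": range(10001, 10101),
}


def invalid_ports(ports: dict[str, int]) -> dict[str, int]:
    """Return the entries whose value is outside the documented range."""
    return {
        name: value
        for name, value in ports.items()
        if name in PORT_RANGES and value not in PORT_RANGES[name]
    }


print(invalid_ports({
    "ldms_agg_port": 6001,
    "ldms_store_port": 6001,      # may equal ldms_agg_port
    "ldms_sampler_port": 6001,    # outside 10001-10100
}))
# {'ldms_sampler_port': 6001}
```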

    ldms_sampler_configurations > meminfo

    Mandatory

    • Collects memory usage statistics (free, used, buffers, cached, etc.).

    • plugin_name: meminfo

    • config_parameters: ""

    • activation_parameters: interval=1000000 (memory metrics are collected every 1 second)

    • The interval unit is microseconds

    ldms_sampler_configurations > procstat2

    Mandatory

    • Collects process statistics (CPU, memory, I/O per process).

    • plugin_name: procstat2

    • config_parameters: ""

    • activation_parameters: interval=1000000

    • The interval unit is microseconds

    ldms_sampler_configurations > vmstat

    Mandatory

    • Collects virtual memory statistics (paging, swapping, memory pressure).

    • plugin_name: vmstat

    • config_parameters: ""

    • activation_parameters: interval=1000000

    • The interval unit is microseconds

    ldms_sampler_configurations > loadavg

    Mandatory

    • Collects system load average (1, 5, and 15 minute averages).

    • plugin_name: loadavg

    • config_parameters: ""

    • activation_parameters: interval=1000000

    • The interval unit is microseconds

    ldms_sampler_configurations > procnetdev2

    Mandatory

    • Collects network interface statistics (bytes, packets, errors, drops per interface).

    • The possible config parameters are:
      • ifaces=eth0,eth1: Specific interfaces to monitor

      • If not specified, all network interfaces will be monitored

    • plugin_name: procnetdev2

    • config_parameters: ""

    • activation_parameters: interval=1000000 offset=0

    • The interval unit is microseconds
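Because the sampler interval is given in microseconds, it is easy to misread by several orders of magnitude. The sketch below converts the `interval` value from an activation_parameters string to seconds and enforces the rule that at least one sampler plugin is configured. The in-memory list of samplers is a hypothetical representation of the entries described above, not the actual file schema.

```python
def interval_seconds(activation_parameters: str) -> float:
    """Extract interval=<microseconds> and convert it to seconds."""
    for token in activation_parameters.split():
        key, _, value = token.partition("=")
        if key == "interval":
            return int(value) / 1_000_000
    raise ValueError("activation_parameters has no interval=<microseconds>")


# Hypothetical in-memory form of two sampler entries described above.
samplers = [
    {"plugin_name": "meminfo", "config_parameters": "",
     "activation_parameters": "interval=1000000"},
    {"plugin_name": "procnetdev2", "config_parameters": "ifaces=eth0,eth1",
     "activation_parameters": "interval=1000000 offset=0"},
]

if not samplers:
    raise SystemExit("at least one sampler plugin is required")
for sampler in samplers:
    print(sampler["plugin_name"], interval_seconds(sampler["activation_parameters"]))
# meminfo 1.0
# procnetdev2 1.0
```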
