Prometheus exporter for an InfiniBand Fabric

guilbaults, updated 🕥 2022-04-28 15:30:37

Infiniband-Exporter

Prometheus exporter for an InfiniBand fabric. This exporter only needs to be installed on one server connected to the fabric; it collects the port statistics of all the switches.

Metrics are identified by type, port number, switch GUID and name. The remote connection of each port is also collected. Thus each metric represents a cable between 2 switches, or between a switch and a card in a server.

When a node name map file is provided, ibqueryerrors uses it to give switches more human-friendly names.

This exporter takes 3 seconds to collect the information of 60+ IB switches, and 900+ compute nodes. The information takes about 7.5MB in ASCII format for that fabric.

Grafana dashboard example

Requirements

  • python3
  • prometheus-client (needs to be installed with pip)
  • ibqueryerrors

Usage

Metrics are exported on the chosen HTTP port. Events such as counter resets are written to STDOUT.

```
usage: infiniband-exporter.py [-h] [--port PORT] [--can-reset-counter]
                              [--from-file INPUT_FILE]
                              [--node-name-map NODE_NAME_MAP]
                              [--ca_name CA_NAME] [--verbose]

Prometheus collector for a infiniband fabric

optional arguments:
  -h, --help            show this help message and exit
  --port PORT           Collector http port, default is 9683
  --can-reset-counter   Will reset counter as required when maxed out. Can
                        also be set with env variable CAN_RESET_COUNTER
  --from-file INPUT_FILE
                        Read a file containing the output of ibqueryerrors, if
                        left empty, ibqueryerrors will be launched as needed
                        by this collector
  --node-name-map NODE_NAME_MAP
                        Node name map used by ibqueryerrors. Can also be set
                        with env var NODE_NAME_MAP
  --ca_name CA_NAME     ibqueryerrors ca_name for different infiniband ports
  --verbose             increase output verbosity
```
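Since `--from-file` accepts saved ibqueryerrors output, it can help to see how the counters in such a file might be read. ibqueryerrors reports print each counter as a bracketed pair such as `[PortXmitWait == 2544]`, sometimes followed by a human-readable size. The following is a minimal sketch, assuming that format; `COUNTER_RE` and `parse_counters` are hypothetical names, not the exporter's own code.

```python
import re

# Hypothetical pattern for the "[Name == value]" pairs printed by ibqueryerrors.
# An optional human-readable suffix such as "(24.412MB)" is matched and ignored.
COUNTER_RE = re.compile(r"\[(\w+) == (\d+)(?: \([^)]*\))?\]")


def parse_counters(line):
    """Return a {counter_name: int} dict for every bracketed pair in the line."""
    return {name: int(value) for name, value in COUNTER_RE.findall(line)}


sample = (
    "GUID 0xc42a1030079989c port 1: [PortXmitWait == 2544] "
    "[PortXmitData == 6399401 (24.412MB)] [PortRcvPkts == 13514 (13.197K)]"
)
counters = parse_counters(sample)
print(counters)
```

A regular expression keeps the sketch short; a real parser would also need to track which node and port each line belongs to.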

Daemon configuration

When using the RPM, some parameters can be set in a file so systemd will pass them to the daemon (infiniband-exporter).

    cat /etc/sysconfig/infiniband-exporter.conf
    NODE_NAME_MAP=/etc/node-name-map
    CAN_RESET_COUNTER=TRUE
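The usage text says that both settings can come from either a flag or an environment variable. One way such a fallback could be wired up with argparse is sketched below. The function `parse_args` and the set of values treated as "true" are assumptions for illustration; the exporter may handle this differently.

```python
import argparse


def parse_args(argv, env):
    """Parse CLI flags, falling back to environment variables (hypothetical logic)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--node-name-map", default=env.get("NODE_NAME_MAP"))
    parser.add_argument("--can-reset-counter", action="store_true")
    args = parser.parse_args(argv)
    # Assumption: any of these spellings enables the reset; the real exporter
    # may accept a different set of values.
    if env.get("CAN_RESET_COUNTER", "").lower() in ("1", "true", "yes"):
        args.can_reset_counter = True
    return args


# Values as systemd would pass them from /etc/sysconfig/infiniband-exporter.conf
env = {"NODE_NAME_MAP": "/etc/node-name-map", "CAN_RESET_COUNTER": "TRUE"}
args = parse_args([], env)
print(args.node_name_map, args.can_reset_counter)
```

Passing the environment in as a dictionary, rather than reading `os.environ` directly, keeps the sketch easy to test.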

Metrics

InfiniBand exporter metrics are prefixed with "infiniband_".

Global

| Name | Description |
| --- | --- |
| scrape_ok | Indicates with a 1 if the scrape was successful and complete, otherwise 0. |
| scrape_duration_seconds | Number of seconds taken to collect and parse the stats. |
| ibqueryerrors_duration_seconds | Number of seconds taken to run ibqueryerrors. |

Errors from STDERR by ibqueryerrors

| Name | Labels | Description |
| --- | --- | --- |
| bad_status_error | path, status, error | Bad status error caught from STDERR by ibqueryerrors. |
| query_failed_error | counter_name, local_name, lid, port | Failed query caught from STDERR by ibqueryerrors. |
| mad_rpc_failed_error | portid | ibwarn_mad_rpc error caught from STDERR by ibqueryerrors. |
| query_cap_mask_error | counter_name, local_name, portid, port | ibwarn_query_cap_mask error caught from STDERR by ibqueryerrors. |
| print_error | counter_name, local_name, portid, port | ibwarn_print_error caught from STDERR by ibqueryerrors. |
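To illustrate how a STDERR line could map onto the `query_failed_error` labels, here is a minimal sketch. The sample line is taken from an issue report on this project. The regular expression and the helper `query_failed_labels` are assumptions, not the exporter's actual pattern.

```python
import re

# Hypothetical pattern; the exporter's own regular expression may differ.
QUERY_FAILED_RE = re.compile(
    r"ibwarn: \[\d+\] query_and_dump: (?P<counter_name>\S+) query failed on "
    r"(?P<local_name>.+), Lid (?P<lid>\d+) port (?P<port>\d+)"
)


def query_failed_labels(line):
    """Return the label dict for a 'query failed' warning, or None if no match."""
    match = QUERY_FAILED_RE.search(line)
    return match.groupdict() if match else None


line = (
    "ibwarn: [1203] query_and_dump: PortXmitDiscardDetails query failed "
    "on HOSTXXX, Lid 1051 port 1"
)
labels = query_failed_labels(line)
print(labels)
```

Label values stay as strings here because Prometheus labels are strings.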

Channel Adapter (CA) and Switches

For better readability, the counter metric names are shown here in upper camel case.
When exported, the names are lowercased and the suffix "_total" is appended.
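The naming rule can be written as a one-line function. This is only a sketch of the rule as described in this README, combined with the "infiniband_" prefix; `exported_name` is a hypothetical helper.

```python
def exported_name(counter, prefix="infiniband_"):
    """Lowercase an upper camel case counter name and add the prefix and suffix."""
    return f"{prefix}{counter.lower()}_total"


print(exported_name("LinkDownedCounter"))  # infiniband_linkdownedcounter_total
```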

Labels list:

  • component
  • local_name
  • local_guid
  • local_port
  • remote_guid
  • remote_port
  • remote_name

Error Counter

| Name | Description |
| --- | --- |
| LinkDownedCounter | Total number of times the Port Training state machine has failed the link error recovery process and downed the link. |
| SymbolErrorCounter | Total number of minor link errors detected on one or more physical lanes. |
| PortXmitConstraintErrors | Total number of packets not transmitted from the switch physical port. |
| PortMalformedPktErrors | Total number of malformed packets. |
| PortSwLifetimeLimitDiscards | Total number of lifetime limit discards. |
| PortXmitDiscards | Total number of outbound packets discarded by the port because the port is down or congested. |
| PortSwHOQLifetimeLimitDiscards | Total number of outbound packets discarded because they ran into a head-of-queue timeout. |
| PortBufferOverrunErrors | Total number of packets received on the port discarded due to buffer overrun. |
| PortLocalPhysicalErrors | Total number of packets received with a physical error such as a CRC error. |
| PortRcvRemotePhysicalErrors | Total number of packets marked with the EBP delimiter received on the port. |
| PortInactiveDiscards | Total number of packets discarded due to the port being in the inactive state. |
| PortDLIDMappingErrors | Total number of packets on the port that could not be forwarded by the switch due to DLID mapping errors. |
| LinkErrorRecoveryCounter | Total number of times the Port Training state machine has successfully completed the link error recovery process. |
| LocalLinkIntegrityErrors | The number of times that the count of local physical errors exceeded the threshold specified by LocalPhyErrors. |
| VL15Dropped | The number of incoming VL15 packets dropped due to resource limitations (for example, lack of buffers) in the port. |
| PortNeighborMTUDiscards | Total outbound packets discarded by the port because packet length exceeded the neighbor MTU. |
| PortRcvConstraintErrors | Total number of packets received on the port that are discarded for any of the following reasons: FilterRawInbound is true and packet is raw; PartitionEnforcementInbound is true and packet fails partition key check, IP version check, or transport header version check. |
| ExcessiveBufferOverrunErrors | The number of times that consecutive flow control update periods had at least one overrun error. |

Informative Counter

| Name | Description |
| --- | --- |
| PortXmitWait | The number of ticks during which the port had data to transmit but no data was sent during the entire tick (either because of insufficient credits or because of lack of arbitration). |
| PortXmitData | The total number of data octets, divided by 4 (counting in double words, 32 bits), transmitted on all VLs from the port. |
| PortRcvData | The total number of data octets, divided by 4 (counting in double words, 32 bits), received on all VLs from the port. |
| PortXmitPkts | Total number of packets transmitted on all VLs from this port. This may include packets with errors. |
| PortRcvPkts | Total number of packets received (this may include packets containing errors). |
| PortRcvErrors | Total number of packets containing an error that were received on the port. |
| PortUnicastXmitPkts | Total number of unicast packets transmitted on all VLs from the port. This may include unicast packets with errors. |
| PortUnicastRcvPkts | Total number of unicast packets, including unicast packets containing errors. |
| PortMulticastXmitPkts | Total number of multicast packets transmitted on all VLs from the port. This may include multicast packets with errors. |
| PortMulticastRcvPkts | Total number of multicast packets, including multicast packets containing errors. |
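Because PortXmitData and PortRcvData count 32-bit words rather than bytes, a dashboard has to multiply by 4 to get bytes. The sketch below shows that conversion and a simple average throughput between two samples. The helper names are hypothetical, and the throughput function assumes the counter was not reset between the two samples.

```python
def data_counter_to_bytes(words):
    """Convert a PortXmitData/PortRcvData value (32-bit words) to bytes."""
    return words * 4


def throughput_bytes_per_second(first, second, interval_seconds):
    """Average throughput between two counter samples taken interval_seconds apart.

    Assumes the counter did not wrap or get reset between the samples.
    """
    return data_counter_to_bytes(second - first) / interval_seconds


# 6399401 words is the PortXmitData value that ibqueryerrors prints as 24.412MB.
xmit_bytes = data_counter_to_bytes(6399401)
print(xmit_bytes)                    # 25597604
print(round(xmit_bytes / 2**20, 3))  # 24.412
```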

Informative Gauges

| Name | Description |
| --- | --- |
| speed | Link current speed per lane. |
| width | Lanes per link. |
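Since `speed` is reported per lane, the aggregate rate of a link is the product of the two gauges. A minimal sketch, assuming `speed` is expressed in Gbps and `width` is a plain lane count; `link_rate_gbps` is a hypothetical helper, and encoding overhead is ignored.

```python
def link_rate_gbps(speed_per_lane_gbps, width):
    """Aggregate link rate: per-lane speed multiplied by the number of lanes."""
    return speed_per_lane_gbps * width


# A link reported by ibqueryerrors as "4X 53.125 Gbps"
print(link_rate_gbps(53.125, 4))  # 212.5
```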

Example

```
# HELP infiniband_linkdownedcounter_total Total number of times the Port Training state machine has failed the link error recovery process and downed the link.
# TYPE infiniband_linkdownedcounter_total counter
infiniband_linkdownedcounter_total{component="switch",local_guid="0x506b4b03005d3101",local_name="switch1",local_port="2",remote_guid="0x506b4b0300e5e461",remote_name="node1 mlx5_0",remote_port="1"} 1.0
infiniband_linkdownedcounter_total{component="switch",local_guid="0x506b4b03005d3101",local_name="switch1",local_port="3",remote_guid="0x506b4b0300c35b61",remote_name="node2 mlx5_0",remote_port="1"} 1.0
infiniband_linkdownedcounter_total{component="ca",local_guid="0x506b4b0300e5e461",local_name="node1.mlx5_0",local_port="1",remote_guid="0x506b4b03005d3101",remote_name="SwitchX - Mellanox Technologies",remote_port="2"} 1.0
[...]
# HELP infiniband_portrcvdata_total Total number of data octets, divided by 4 (lanes), received on all VLs.
# TYPE infiniband_portrcvdata_total counter
infiniband_portrcvdata_total{component="switch",local_guid="0x506b4b03005d3101",local_name="switch1",local_port="2",remote_guid="0x506b4b0300e5e461",remote_name="node1 mlx5_0",remote_port="1"} 5.149057134655e+012
infiniband_portrcvdata_total{component="switch",local_guid="0x506b4b03005d3101",local_name="switch1",local_port="3",remote_guid="0x506b4b0300c35b61",remote_name="node2 mlx5_0",remote_port="1"} 6.051662505593e+012
```
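For quick checks outside Prometheus, a single sample line can be split into its metric name, labels and value. This is a minimal sketch rather than a full parser for the exposition format: `parse_sample` is a hypothetical helper, and it does not handle escaped quotes inside label values.

```python
import re

# One labelled sample: name{label="value",...} value
SAMPLE_RE = re.compile(r"^(?P<name>\w+)\{(?P<labels>.*)\}\s+(?P<value>\S+)$")
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_sample(line):
    """Split one labelled sample line into (name, labels dict, float value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a labelled sample: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels")))
    return match.group("name"), labels, float(match.group("value"))


line = (
    'infiniband_linkdownedcounter_total{component="switch",'
    'local_guid="0x506b4b03005d3101",local_name="switch1",local_port="2",'
    'remote_guid="0x506b4b0300e5e461",remote_name="node1 mlx5_0",'
    'remote_port="1"} 1.0'
)
name, labels, value = parse_sample(line)
print(name, labels["local_name"], labels["remote_name"], value)
```

Together, the local and remote labels identify both ends of the cable that the sample describes.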

Issues

Nodes without errors are not reported

opened on 2022-01-25 13:58:06 by jbd

Hello,

I found that some nodes were missing from my Grafana panels. I've traced this to the behavior of ibqueryerrors, which does not report node information if it's not a "bad" node (a node with errors).

For example, here is the report for a node without errors:

```
ibqueryerrors --verbose --details --data --report-port --switch --ca --threshold-file ./error_thresholds -G 0xb8cef60300a1d92a

Summary: 1 nodes checked, 0 bad nodes found
1 ports checked, 0 ports have errors beyond threshold
Thresholds: [SymbolErrorCounter = 0][LinkErrorRecoveryCounter = 0][LinkDownedCounter = 0][PortRcvErrors = 0][PortRcvRemotePhysicalErrors = 0][PortRcvSwitchRelayErrors = 0][PortXmitDiscards = 0][PortXmitConstraintErrors = 0][PortRcvConstraintErrors = 0][LocalLinkIntegrityErrors = 0][ExcessiveBufferOverrunErrors = 0][VL15Dropped = 0][PortXmitWait = 0]
Suppressed:
```

And the report for a 'bad' node:

```
ibqueryerrors --verbose --details --data --report-port --switch --ca --threshold-file ./error_thresholds -G 0x0c42a1030079989c

Errors for "maestro-3002 HCA-1"
   GUID 0xc42a1030079989c port 1: [PortXmitWait == 2544] [PortXmitData == 6399401 (24.412MB)] [PortRcvData == 1758872 (6.710MB)] [PortXmitPkts == 13959 (13.632K)] [PortRcvPkts == 13514 (13.197K)] [PortUnicastXmitPkts == 13959 (13.632K)] [PortUnicastRcvPkts == 13514 (13.197K)]
   Link info: 155 1[ ] ==( 4X 53.125 Gbps Active/ LinkUp)==> [ ] "" ( )

Summary: 1 nodes checked, 1 bad nodes found
1 ports checked, 1 ports have errors beyond threshold
Thresholds: [SymbolErrorCounter = 0][LinkErrorRecoveryCounter = 0][LinkDownedCounter = 0][PortRcvErrors = 0][PortRcvRemotePhysicalErrors = 0][PortRcvSwitchRelayErrors = 0][PortXmitDiscards = 0][PortXmitConstraintErrors = 0][PortRcvConstraintErrors = 0][LocalLinkIntegrityErrors = 0][ExcessiveBufferOverrunErrors = 0][VL15Dropped = 0][PortXmitWait = 0]
Suppressed:
```

Indeed, the 'good' node does not report any errors at the moment:

```
perfquery -G 0xb8cef60300a1d92a 1

Port counters: Lid 160 port 1 (CapMask: 0x5A00)
PortSelect:......................1
CounterSelect:...................0x0000
SymbolErrorCounter:..............0
LinkErrorRecoveryCounter:........0
LinkDownedCounter:...............0
PortRcvErrors:...................0
PortRcvRemotePhysicalErrors:.....0
PortRcvSwitchRelayErrors:........0
PortXmitDiscards:................0
PortXmitConstraintErrors:........0
PortRcvConstraintErrors:.........0
CounterSelect2:..................0x00
LocalLinkIntegrityErrors:........0
ExcessiveBufferOverrunErrors:....0
QP1Dropped:......................0
VL15Dropped:.....................0
PortXmitData:....................14804777
PortRcvData:.....................4168543
PortXmitPkts:....................32281
PortRcvPkts:.....................31220
PortXmitWait:....................0
```

In that case, I guess infiniband-exporter.py cannot extract information for this node. I can see the equivalent information from the other side of the link, using remote_name, so I can work around it if I really need to retrieve the values. But it somewhat breaks the global view of the fabric I've built in Grafana, since I can miss nodes without errors.

Maybe I've missed something? If not, do you have a suggestion?

ModuleNotFoundError: No module named 'prometheus_client'

opened on 2021-08-11 19:00:09 by gbeyer3

Hello, I'm getting an error when attempting to start the infiniband exporter service:

    Started Infiniband_exporter.
    Aug 11 14:20:42 \ python3[374627]: Traceback (most recent call last):
    Aug 11 14:20:42 \ python3[374627]: File "/usr/bin/infiniband_exporter.py", line 12, in <module>
    Aug 11 14:20:42 \ python3[374627]: from prometheus_client.core import CounterMetricFamily, GaugeMetricFamily
    Aug 11 14:20:42 \ python3[374627]: ModuleNotFoundError: No module named 'prometheus_client'
    Aug 11 14:20:42 \ systemd[1]: infiniband_exporter.service: main process exited, code=exited, status=1/FAILURE
    Aug 11 14:20:42 \ systemd[1]: Unit infiniband_exporter.service entered failed state.
    Aug 11 14:20:42 \ systemd[1]: infiniband_exporter.service failed.

Is there a dependency that I am missing (CounterMetricFamily, GaugeMetricFamily)? If so, where do I get it?

I'm using v0.0.4.

Thanks

Processing of query failed errors from STDERR retrieved by ibqueryerrors

opened on 2021-06-23 14:05:09 by gabrieleiannetti

We also see errors on STDERR from ibqueryerrors indicating that collecting metrics for specific CAs is failing.

In the end the errors could be collected as we did with the bad status errors.
Probably something like a query failed error metric...

The exporter prints the following right now:
2021-06-23 14:49:39,353 - ERROR - Could not process line from STDERR: ibwarn: [1203] query_and_dump: PortXmitDiscardDetails query failed on HOSTXXX, Lid 1051 port 1 ...

query_smp.c:199; Connection timed out

opened on 2021-03-31 14:31:36 by MGlants

After commit c6eef51246a91ccf81f2364c45ca7dfc9d266fc7 I got:

    src/query_smp.c:199; umad (DR path slid 0; dlid 0; 0,1,10,19 Attr 0x11:0) bad status 110; Connection timed out
    infiniband_scrape_ok 0.0

Releases

infiniband-exporter v0.0.6 2022-01-20 19:05:51

Adding PortXmitConstraintErrors, PortMalformedPktErrors and PortSwLifetimeLimitDiscards

infiniband-exporter v0.0.5 2021-11-22 19:17:04

Implement PortSwHOQLifetimeLimitDiscards metric

infiniband-exporter v0.0.4 2021-08-02 14:34:08

  • Deprecated python2
  • Adding client HCA stats
  • Adding scrape errors handling

infiniband-exporter v0.0.3 2021-04-09 14:42:53

  • Adding ca_name option
  • Adding a real logging output instead of print()
  • Adding scrape duration and status
  • Detect when ibqueryerrors is not executable

infiniband-exporter v0.0.2 2020-03-30 18:25:30

Fixing counter reset using python subprocess.Popen()

infiniband-exporter v0.0.1 2020-03-27 17:54:21

Initial RPM release.

Simon Guilbault

prometheus-exporter infiniband-monitoring hpc-clusters