
ixgbe driver reports "Detected Tx Unit Hang" for a network interface and network connectivity is lost

Posted: 2020/04/22 17:40:18
by zvioloni
Here is the history of the case:

In February 2020 we lost communication on network interface eno2, which is used to communicate with devices. After going through the logs, we found that the network interface had been reset because of a Tx hang. Here is an excerpt from the log pointing to the issue:

messages:Feb 11 15:01:22 pnc kernel: ixgbe 0000:65:00.1 eno2: Detected Tx Unit Hang #012 Tx Queue <17>#012 TDH, TDT <13c>, <1ce>#012 next_to_use <13e>#012 next_to_clean <13c>#012tx_buffer_info[next_to_clean]#012 time_stamp <12992c627>#012 jiffies <12992dc55>
messages:Feb 11 15:01:22 pnc kernel: ixgbe 0000:65:00.1 eno2: tx hang 1 detected on queue 17, resetting adapter
messages:Feb 11 15:01:22 pnc kernel: ixgbe 0000:65:00.1 eno2: initiating reset due to tx timeout
messages:Feb 11 15:01:22 pnc kernel: ixgbe 0000:65:00.1 eno2: Reset adapter
messages:Feb 11 15:01:25 pnc kernel: ixgbe 0000:65:00.1 eno2: NIC Link is Up 100 Mbps, Flow Control: None
messages:Feb 11 15:01:25 pnc NetworkManager[1231]: <info> [1581433285.1667] device (eno2): carrier: link connected
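For reading the "Detected Tx Unit Hang" line: the `#012` sequences appear to be escaped newlines written by the syslog daemon, and `time_stamp` and `jiffies` are hexadecimal jiffies counters. The sketch below (the helper name `tx_hang_age` is made up for illustration) prints the difference between the two, i.e. how long the buffer at `next_to_clean` had been pending when the hang was reported. It assumes the log line format shown above.

```shell
# Read syslog lines on stdin; for every "Detected Tx Unit Hang" line, print
# jiffies - time_stamp (both are hex values in the ixgbe message).
# tx_hang_age is a hypothetical helper name, not part of any tool.
tx_hang_age() {
    sed -n 's/.*time_stamp <\([0-9a-fA-F]*\)>.*jiffies <\([0-9a-fA-F]*\)>.*/\1 \2/p' |
    while read -r stamp now; do
        # Shell arithmetic accepts hex constants with a 0x prefix.
        echo $(( 0x$now - 0x$stamp ))
    done
}

# Example on the affected host (log path is an assumption):
#   grep 'Detected Tx Unit Hang' /var/log/messages | tx_hang_age
```

For the February 11 line this gives 5678 jiffies. Assuming the kernel is built with HZ=1000 (an assumption that should be checked against the kernel config), that is roughly 5.7 seconds.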

After searching the Red Hat knowledge base, I found this solution:
With kernels older than 3.10.0-514.el7, the problem was reported to stop occurring when the Scatter-Gather offload engine on the affected interface was disabled via ethtool:
# ethtool -K <interface> sg off
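To confirm that the workaround is actually in effect, the current offload state can be read back with `ethtool -k` (lowercase). This matters because settings made with `ethtool -K` are runtime settings and are not kept across a reboot unless they are configured persistently. A minimal sketch follows, assuming the usual `ethtool -k` output layout with a top-level `scatter-gather:` line; `sg_state` is a hypothetical helper name.

```shell
# Read `ethtool -k <interface>` output on stdin and print the state of the
# top-level scatter-gather feature (for example "on" or "off").
# sg_state is a hypothetical helper name, not an ethtool option.
sg_state() {
    # Indented sub-features such as tx-scatter-gather do not match, because
    # their first field carries leading whitespace.
    awk -F': *' '$1 == "scatter-gather" { print $2 }'
}

# Example on the affected host:
#   ethtool -k eno2 | sg_state
```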

After applying this workaround, we still had occurrences of the same kind.

One detail that was missed is that kernels newer than the one mentioned above contain backported changes that make them vulnerable to the same issue even with the workaround applied.

We had kernel version 3.10.0-1062.1.2.el7.x86_64, which is susceptible to the error:

• With RHEL 7.6 (kernel-3.10.0-957.el7) there have been new reports of `ixgbe: Detected Tx Unit Hang'. It is believed that at least some of these hangs are caused by an issue introduced by the following upstream commit which was backported to RHEL 7.6:
o ixgbe: Update adaptive ITR algorithm
• A followup upstream commit ixgbe: Prevent u8 wrapping of ITR value to something less than 10us is believed to resolve any problem introduced by ixgbe: Update adaptive ITR algorithm. This fix has been backported to kernel-3.10.0-1062.7.1.el7 and newer kernels.
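Whether a given RHEL 7 kernel is at or above the release quoted above (kernel-3.10.0-1062.7.1.el7) can be checked with a version-aware sort. This is only a sketch: `kernel_has_itr_fix` is a hypothetical helper name, it assumes GNU `sort -V` is available, and it only makes sense for RHEL 7 (3.10.0-*.el7) kernel strings.

```shell
# Return success if the given RHEL 7 kernel release is at least
# 3.10.0-1062.7.1, the first release quoted as carrying the ITR fix.
# kernel_has_itr_fix is a hypothetical helper name.
kernel_has_itr_fix() {
    fixed="3.10.0-1062.7.1"
    # Strip the ".el7..." suffix, e.g. 3.10.0-1062.12.1.el7.x86_64 -> 3.10.0-1062.12.1
    ver="${1%%.el*}"
    # With a version sort, the fixed release sorts first only when ver >= fixed.
    lowest=$(printf '%s\n' "$fixed" "$ver" | sort -V | head -n 1)
    [ "$lowest" = "$fixed" ]
}

# Example on the affected host:
#   kernel_has_itr_fix "$(uname -r)" && echo "ITR fix should be present"
```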
We updated the system (yum update), and the new kernel version is now 3.10.0-1062.12.1.el7.x86_64, which should no longer be susceptible to Tx hangs. However, we recorded another instance of the same hang even with the newer kernel:

Apr 14 21:35:03 pnc kernel: ixgbe 0000:65:00.1 eno2: Detected Tx Unit Hang #012 Tx Queue <15>#012 TDH, TDT <159>, <30>#012 next_to_use <15b>#012 next_to_clean <159>#012tx_buffer_info[next_to_clean]#012 time_stamp <13380fce2>#012 jiffies <133810e79>
Apr 14 21:35:03 pnc kernel: ixgbe 0000:65:00.1 eno2: tx hang 1 detected on queue 15, resetting adapter
Apr 14 21:35:03 pnc kernel: ixgbe 0000:65:00.1 eno2: initiating reset due to tx timeout
Apr 14 21:35:03 pnc kernel: ixgbe 0000:65:00.1 eno2: Reset adapter
Apr 14 21:35:05 pnc kernel: ixgbe 0000:65:00.1 eno2: NIC Link is Up 100 Mbps, Flow Control: None
Apr 14 21:35:05 pnc NetworkManager[1228]: <info> [1586900105.5074] device (eno2): carrier: link connected

There were no other changes in the configuration of either the system or the network.

Can anybody provide any advice on what to do next?