Understand "Hardware error from APEI Generic Hardware Error Source: 1"


Post by lightingghost » 2021/06/13 23:38:40

My server, running CentOS 8, randomly hangs after 2 to 3 days of uptime. The vmcore-dmesg shows that the kernel panics because of the following error:

[210684.261133] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[210684.261134] {2}[Hardware Error]: event severity: fatal
[210684.261135] {2}[Hardware Error]:  Error 0, type: fatal
[210684.261135] {2}[Hardware Error]:   section_type: PCIe error
[210684.261135] {2}[Hardware Error]:   port_type: 4, root port
[210684.261136] {2}[Hardware Error]:   version: 3.0
[210684.261136] {2}[Hardware Error]:   command: 0x0547, status: 0x4010
[210684.261136] {2}[Hardware Error]:   device_id: 0000:16:01.0
[210684.261137] {2}[Hardware Error]:   slot: 82
[210684.261137] {2}[Hardware Error]:   secondary_bus: 0x18
[210684.261137] {2}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x2031
[210684.261138] {2}[Hardware Error]:   class_code: 000406
[210684.261138] {2}[Hardware Error]:   bridge: secondary_status: 0x0000, control: 0x0013
[210684.261139] {2}[Hardware Error]:   aer_uncor_status: 0x00000020, aer_uncor_mask: 0x00100000
[210684.261139] {2}[Hardware Error]:   aer_uncor_severity: 0x00062030
[210684.261139] {2}[Hardware Error]:   TLP Header: 00000000 00000000 00000000 00000000
[210684.261140] Kernel panic - not syncing: Fatal hardware error!
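
As a reading aid, the three aer_uncor_* values in the log above can be decoded bit by bit. The sketch below assumes the bit layout of the AER Uncorrectable Error Status register from the PCI Express Base Specification (the mask and severity registers share that layout). The table and the function name `decode_aer_uncor` are hypothetical helpers written for this post, not part of any library.

```python
# Bit positions of the AER Uncorrectable Error Status register.
# Assumption: layout taken from the PCI Express Base Specification;
# the mask and severity registers use the same bit positions.
AER_UNCOR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
    22: "Uncorrectable Internal Error",
}


def decode_aer_uncor(value):
    """Return the names of all bits set in value, lowest bit first."""
    names = []
    for bit in range(32):
        if value & (1 << bit):
            # Bits missing from the table are reported by number.
            names.append(AER_UNCOR_BITS.get(bit, f"bit {bit} (not in table)"))
    return names


if __name__ == "__main__":
    # Values copied from the kernel log above.
    logged = [
        ("aer_uncor_status", 0x00000020),
        ("aer_uncor_mask", 0x00100000),
        ("aer_uncor_severity", 0x00062030),
    ]
    for label, value in logged:
        print(f"{label} = {value:#010x}: {', '.join(decode_aer_uncor(value))}")
```

Under that assumed layout, the status value 0x00000020 has only bit 5 set (Surprise Down Error), and the same bit is set in aer_uncor_severity, which is consistent with the event being reported as fatal. This only names the error class; it does not by itself say whether the cause is the device, the board, or software.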

lspci shows the following for device 16:01.0:

$ lspci -s 16:01.0 -vv
16:01.0 PCI bridge: Intel Corporation Sky Lake-E PCI Express Root Port B (rev 02) (prog-if 00 [Normal decode])
        Physical Slot: 82
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 29
        NUMA node: 0
        Bus: primary=16, secondary=18, subordinate=1b, sec-latency=0
        I/O behind bridge: [disabled]
        Memory behind bridge: 97700000-97afffff [size=4M]
        Prefetchable memory behind bridge: 0000000092000000-00000000972fffff [size=83M]
        Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ <SERR- <PERR-
        BridgeCtl: Parity+ SERR+ NoISA- VGA- VGA16+ MAbort- >Reset- FastB2B-
                PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
        Capabilities: <access denied>
        Kernel driver in use: pcieport
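
The command and status values in the kernel log (0x0547 and 0x4010) are the same registers that lspci prints as Control and Status, so they can be compared directly. The sketch below assumes the standard PCI configuration-space bit positions and reuses the flag names lspci prints; the tables and the function name `set_flags` are hypothetical helpers, not part of any library.

```python
# PCI Command register bits, labelled with the names lspci prints.
# Assumption: standard PCI configuration-space layout.
PCI_COMMAND_BITS = {
    0: "I/O",
    1: "Mem",
    2: "BusMaster",
    3: "SpecCycle",
    4: "MemWINV",
    5: "VGASnoop",
    6: "ParErr",
    7: "Stepping",
    8: "SERR",
    9: "FastB2B",
    10: "DisINTx",
}

# PCI Status register bits (the two-bit DEVSEL timing field in bits 9-10
# is not a flag and is left out).
PCI_STATUS_BITS = {
    3: "INTx",
    4: "Cap",
    5: "66MHz",
    6: "UDF",
    7: "FastB2B",
    8: "ParErr",
    11: ">TAbort",
    12: "<TAbort",
    13: "<MAbort",
    14: ">SERR",
    15: "<PERR",
}


def set_flags(value, table):
    """Return the names of the flags in table whose bit is set in value."""
    return [table[bit] for bit in sorted(table) if value & (1 << bit)]


if __name__ == "__main__":
    # Values copied from the kernel log.
    print("command 0x0547:", set_flags(0x0547, PCI_COMMAND_BITS))
    print("status  0x4010:", set_flags(0x4010, PCI_STATUS_BITS))
```

With these tables, command 0x0547 gives the same enabled flags as the Control line above (I/O, Mem, BusMaster, ParErr, SERR, DisINTx). Status 0x4010 has Cap and >SERR set, whereas the lspci output above, captured at a different time, shows >SERR-.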

$ lspci -s 16:01.0 -tvv
0000:16:01.0-[18-1b]----00.0-[19-1b]----03.0-[1a-1b]--+-00.0  Intel Corporation Ethernet Connection X722 for 1GbE
                                                      +-00.1  Intel Corporation Ethernet Connection X722 for 1GbE
                                                      +-00.2  Intel Corporation Ethernet Connection X722 for 1GbE
                                                      \-00.3  Intel Corporation Ethernet Connection X722 for 1GbE

My question is: is this likely a problem with my motherboard or CPU, or is the problem in the kernel?

Any help will be appreciated.
