2. Hardware Corrected Errors
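"Hardware error" messages were reported on the server's serial console, and the root cause turned out to be a faulty memory module (DIMM). The machine could boot the OS normally, but rebooted on its own after running for a while.
The reboot is recorded in the messages log (corrected-error reports, followed by a fresh boot):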
May 13 15:05:20 hisilicon11 kernel: EDAC MC0: 1 CE reserved error (16) on unknown label (node:0 rank:0 bank:0 row:0 col:0 page:0x400000 offset:0x300 grain:0 syndrome:0x0)
May 13 15:05:20 hisilicon11 kernel: EDAC MC0: 1 CE reserved error (16) on unknown label (node:0 rank:0 bank:0 row:0 col:0 page:0x400000 offset:0x300 grain:0 syndrome:0x0)
May 13 15:05:20 hisilicon11 kernel: EDAC MC0: 1 CE reserved error (16) on unknown label (node:0 rank:0 bank:0 row:0 col:0 page:0x400000 offset:0x300 grain:0 syndrome:0x0)
May 13 15:05:21 hisilicon11 kernel: EDAC MC0: 1 CE reserved error (16) on unknown label (node:0 rank:0 bank:0 row:0 col:0 page:0x400000 offset:0x300 grain:0 syndrome:0x0)
May 13 15:05:21 hisilicon11 kernel: EDAC MC0: 1 CE reserved error (16) on unknown label (node:0 rank:0 bank:0 row:0 col:0 page:0x400000 offset:0x300 grain:0 syndrome:0x0)
May 13 15:05:21 hisilicon11 kernel: EDAC MC0: 1 CE reserved error (16) on unknown label (node:0 rank:0 bank:0 row:0 col:0 page:0x400000 offset:0x300 grain:0 syndrome:0x0)
May 5 18:18:46 hisilicon11 journal: Runtime journal is using 8.0M (max allowed 4.0G, trying to leave 4.0G free of 255.5G available → current limit 4.0G).
May 5 18:18:46 hisilicon11 kernel: Booting Linux on physical CPU 0x0000080000 [0x481fd010]
May 5 18:18:46 hisilicon11 kernel: Linux version 4.19.28.3-2019-05-13 (lixianfa@ubuntu) (gcc version 5.4.0 20160609 (Ubuntu/Linaro 5.4.0-6ubuntu1~16.04.10)) #2 SMP Mon May 13 10:20:47 CST 2019
May 5 18:18:46 hisilicon11 kernel: efi: Getting EFI parameters from FDT:
May 5 18:18:46 hisilicon11 kernel: efi: EFI v2.70 by EDK II
May 5 18:18:46 hisilicon11 kernel: efi: SMBIOS 3.0=0x3f0f0000 ACPI 2.0=0x39cb0000 MEMATTR=0x3b4bc018 ESRT=0x3f11bc98 RNG=0x3f11bd98 MEMRESERVE=0x39bb4d18
May 5 18:18:46 hisilicon11 kernel: efi: seeding entropy pool
May 5 18:18:46 hisilicon11 kernel: esrt: Reserving ESRT space from 0x000000003f11bc98 to 0x000000003f11bcd0.
May 5 18:18:46 hisilicon11 kernel: crashkernel: memory value expected
May 5 18:18:46 hisilicon
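The page and offset fields of the EDAC lines above can be combined back into a physical address, which makes it easier to compare the OS report with the firmware logs. The following is a minimal sketch, not part of the original procedure. It assumes a 64 KiB kernel page size (PAGE_SHIFT = 16); the log does not state this, but it is the shift that reproduces the physical_address 0x4000000300 reported elsewhere in this section. The function name edac_phys_addr is hypothetical.

```python
import re

# Matches the controller index, event count, page and offset of an EDAC CE line.
EDAC_RE = re.compile(
    r"EDAC MC(\d+): (\d+) CE .*page:(0x[0-9a-fA-F]+) offset:(0x[0-9a-fA-F]+)"
)

# Assumption: 64 KiB pages (PAGE_SHIFT = 16). Use 12 on a 4 KiB-page kernel.
PAGE_SHIFT = 16


def edac_phys_addr(line, page_shift=PAGE_SHIFT):
    """Return (mc_index, count, physical_address), or None if the line does not match."""
    m = EDAC_RE.search(line)
    if m is None:
        return None
    mc, count, page, offset = m.groups()
    # physical address = page frame number * page size + offset within the page
    addr = (int(page, 16) << page_shift) | int(offset, 16)
    return int(mc), int(count), addr


if __name__ == "__main__":
    sample = (
        "May 13 15:05:20 hisilicon11 kernel: EDAC MC0: 1 CE reserved error (16) "
        "on unknown label (node:0 rank:0 bank:0 row:0 col:0 "
        "page:0x400000 offset:0x300 grain:0 syndrome:0x0)"
    )
    mc, count, addr = edac_phys_addr(sample)
    print(f"mc{mc}: {count} CE at {addr:#x}")  # prints "mc0: 1 CE at 0x4000000300"
```

If the computed address does not match the physical_address printed by the kernel, the page-size assumption is probably wrong for that system.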
The BIOS boot log prints NOTICE messages for the correctable error:
NOTICE: [TotemRasIntMemoryNodeFhi]:[197L]
NOTICE: [MemoryErrorFillInHest]:[245L]ErrorType is CE, ErrorSeverity is CORRECTED. # corrected error
NOTICE: [IsMemoryError]:[156L]Ierr = 0xf
NOTICE: RASC socket[0]die[3]channel[3] # DIMM location
NOTICE: [GetMemoryErrorDataErrorType]:[103L]Ierr = 0xf
NOTICE: RASC H[0]L[0]
NOTICE: PlatData R[0]B[0] R[0]C[0]
NOTICE: [CollectArerErrorData]:[226L]SysAddr=4000000300: # physical address
NOTICE: [HestGhesV2ResetAck]:[84L] I[2] CeValid[0]
NOTICE: [HestGhesV2ResetAck]:[84L] Index 2
NOTICE: count[0] Severity[2] CeValid[0]
NOTICE: [HestGhesV2SetGenericErrorData]:[163L] Fill in HEST TABLE ,AckRegister=44010050
NOTICE: [HestNotifiedOS]:[37L]
NOTICE: [TotemRasIntM = 0x0
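The two fields that matter in the BIOS output above are the socket/die/channel location and SysAddr. A small sketch for pulling them out of a captured serial log is given here for illustration only. It assumes the line formats are exactly as shown above and that SysAddr is printed in hexadecimal without a 0x prefix; the function name parse_bios_ras is hypothetical.

```python
import re

# Patterns follow the NOTICE lines shown above; other BIOS versions may differ.
LOCATION_RE = re.compile(r"RASC socket\[(\d+)\]die\[(\d+)\]channel\[(\d+)\]")
SYSADDR_RE = re.compile(r"SysAddr=([0-9a-fA-F]+)")


def parse_bios_ras(lines):
    """Collect the DIMM location and system address from BIOS NOTICE lines."""
    info = {}
    for line in lines:
        loc = LOCATION_RE.search(line)
        if loc:
            info["socket"], info["die"], info["channel"] = map(int, loc.groups())
        addr = SYSADDR_RE.search(line)
        if addr:
            # Assumption: the value is hexadecimal without a 0x prefix.
            info["sys_addr"] = int(addr.group(1), 16)
    return info


if __name__ == "__main__":
    log = [
        "NOTICE: RASC socket[0]die[3]channel[3]",
        "NOTICE: [CollectArerErrorData]:[226L]SysAddr=4000000300:",
    ]
    info = parse_bios_ras(log)
    print(info["socket"], info["die"], info["channel"], hex(info["sys_addr"]))
```

The parsed sys_addr can then be compared with the physical_address that the kernel reports for the same event.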
During OS boot, Hardware Error messages are printed:
[ 27.740329] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
[ 27.753985] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[ 27.791954] {1}[Hardware Error]: event severity: corrected
[ 27.791957] {1}[Hardware Error]: Error 0, type: corrected
[ 27.791959] {1}[Hardware Error]: section_type: memory error
[ 27.814227] {1}[Hardware Error]: physical_address: 0x0000004000000300 # same physical address
[ 27.830193] {1}[Hardware Error]: node: 0 rank: 0 bank: 0 row: 0 column: 0
[ 27.830197] {1}[Hardw0 row:0 col:0 page:0x400000 offset:0x300 grain:0 syndrome:0x0)
Inside the OS, edac-util -v (from the edac-utils package) shows the corrected errors:
edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 13 Corrected Errors with no DIMM info # corrected errors
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: mc#0memory#0: 0 Corrected Errors
mc0: csrow10: 0 Uncorrected Errors
mc0: csrow10: mc#0memory#10: 0 Corrected Errors
mc0: csrow12: 0 Uncorrected Errors
mc0: csrow12: mc#0memory#12: 0 Corrected Errors
mc0: csrow14: 0 Uncorrected Errors
mc0: csrow14: mc#0memory#14: 0 Corrected Errors
mc0: csrow16: 0 Uncorrected Errors
mc0: csrow16: mc#0memory#16: 0 Corrected Errors
mc0: csrow18: 0 Uncorrected Errors
mc0: csrow18: mc#0memory#18: 0 Corrected Errors
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: mc#0memory#2: 0 Corrected Errors
mc0: csrow20: 0 Uncorrected Errors
mc0: csrow20: mc#0memory#20: 0 Corrected Errors
mc0: csrow22: 0 Uncorrected Errors
mc0: csrow22: mc#0memory#22: 0 Corrected Errors
mc0: csrow24: 0 Uncorrected Errors
mc0: csrow24: mc#0memory#24: 0 Corrected Errors
mc0: csrow26: 0 Uncorrected Errors
mc0: csrow26: mc#0memory#26: 0 Corrected Errors
mc0: csrow28: 0 Uncorrected Errors
mc0: csrow28: mc#0memory#28: 0 Corrected Errors
mc0: csrow30: 0 Uncorrected Errors
mc0: csrow30: mc#0memory#30: 0 Corrected Errors
mc0: csrow4: 0 Uncorrected Errors
mc0: csrow4: mc#0memory#4: 0 Corrected Errors
mc0: csrow6: 0 Uncorrected Errors
mc0: csrow6: mc#0memory#6: 0 Corrected Errors
mc0: csrow8: 0 Uncorrected Errors
mc0: csrow8: mc#0memory#8: 0 Corrected Errors
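Most of the edac-util -v output is zero counters, so the one line that matters is easy to miss. The sketch below filters the output down to the non-zero counters. It is an illustration that assumes the line format shown above; the function name nonzero_counts is hypothetical.

```python
import re

# "<location>: <count> Corrected|Uncorrected Errors", as printed by edac-util -v above.
LINE_RE = re.compile(
    r"^(?P<where>.+?):\s+(?P<count>\d+)\s+(?P<kind>Corrected|Uncorrected) Errors"
)


def nonzero_counts(text):
    """Return (location, kind, count) for every counter above zero."""
    hits = []
    for raw in text.splitlines():
        m = LINE_RE.match(raw.strip())
        if m and int(m.group("count")) > 0:
            hits.append((m.group("where"), m.group("kind"), int(m.group("count"))))
    return hits


if __name__ == "__main__":
    sample = """\
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 13 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: mc#0memory#0: 0 Corrected Errors
"""
    for where, kind, count in nonzero_counts(sample):
        print(f"{where}: {count} {kind}")
```

On a live system the text could come from subprocess.run(["edac-util", "-v"], capture_output=True, text=True).stdout. In this case the only non-zero counter is the one "with no DIMM info", which is why the DIMM has to be located from the BIOS log and not from EDAC.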
Inside the OS, dmesg shows the same corrected error being reported repeatedly:
[ 2624.662038] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
[ 2624.662200] {3}[Hardware Error]: It has been corrected by h/w and requires no further action
[ 2624.662396] {3}[Hardware Error]: event severity: corrected
[ 2624.662526] {3}[Hardware Error]: Error 0, type: corrected
[ 2624.662654] {3}[Hardware Error]: section_type: memory error
[ 2624.662784] {3}[Hardware Error]: physical_address: 0x0000004000000300 # same physical address
[ 2624.662941] {3}[Hardware Error]: node: 0 rank: 0 bank: 0 row: 0 column: 0
[ 2624.663102] {3}[Hardware Error]: error_type: 16, unknown
[ 2624.663236] EDAC MC0: 1 CE reserved error (16) on unknown label (node:0 rank:0 bank:0 row:0 col:0 page:0x400000 offset:0x300 grain:0 syndrome:0x0)
[12083.123880] {4}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
[12083.124069] {4}[Hardware Error]: It has been corrected by h/w and requires no further action
[12083.124279] {4}[Hardware Error]: event severity: corrected
[12083.124417] {4}[Hardware Error]: Error 0, type: corrected
[12083.124557] {4}[Hardware Error]: section_type: memory error
[12083.124702] {4}[Hardware Error]: physical_address: 0x0000004000000300
[12083.124870] {4}[Hardware Error]: node: 0 rank: 0 bank: 0 row: 0 column: 0
[12083.125043] {4}[Hardware Error]: error_type: 16, unknown
[12083.125188] EDAC MC0: 1 CE reserved error (16) on unknown label (node:0 rank:0 bank:0 row:0 col:0 page:0x400000 offset:0x300 grain:0 syndrome:0x0)
[12383.322871] {5}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
[12383.323060] {5}[Hardware Error]: It has been corrected by h/w and requires no further action
[12383.323269] {5}[Hardware Error]: event severity: corrected
[12383.323409] {5}[Hardware Error]: Error 0, type: corrected
[12383.323546] {5}[Hardware Error]: section_type: memory error
[12383.323692] {5}[Hardware Error]: physical_address: 0x0000004000000300
[12383.323857] {5}[Hardware Error]: node: 0 rank: 0 bank: 0 row: 0 column: 0
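Every record above names the same physical address. To check this over a long dmesg capture, the records can be counted per address. This is a minimal sketch that assumes the APEI record format shown above; the function name count_by_address is hypothetical.

```python
import re
from collections import Counter

# Matches the physical_address line of an APEI "[Hardware Error]" record.
ADDR_RE = re.compile(
    r"\[Hardware Error\]:\s*physical_address:\s*(0x[0-9a-fA-F]+)"
)


def count_by_address(dmesg_text):
    """Count how many memory error records name each physical address."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        m = ADDR_RE.search(line)
        if m:
            counts[int(m.group(1), 16)] += 1
    return counts


if __name__ == "__main__":
    sample = """\
[ 2624.662784] {3}[Hardware Error]: physical_address: 0x0000004000000300
[12083.124702] {4}[Hardware Error]: physical_address: 0x0000004000000300
[12383.323692] {5}[Hardware Error]: physical_address: 0x0000004000000300
"""
    for addr, n in count_by_address(sample).most_common():
        print(f"{addr:#x}: {n} record(s)")
```

A single address collecting all of the reports suggests one failing location on one DIMM, which is consistent with the fix described below.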
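Solution:
Remove the DIMM indicated in the BIOS boot log, and the errors disappear. Which physical DIMM this is has to be determined from the BIOS output (socket/die/channel) together with the EVB board layout.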