Comment 18 for bug 1857074

Juerg Haefliger (juergh) wrote : Re: [Bug 1857074] Re: Cavium ThunderX CN88XX Panic : Unknown reason

On Thu, 16 Jan 2020 02:14:16 -0000
dann frazier <email address hidden> wrote:

> I built a kernel with the proposed patches[*] and ran a reboot/kernel
> compile test on 4 systems. The tests survived 46 total iterations
> (~12/system) before I interrupted. Two systems failed with "Synchronous
> External Abort: synchronous parity or ECC error" errors.
>
> I've reverted the systems back to 4.15.0-70 - the kernel before the
> cpufeature/errata patches that caused this - to see if these SEA errors
> are a regression.
>
> [*] https://lists.ubuntu.com/archives/kernel-team/2020-January/106909.html
>

I've run 75 iterations of reboot/compile-kernel and encountered 3 gcc
segmentation faults. Unfortunately, my test didn't capture the dmesg log, but
these are likely due to the ECC problems we're (still?) seeing.

There was also another issue during one of the reboots which is probably
unrelated and due to a flaky BMC:

[ 33.896320] ipmi_ssif 0-0012: IPMI message handler: device id demangle failed: -22
[ 33.896354] ipmi_ssif 0-0012: Unable to get the device id: -5
[ 33.987825] ipmi_ssif 0-0012: Found new BMC (man_id: 0x000000, prod_id: 0xaabb, dev_id: 0x20)
[ 33.987858] Unable to handle kernel read from unreadable memory at virtual address 00000018
[ 33.999300] Mem abort info:
[ 34.005475] ESR = 0x96000004
[ 34.011454] Exception class = DABT (current EL), IL = 32 bits
[ 34.020168] SET = 0, FnV = 0
[ 34.025893] EA = 0, S1PTW = 0
[ 34.031617] Data abort info:
[ 34.037060] ISV = 0, ISS = 0x00000004
[ 34.043448] CM = 0, WnR = 0
[ 34.048949] user pgtable: 4k pages, 48-bit VAs, pgd = 000000002799ee91
[ 34.058063] [0000000000000018] *pgd=0000000000000000
[ 34.065624] Internal error: Oops: 96000004 [#1] SMP
[ 34.073090] Modules linked in: nls_iso8859_1 sch_fq_codel thunderx_zip thunderx_edac ib_iser cavium_rng_vf rdma_cm ipmi_ssif(+) ipmi_devintf shpchp cavium_rng iw_cm ipmi_msghandler ib_cm gpio_keys uio_pdrv_genirq uio ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 libcrc32c async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 multipath linear ast i2c_algo_bit drm_kms_helper nicvf syscopyarea sysfillrect sysimgblt fb_sys_fops ttm nicpf drm aes_ce_blk aes_ce_cipher crc32_ce crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce ahci thunder_bgx libahci i2c_thunderx thunder_xcv mdio_thunder thunderx_mmc mdio_cavium aes_neon_bs aes_neon_blk crypto_simd cryptd aes_arm64
[ 34.161807] Process kworker/64:1 (pid: 651, stack limit = 0x00000000b0697881)
[ 34.172016] CPU: 64 PID: 651 Comm: kworker/64:1 Not tainted 4.15.18+ #40
[ 34.181723] Hardware name: Cavium ThunderX CRB/To be filled by O.E.M., BIOS 5.11 12/12/2012
[ 34.193113] Workqueue: events redo_bmc_reg [ipmi_msghandler]
[ 34.201840] pstate: 80400005 (Nzcv daif +PAN -UAO)
[ 34.209589] pc : smi_send.isra.4+0x80/0x158 [ipmi_msghandler]
[ 34.218275] lr : smi_send.isra.4+0x150/0x158 [ipmi_msghandler]
[ 34.227046] sp : ffff0000128c3b10
[ 34.233209] x29: ffff0000128c3b10 x28: 0000000000000020
[ 34.241305] x27: 0000000000000002 x26: 0000000000000000
[ 34.249437] x25: ffff0000128c3c40 x24: ffff0000128c3c38
[ 34.257455] x23: 0000000000000000 x22: 0000000000000018
[ 34.265500] x21: 0000000000000000 x20: ffff810fb16c8800
[ 34.273558] x19: ffff800faffb0000 x18: ffffffffffffffff
[ 34.281643] x17: 0000000000000005 x16: 0000000000000000
[ 34.289758] x15: ffff000009578c08 x14: ffff810fb0d20187
[ 34.297899] x13: ffff810fb0d20186 x12: 0000000000000030
[ 34.305997] x11: 0101010101010101 x10: ffff7f7f7f7f7f7f
[ 34.314069] x9 : fefdfefefefefeff x8 : ffff810fb16c8800
[ 34.322166] x7 : 0000000000001138 x6 : 000000000000125c
[ 34.330300] x5 : 00000000000000dc x4 : ffff810fbc8f1340
[ 34.338456] x3 : 0000000000000000 x2 : 0000000000000000
[ 34.346633] x1 : ffff810fb16c8800 x0 : ffff810fae4ff800
[ 34.354839] Call trace:
[ 34.360207] smi_send.isra.4+0x80/0x158 [ipmi_msghandler]
[ 34.368450] i_ipmi_request+0x2ac/0x980 [ipmi_msghandler]
[ 34.376716] send_channel_info_cmd+0xac/0xd8 [ipmi_msghandler]
[ 34.385396] __scan_channels.isra.20+0x84/0x180 [ipmi_msghandler]
[ 34.394341] __bmc_get_device_id+0x424/0x8c8 [ipmi_msghandler]
[ 34.402994] redo_bmc_reg+0x6c/0x70 [ipmi_msghandler]
[ 34.410840] process_one_work+0x1e0/0x420
[ 34.417640] worker_thread+0x4c/0x478
[ 34.420416] IPv6: ADDRCONF(NETDEV_UP): enP2p1s0f2: link is not ready
[ 34.424073] kthread+0x134/0x138
[ 34.424081] ret_from_fork+0x10/0x18
[ 34.424089] Code: f908aa74 b4ffff74 f9424e60 aa1403e1 (f94002c2)
[ 34.454826] ---[ end trace b54ad269f357375f ]---
[ 34.467956] ipmi_ssif: Unable to register device: error -5
[ 34.476380] ipmi_ssif 0-0012: Unable to start IPMI SSIF: -5
[ 34.484925] ipmi_ssif: probe of 0-0012 failed with error -5
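For anyone reading the oops above, its two raw numbers can be checked by hand. The sketch below decodes ESR = 0x96000004 using the architectural ESR_ELx layout (EC in bits [31:26], IL in bit 25, ISS in bits [24:0], plus the data-abort ISS subfields), and decodes the instruction shown in parentheses on the "Code:" line, which is the one at the faulting pc. The helper names (`bits`, `decode_esr`, `describe_dfsc`, `decode_ldr_imm`) are hypothetical, and the EC and DFSC tables cover only the codes needed here; this is a reading aid, not a general ESR decoder.

```python
# Hypothetical helpers for hand-decoding an arm64 oops. Field layouts follow
# the Arm Architecture Reference Manual; only the subset used here is covered.

EC_NAMES = {
    0x24: "DABT (lower EL)",
    0x25: "DABT (current EL)",
}


def bits(value: int, hi: int, lo: int) -> int:
    """Return bits [hi:lo] (inclusive) of value."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_esr(esr: int) -> dict:
    """Split an ESR_ELx value into EC/IL/ISS and the data-abort ISS subfields."""
    iss = bits(esr, 24, 0)
    return {
        "EC": bits(esr, 31, 26),                # exception class
        "IL": 32 if bits(esr, 25, 25) else 16,  # trapped instruction length
        "ISS": iss,
        "ISV": bits(iss, 24, 24),  # instruction syndrome valid
        "SET": bits(iss, 12, 11),  # synchronous error type
        "FnV": bits(iss, 10, 10),  # FAR not valid
        "EA": bits(iss, 9, 9),     # external abort
        "CM": bits(iss, 8, 8),     # cache maintenance operation
        "S1PTW": bits(iss, 7, 7),  # stage-2 fault during a stage-1 table walk
        "WnR": bits(iss, 6, 6),    # 1 = write, 0 = read
        "DFSC": bits(iss, 5, 0),   # data fault status code
    }


def describe_dfsc(dfsc: int) -> str:
    """Name translation faults (DFSC 0b0001LL); other codes are printed raw."""
    if (dfsc >> 2) == 0b0001:
        return f"level {dfsc & 0b11} translation fault"
    return f"DFSC {dfsc:#04x}"


def decode_ldr_imm(insn: int):
    """Decode 'LDR Xt, [Xn, #imm]' (64-bit, unsigned immediate offset).

    Returns (Rt, Rn, byte_offset), or None if insn is not that encoding.
    """
    if (insn & 0xFFC00000) != 0xF9400000:
        return None
    rt = bits(insn, 4, 0)
    rn = bits(insn, 9, 5)
    offset = bits(insn, 21, 10) * 8  # imm12 is scaled by the 8-byte access size
    return rt, rn, offset


if __name__ == "__main__":
    f = decode_esr(0x96000004)
    print(f"EC = {f['EC']:#x} {EC_NAMES.get(f['EC'], '?')}, IL = {f['IL']} bits")
    print(f"ISV = {f['ISV']}, ISS = {f['ISS']:#010x}, WnR = {f['WnR']}")
    print(describe_dfsc(f["DFSC"]))
    rt, rn, off = decode_ldr_imm(0xF94002C2)
    print(f"faulting insn: ldr x{rt}, [x{rn}, #{off}]")
```

Run as a script, this reproduces the kernel's own decode (EC 0x25, IL 32, ISS 0x00000004, WnR 0), reports a level 0 translation fault (consistent with `*pgd=0000000000000000`), and shows the faulting instruction as `ldr x2, [x22, #0]`. Since x22 holds 0x18 in the register dump, that is consistent with the fault address 00000018, i.e. a load through a near-NULL pointer in `smi_send`.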