[thunderx] Synchronous External Abort: synchronous parity or ECC error

Bug #1860013 reported by dann frazier
This bug affects 1 person
Affects          Status     Importance  Assigned to  Milestone
linux (Ubuntu)   Triaged    Undecided   Unassigned
  Bionic         Confirmed  Undecided   Unassigned
  Disco          Won't Fix  Undecided   Unassigned
  Eoan           Triaged    Undecided   Unassigned
  Focal          Triaged    Undecided   Unassigned

Bug Description

[Impact]
Under load, ThunderX systems eventually fail with:

[ 282.360376] Synchronous External Abort: synchronous parity or ECC error (0x96000018) at 0x0000ffffa6eb7000
[ 282.372351] Internal error: : 96000018 [#1] SMP
[ 282.379152] Modules linked in: nls_iso8859_1 thunderx_edac thunderx_zip shpchp cavium_rng_vf cavium_rng gpio_keys uio_pdrv_genirq uio ipmi_ssif ipmi_devintf ipmi_msghandler sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear nicvf nicpf uas usb_storage ast i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt aes_ce_blk fb_sys_fops aes_ce_cipher drm crc32_ce crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce ahci libahci thunder_bgx thunder_xcv i2c_thunderx mdio_thunder thunderx_mmc mdio_cavium aes_neon_bs aes_neon_blk crypto_simd cryptd aes_arm64
[ 282.467284] Process cc1 (pid: 39700, stack limit = 0x00000000e0c44146)
[ 282.477172] CPU: 25 PID: 39700 Comm: cc1 Not tainted 4.15.0-75-generic #85+lp1857074.1
[ 282.488379] Hardware name: Cavium ThunderX CRB/To be filled by O.E.M., BIOS 5.11 12/12/2012
[ 282.500121] pstate: 80000005 (Nzcv daif -PAN -UAO)
[ 282.508297] pc : __arch_copy_to_user+0x13c/0x248
[ 282.516430] lr : cp_new_stat+0x140/0x178
[ 282.523768] sp : ffff00002e4d3d40
[ 282.530369] x29: ffff00002e4d3d40 x28: ffff801f51fa2d00
[ 282.538988] x27: ffff000008b52000 x26: 0000000000000050
[ 282.548031] x25: 0000000000000124 x24: 0000000000000015
[ 282.556872] x23: 0000000000000000 x22: 000000002e4d3d88
[ 282.565449] x21: ffff801f51fa2d00 x20: ffff000009588000
[ 282.574109] x19: ffff00002e4d3e30 x18: 0000ffffa87e7a70
[ 282.582790] x17: 0000ffffa8756110 x16: ffff0000082f4448
[ 282.591433] x15: 0000000000000000 x14: 0000000000000012
[ 282.599986] x13: 00682e6c746e6366 x12: 2f78756e696c2f69
[ 282.608730] x11: 0000000000000000 x10: 0000000000000cf0
[ 282.617283] x9 : 0000000000001000 x8 : 00000001000081a4
[ 282.625839] x7 : 0000000001001a2b x6 : 000000002e4d3da0
[ 282.634238] x5 : 000000002e4d3e08 x4 : 0000000000000008
[ 282.642754] x3 : 0000000000000802 x2 : fffffffffffffff8
[ 282.651250] x1 : ffff00002e4d3d90 x0 : 000000002e4d3d88
[ 282.660013] Call trace:
[ 282.665421] __arch_copy_to_user+0x13c/0x248
[ 282.672979] SyS_newfstat+0x58/0x88
[ 282.679272] el0_svc_naked+0x30/0x34
[ 282.685605] Code: a8c12027 a88120c7 d503201f d503201f (a8c12829)
[ 282.694411] ---[ end trace 863693cf0c3fd297 ]---
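
For readers unfamiliar with the value in parentheses, 0x96000018 is the arm64 exception syndrome (ESR_EL1) for the abort. The following is a minimal sketch of how it splits into fields, assuming the ARMv8-A ESR layout (EC in bits 31:26, IL in bit 25, ISS in bits 24:0, with WnR in bit 6 and DFSC in bits 5:0 for data aborts). The function name and the two lookup tables are illustrative, and the tables list only the codes relevant to this report.

```python
# Illustrative decoder for an arm64 ESR value as printed in the oops above.
# Field layout assumed from the ARMv8-A architecture: EC[31:26], IL[25], ISS[24:0].

EC_NAMES = {
    0x24: "data abort from a lower exception level",
    0x25: "data abort taken without a change in exception level",
}

DFSC_NAMES = {
    0x10: "synchronous external abort, not on translation table walk",
    0x18: "synchronous parity or ECC error on memory access, "
          "not on translation table walk",
}


def decode_esr(esr):
    """Split a 32-bit ESR value into the fields relevant to a data abort."""
    iss = esr & 0x1FFFFFF
    return {
        "ec": (esr >> 26) & 0x3F,   # exception class
        "il": (esr >> 25) & 0x1,    # 1 = 32-bit instruction
        "wnr": (iss >> 6) & 0x1,    # 1 = write, 0 = read
        "dfsc": iss & 0x3F,         # data fault status code
    }


if __name__ == "__main__":
    fields = decode_esr(0x96000018)
    print("EC   0x%02x: %s" % (fields["ec"], EC_NAMES.get(fields["ec"], "other")))
    print("DFSC 0x%02x: %s" % (fields["dfsc"], DFSC_NAMES.get(fields["dfsc"], "other")))
    print("access:", "write" if fields["wnr"] else "read")
```

Under these assumptions the value decodes to EC 0x25 and DFSC 0x18, which matches the "synchronous parity or ECC error" text the kernel prints.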

[Test Case]
We found this by running a reboot/kernel build loop. (The reboot may be unnecessary.) Code to automate this setup is at:
  https://code.launchpad.net/~dannf/+git/kernel-build-reboot-loop

[Fix]
[Regression Risk]

dann frazier (dannf)
Changed in linux (Ubuntu Bionic):
status: New → Confirmed
Changed in linux (Ubuntu Disco):
status: New → Triaged
Changed in linux (Ubuntu Eoan):
status: New → Triaged
Changed in linux (Ubuntu Focal):
status: New → Triaged
dann frazier (dannf) wrote :

Also reproducible w/ the 5.0.0-37.40 kernel. I'll try a mainline 5.5-rc6 build next.

[ 602.796765] Internal error: synchronous parity or ECC error: 96000018 [#1] SMP
[ 602.803994] Modules linked in: nls_iso8859_1 cavium_rng_vf ipmi_ssif ipmi_devintf input_leds joydev ipmi_msghandler thunderx_edac cavium_rng sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq libcrc32c raid1 raid0 multipath linear aes_ce_blk aes_ce_cipher nicvf cavium_ptp ast i2c_algo_bit ttm drm_kms_helper crct10dif_ce ghash_ce syscopyarea sysfillrect sha2_ce sysimgblt uas hid_generic nicpf fb_sys_fops sha256_arm64 drm sha1_ce usbhid usb_storage hid thunder_bgx ahci thunder_xcv i2c_thunderx mdio_thunder thunderx_mmc mdio_cavium aes_neon_bs aes_neon_blk crypto_simd cryptd aes_arm64
[ 602.872414] Process cc1 (pid: 40126, stack limit = 0x0000000090887c2f)
[ 602.878949] CPU: 10 PID: 40126 Comm: cc1 Not tainted 5.0.0-37-generic #40~18.04.1-Ubuntu
[ 602.887040] Hardware name: GIGABYTE R120-T33/MT30-GS1, BIOS T49 02/02/2018
[ 602.893921] pstate: 80000005 (Nzcv daif -PAN -UAO)
[ 602.898724] pc : __arch_copy_to_user+0x13c/0x248
[ 602.903353] lr : cp_new_stat+0x140/0x178
[ 602.907277] sp : ffff00002599bcc0
[ 602.910594] x29: ffff00002599bcc0 x28: ffff800ed0538ec0
[ 602.915912] x27: 0000000000000000 x26: 0000000000000000
[ 602.921229] x25: 0000000056000000 x24: 0000000000000015
[ 602.926547] x23: ffff000010c716d8 x22: 000000002599bd08
[ 602.931865] x21: ffff800ed0538ec0 x20: ffff00001170c000
[ 602.937181] x19: ffff00002599bdb0 x18: 0000000000000000
[ 602.942498] x17: 0000000000000000 x16: 0000000000000000
[ 602.947818] x15: 0000000000000000 x14: 0000000000000000
[ 602.953134] x13: 0000000000000000 x12: 0000000000000000
[ 602.958452] x11: 0000000000000000 x10: 000000000000152f
[ 602.963769] x9 : 0000000000001000 x8 : 00000001000081a4
[ 602.969087] x7 : 0000000000a60da3 x6 : 000000002599bd20
[ 602.974405] x5 : 000000002599bd88 x4 : 0000000000000008
[ 602.979721] x3 : 0000000000000802 x2 : fffffffffffffff8
[ 602.985038] x1 : ffff00002599bd10 x0 : 000000002599bd08
[ 602.990356] Call trace:
[ 602.992821] __arch_copy_to_user+0x13c/0x248
[ 602.997107] __se_sys_newfstat+0x58/0x88
[ 603.001045] __arm64_sys_newfstat+0x20/0x30
[ 603.005243] el0_svc_common+0x88/0x180
[ 603.009005] el0_svc_handler+0x38/0x78
[ 603.012770] el0_svc+0x8/0xc
[ 603.015664] Code: a8c12027 a88120c7 d503201f d503201f (a8c12829)
[ 603.021765] ---[ end trace 08068f2978fb8211 ]---

dann frazier (dannf) wrote :

All 3 of my machines survived overnight testing on the 5.5-rc6 mainline build[*].
Next step is to try 5.3. 5.3 mainline doesn't boot on these systems, so I'll use Ubuntu's 5.3.0-24.

[*] https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.5-rc6/

dann frazier (dannf) wrote :
dann frazier (dannf) wrote :
dann frazier (dannf) wrote :
dann frazier (dannf) wrote :
dann frazier (dannf) wrote :
dann frazier (dannf) wrote :

I attempted to bisect this, using the following process:
  - Run the kernel-build-reboot-loop test on 3 machines in parallel
    I used 2 CRB1S systems (anuchin, bestovius) and 1 R120-T33 (seidel)
  - If any machine crashes w/ the parity error message, consider it failed
  - If all machines survive overnight, consider it "OK".
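
The pass/fail rule above can be sketched as follows. This is only an illustration of the decision logic, not the tooling actually used: it assumes each machine's console output for the overnight run has been captured as text, and the function and variable names are hypothetical.

```python
# Illustrative verdict for one bisect step: any machine that logged the
# parity/ECC abort marks the kernel under test as failed.

ERROR_MARKER = "synchronous parity or ECC error"


def step_verdict(console_logs):
    """console_logs maps machine name -> captured console text (assumed input)."""
    crashed = sorted(
        name for name, text in console_logs.items() if ERROR_MARKER in text
    )
    return ("failed", crashed) if crashed else ("OK", [])


if __name__ == "__main__":
    logs = {
        "anuchin": "Internal error: synchronous parity or ECC error: 96000018 [#1] SMP",
        "bestovius": "kernel build loop still running",
        "seidel": "kernel build loop still running",
    }
    print(step_verdict(logs))
```

Because a single crash on any machine decides the step, one machine with a hardware fault would skew every step the same way, which is the concern raised in the note further down.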

Unfortunately, the commit it landed on looks bogus:

# first bad commit: [852643165aea0999bb862b36511c5b9f6b11449f] fs//binfmt_elf.c: move variables initialization closer to their usage
(Reverse bisect - this would in theory be the commit that *fixed* it)

Just in case, I tried reverting that commit from 5.5-rc6. As noted in comment #2, 5.5-rc6 seems immune to this problem. Reverting the commit didn't change that: 5.5-rc6 still survived overnight.

Note: Of the 3 systems, anuchin was usually the one that failed during the bisect. It could be that this is a generic hw issue, and anuchin is just more severely impacted than the others. It could also be that this symptom can be caused by both a sw and a hw issue, and anuchin is impacted by the hw part, making it a bad choice for a bisect. Either way, bisection seems like a poor strategy for identifying the issue.

dann frazier (dannf) wrote :

Looking at the git log - I wonder if this could be related?

commit 94bb804e1e6f0a9a77acf20d7c70ea141c6c821e
Author: Pavel Tatashin <email address hidden>
Date: Tue Nov 19 17:10:06 2019 -0500

    arm64: uaccess: Ensure PAN is re-enabled after unhandled uaccess fault

It's interesting because ThunderX is somewhat unique in our test cluster in not having HW PAN.
We also only recently merged this into our 4.15 tree - Ubuntu-4.15.0-73.82 was the first tree to have it. I'll restart testing on our latest 4.15 (w/ this patch) to see if the issue persists.

dann frazier (dannf) wrote :

The patch I highlighted in Comment #9 appears to be unrelated - 4.15.0-76 still fails even though it has the patch. A test build of 4.15.0-76 w/ the patch reverted also fails.

dann frazier (dannf)
Changed in linux (Ubuntu Disco):
status: Triaged → Won't Fix
Juerg Haefliger (juergh) wrote :

Is this related to the boot regression that we fixed or a different problem? I.e., can you still reproduce this with latest Bionic?

dann frazier (dannf) wrote :

@Juery: I have no reason to believe this is related to the boot regression we fixed (bug 1857074). I haven't re-tested lately, but as of 4.15.0-76 it was still reproducible.

dann frazier (dannf) wrote :

sorry s/Juery/Juerg/!
