Comment 21 for bug 1857074

dann frazier (dannf) wrote : Re: [Bug 1857074] Re: Cavium ThunderX CN88XX Panic : Unknown reason

On Wed, Jan 15, 2020 at 11:28 PM Juerg Haefliger
<email address hidden> wrote:
>
> On Thu, 16 Jan 2020 02:14:16 -0000
> dann frazier <email address hidden> wrote:
>
> > I built a kernel with the proposed patches[*] and ran a reboot/kernel
> > compile test on 4 systems. The tests survived 46 total iterations
> > (~12/system) before I interrupted. Two systems failed with "Synchronous
> > External Abort: synchronous parity or ECC error" errors.
> >
> > I've reverted the systems back to 4.15.0-70 - the kernel before the
> > cpufeature/errata patches that caused this - to see if these SEA errors
> > are a regression.
> >
> > [*] https://lists.ubuntu.com/archives/kernel-team/2020-January/106909.html
> >
>
> I've run 75 iterations of reboot/compile-kernel and encountered 3 gcc
> segmentation faults. Unfortunately, my test didn't capture the dmesg log but
> it's likely that these are due to the ECC problems we're (still?) seeing.

I've seen those on every machine so far when run long enough. Since I
believe we've clearly demonstrated that this is an unrelated failure,
I've split it out into bug 1860013 - let's track it there.

> There was also another issue during one of the reboots which is probably
> unrelated and due to a flaky BMC:

Let's track that in bug 1857073. Even if it is a flaky BMC, the IPMI
driver should handle the failure gracefully.
Did you see this on host 'wright' as well?

 -dann