Comment 8 for bug 1723127

Dan Streetman (ddstreet) wrote :

Sorry for the delay.

We have two options for continuing to debug this:

1. We can try a traditional git bisect. This involves testing a series of kernel builds to narrow the issue down to the specific commit that fixed it. It's a long-ish process, depending on how long each build takes to test, and it's critical that the 'good' or 'bad' verdict at each step is correct - otherwise the bisect ends at the wrong commit. At each step I build a new kernel, and you test with it until it fails or until you've tested long enough to be sure that build is 'good'. With hard-to-reproduce problems like this, bisecting can be tough: a build that hasn't failed for a long time isn't necessarily 'good', it may just not have failed yet. If such a build gets marked 'good', the bisect ends at the wrong commit, which doesn't help with figuring out how to fix anything.

2. Intel has provided me with some undocumented commands that control which MDD events the NIC triggers on. I can provide those instructions, and you can test with each MDD event bit set individually until the problem reproduces - then we know exactly which MDD source triggered the event, which should help identify what the driver did to cause it. This approach has a much better chance of finding the specific problem, but the downside is that you'll need to run undocumented commands on your hardware. I believe there should not be any risk in doing that since the info came from Intel, but I can't personally verify it, as I don't currently have access to this specific NIC.
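
To make "each MDD event bit set individually" in option 2 concrete, here is a rough Python sketch of the test plan. It only enumerates one-bit masks; it does not touch the hardware. The mask width (WIDTH) is a placeholder I made up for illustration, since the real number of MDD event bits and the actual commands are not given in this comment.

```python
def single_bit_masks(width):
    """Yield (bit_index, mask) pairs where each mask has exactly one bit set."""
    for bit in range(width):
        yield bit, 1 << bit


# WIDTH is a hypothetical placeholder, not the real size of the MDD event mask.
WIDTH = 8

for bit, mask in single_bit_masks(WIDTH):
    # One test run per line: enable only this bit, then wait for a reproduction.
    print(f"run {bit}: enable only bit {bit} (mask 0x{mask:02x})")
```

Whichever run reproduces the problem points at the MDD source for that bit.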

If you're willing to try #2, I'll add the specific commands/instructions and you can get started testing. Otherwise, if you would prefer not to run the undocumented commands, I can start a kernel bisect.
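
For a rough feel of why the bisect in option 1 is slow and fragile with an intermittent failure, here is a back-of-the-envelope sketch in Python. It assumes failures on a 'bad' kernel arrive at random with a constant average rate (a memoryless model), which may not match the real behavior, and every number used with it is hypothetical rather than measured from this bug.

```python
import math


def bisect_steps(num_commits):
    """Approximate number of kernel builds a bisect over num_commits needs."""
    return math.ceil(math.log2(num_commits))


def hours_to_call_good(mean_hours_to_failure, false_good_prob):
    """Hours to test one build so that a bad build survives that long
    with probability at most false_good_prob.

    Assumes a memoryless (exponential) time to failure.
    """
    return mean_hours_to_failure * math.log(1.0 / false_good_prob)


def chance_bisect_is_right(steps, false_good_prob):
    """Pessimistic estimate: treats every step as a bad build that could be
    mistaken for a good one, and assumes 'bad' verdicts are never wrong."""
    return (1.0 - false_good_prob) ** steps


# Hypothetical inputs: 1000 candidate commits, one failure per 24 hours on
# average, and a 5% accepted chance of wrongly calling a bad build 'good'.
steps = bisect_steps(1000)
hours = hours_to_call_good(24, 0.05)
print(f"{steps} builds, about {hours:.0f} hours of testing per build")
print(f"chance of a correct result: {chance_bisect_is_right(steps, 0.05):.0%}")
```

With those made-up inputs, that works out to 10 builds at about 72 hours each, and even then only around a 60% chance that the bisect lands on the right commit in the pessimistic case.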