Merge lp:~ev/apport/reports-from-hangs into lp:apport
| Status: | Merged |
|---|---|
| Merged at revision: | 2423 |
| Proposed branch: | lp:~ev/apport/reports-from-hangs |
| Merge into: | lp:apport |
| Diff against target: |
480 lines (+228/-23) 7 files modified
apport/fileutils.py (+13/-0) apport/ui.py (+71/-2) data/apport (+9/-0) gtk/apport-gtk (+46/-5) test/test_fileutils.py (+17/-0) test/test_ui.py (+34/-16) test/test_ui_gtk.py (+38/-0) |
| To merge this branch: | bzr merge lp:~ev/apport/reports-from-hangs |
| Related bugs: |
| Reviewer | Review Type | Date Requested | Status |
|---|---|---|---|
| Martin Pitt | 2012-06-20 | Approve on 2012-07-04 | |
|
Review via email:
|
|||
Description of the Change
This branch is not ready yet. I need to write tests (I know, I know, tests first) and build support in both the console and KDE interfaces. However, as I press on I want to make sure that you find the general approach to be sensible.
The branch adds the ability to specify that a provided pid is that of a hanging application. In the near future, compiz will call apport with these arguments when it detects that such a condition is occurring.
After review with the security team, it was determined that the safest approach to obtaining a stack trace from these hangs was through the core pipe handler. Therefore, the workflow is as follows:
- Compiz calls apport with --hanging $PID
- Apport collects some basic information to present a dialog:
https:/
- A key is added to the report denoting whether the application needs to be restarted, based on the user's response.
- Apport sends the application SIGSEGV.
- The apport core pipe handler picks this up and identifies that we already have a "Hang" report for this binary. It appends the core file to the existing report.
- update-notifier notices that the file has changed and calls apport-gtk.
- apport-gtk presents no UI, but restarts the application and tells whoopsie to send the report, if these options were selected.
So, what do you think?
Background:
https:/
https:/
https:/
| Martin Pitt (pitti) wrote : | # |
The compiz binary detects that the application that it's painting for is hanging because it measures the time between glPaint calls (lp:compiz plugins/
Perhaps I'm missing something obvious, or misunderstanding how the window manager works? I agree that a complicated state machine is not ideal, and I am definitely open to alternatives.
Now, one could argue that compiz should send SIBABRT to the process, rather than making apport do it and thus simplifying the state that apport has to track. I would like to avoid this if possible. It means we cannot present the dialog while the application is still visible and it would require us to at least rework the text. "Force closed" would likely become "Leave closed", for example:
https:/
Thanks Martin!
| Martin Pitt (pitti) wrote : | # |
Thanks for the explanations about how this is going to work, it makes much more sense now. I'm still pondering how we could avoid writing the report twice, with two different flags; this is so utterly prone to errors, if I had a dollar for each OSError/IOError and what not race condition I already fixed for those.. There might be half-written files, two parallel instances getting into each other's way, etc.
As we cannot augment a signal with additional data, nor intrude into the killed process' memory to set __abort_msg (that would require PTRACE permissions), the next best thing that comes to my mind is to not write a full report, but just an empty flag file into /var/crash that also contains the pid in the file name, which indicates to the next running apport instance that this was a manual SIGABRT due to a hang. That would be slightly more robust.
Otherwise I do not have a better idea as well.
Thanks!
- 2399. By Evan on 2012-06-28
-
Rework hang reporting. We now drop a PID.hanging file in /var/crash, which the core pipe handler uses to assign a Hang problem type to the report it generates. This then hands back off to apport-gtk as before, but only marks the report for uploading. Restarting the application is now handled as the process is being terminated.
- 2400. By Evan on 2012-06-28
-
Merge with trunk.
- 2401. By Evan on 2012-06-28
-
It's hanging. It might not respond to SIGTERM. Send it SIGKILL instead.
Thanks again for the thorough review. I believe I've addressed the remaining issues by implementing your suggestion. The workflow is now as follows:
- Compiz detects a program is hanging and calls apport --hanging $PID.
- The apport UI is presented with some basic information about the hanging application.
- The user presses either the "Force closed" or "Relaunch" button.
- If the "report" box is checked, the ABRT signal is sent to the process and a /var/crash/
- If the user has pressed "Relaunch" then it is at this point that the process is restarted. This will occur after apport has waited for the previous pid to terminate.
- If the "report" box was checked and thus the ABRT signal was sent to the process, /usr/share/
- When apport-gtk is notified of new .crash files to process, it will check their problem type. For any that are "Hang" reports, it will simply mark the .crash file to be uploaded by whoopsie and then mark it as seen.
I'll write some tests and add support to the KDE and console frontends now. I'm not tied to the naming of anything. So if you think there's a better way of labelling any of these things, do let me know.
Correction. It obviously does not make sense to add console support. Likewise, I am not sure what would be entailed in adding support to KDE's compositor, so I'd rather we revisit KDE frontend support when we have a plan there.
- 2405. By Evan on 2012-06-29
-
Upgrade argv parsing tests with the new '--hanging' option.
- 2406. By Evan on 2012-06-29
-
Unit tests save the day. Fix accidentally treating successful loading of the crash report as a failure.
- 2407. By Evan on 2012-06-29
-
The PID option may not be set and does not have a default value.
I think this is finally ready to go. I've added some unit tests and corrected the new code where it was breaking existing ones. Let me know if you're happy with the result.
Thanks!
| Martin Pitt (pitti) wrote : | # |
Thanks Evan, this workflow seems good to me. I have one more concern:
9 + path = os.path.
Can we still include the executable name and the uid here, for paranoia's sake? There might be race conditions about the user killing the process himself and then some other process starts and grabs the freed PID, or the hang is detected while the machine is shutting down (and thus all *.hanging files are invalid at the next boot). For addressing the latter I'd actually like to have the *.hanging files in /run/ somewhere, but a plausibility test in the UI (like comparing the .hanging mtime against /proc/uptime) would work fine as well.
Otherwise this looks fine to me, thanks!
- 2408. By Evan on 2012-07-03
-
Use the ExecutablePath and uid as part of the .hanging path, to avoid potential race conditions.
- 2409. By Evan on 2012-07-03
-
Silently discard .hanging files that were created before a reboot as the pid is no longer valid.
- 2410. By Evan on 2012-07-03
-
Make the dialog modal for the hanging application when one is provided.
- 2411. By Evan on 2012-07-03
-
Remove double import.
I've made the requested changes and added some extra code to make the apport dialog modal to the hanging application.
| Martin Pitt (pitti) wrote : | # |
Thanks, great work! We need to remember to add a gir1.2-wnck build and binary dependency to the packaging.
Please merge to trunk (if all tests succeed).
| Martin Pitt (pitti) wrote : | # |
Ah, when you merge, please don't forget to commit the merge together with an appropriate NEWS entry.


If compiz detects that it's hanging, it could just abort() and let the usual magic do its work. So I suppose this is primarily interesting for applications which need to ask the user "am I really broken"? Do we actually have an existing case where an application can detect that it is hanging? The only cases I've seen is that the kernel spits out some "process 12345 has not responded in 30 seconds" messages, but in this case it would probably be better to not rely on the application itself to call apport. I'd like to settle this question before we start adopting this rather complicated state machine.
Under the assumption that it indeed makes sense to have applications be able to self-detect that they are hung, but still able to call apport, this seems okay to me. One nitpick is that it would IMHO be better to kill them with SIGABRT.
Thanks for this!