Please add a "time since last activity" clock to the running bar of updating output

Bug #874270 reported by Paul Sokolovsky
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
LAVA Scheduler (deprecated)
Won't Fix
Wishlist
Unassigned

Bug Description

http://validation.linaro.org/lava-server/scheduler/job/2583 - this one finished with several timeout error messages, taking 3-4 hours to finish. Conclusions: timeouts there are two big.

http://validation.linaro.org/lava-server/scheduler/job/2616 - this one appears to be hanging, but in real-time now. It would be nice some means of timeness indicatation in log - for example, add log (realworld) timestamps before important events like reboots. Or, taking into account there's there's already AJAX, etc. advancety there, maybe it could show "last log output received 5m 2s ago".

Revision history for this message
Paul Sokolovsky (pfalcon) wrote :

P.S. Last output from 2nd job I see is:

[ 3.859863] init: cannot find '/system/bin/faketsd', disabling 'faketsd'

[ 3.866973] init: cannot find '/system/bin/uim', disabling 'uim'

[ 3.873321] init: cannot find '/system/bin/gatord', disabling 'gatord'

[ 3.902923] adb_open

[ 3.995422] omap_device: omap_hsmmc.0: new worst case deactivate latency 0: 30517

Revision history for this message
Paul Larson (pwlars) wrote :

Hmm, the default timeout is one hour, and the part that waits to see the images boot is using the default timeout, so I'm not sure why it should have taken 3-4 hours. Certainly this should probably be tweaked in the dispatcher - I will open a separate dispatcher bug for that. There are several things like that, but getting the timing just right is tricky so we tend to be generous more often than not. For this bug I'd like to focus around the feature request of adding a wait time to the running bar.

Changed in lava-scheduler:
importance: Undecided → Wishlist
summary: - Jobs hang frequently, w/o indication and for long time
+ Please add a "time since last activity" clock to the running bar of
+ updating output
Changed in lava-scheduler:
milestone: none → backlog
status: New → Triaged
Revision history for this message
Paul Sokolovsky (pfalcon) wrote :

I saw around 3 timeout errors, so that might explain 3 hours. ("3 or 4" is due to lp:861149).

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

This should be fairly simple as I think we can just use the mtime of the log file to determine when the last output was (should test a bit to check this is accurate enough of course) and just include the "time since last output" as another header in the ajax response sent to the output tab.

Revision history for this message
Yongqin Liu (liuyq0307) wrote :

For this problem, the process did as below:
1. boot_linaro_android_image action was successfull

2. executed action lava_android_test_install
     a. use wait_home_screen method as one step to check android status
         this step spent 1 hour, use the default timeout value 3600s

3. executed action lava_android_test_run 0xbench
     a. use wait_home_screen method as one step to check android status
         this step spent 1 hour, use the default timeout value 3600s

4. executed action lava_android_test_run monkey
     a. use wait_home_screen method as one step to check android status
         this step spent 1 hour, use the default timeout value 3600s

So all the time spent on timeout is 3hours.

within method wait_home_screen of android_client.py,
it uses [self.proc.expect('stopped')] to check the android status, because the timeout is not specified here, so it use the timeout of proc(is set to 3600s).
for this problem, specified the timeout to 2s or specified the timeout to None is ok I think.

And there was a android bug: after 2 hours, the value of "getprop init.svc.bootanim" still was "running".
I report it as https://bugs.launchpad.net/linaro-android/+bug/880810

Revision history for this message
Yongqin Liu (liuyq0307) wrote :

for job http://validation.linaro.org/lava-server/scheduler/job/2616, it took 1 hour.
because the boot_linaro_android_image action in file boot_control.py calls boot_linaro_android_image method in file android_client.py.
and the second boot_linaro_android_image action use self.proc.expect("#") to check the android boot status.
for this point(for this is the boot process), maybe the 1 hour timeout is a little acceptable.

Revision history for this message
Spring Zhang (qzhang) wrote :

In job 2616, there may be a timeout parameter lack in boot_linaro_image(), and in job 2583, it is obviously caused by no timeout for match_id = self.proc.expect('stopped') in wait_home_screen, we need to fix such losing. Thanks Yongqin's analysis

Fathi Boudra (fboudra)
Changed in lava-scheduler:
milestone: backlog → none
Alan Bennett (akbennett)
Changed in lava-scheduler:
status: Triaged → Won't Fix
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.