Merge lp:~asac/ubuntu-test-cases/default-systemsettle-test into lp:ubuntu-test-cases/touch
Status: | Merged |
---|---|
Merged at revision: | 10 |
Proposed branch: | lp:~asac/ubuntu-test-cases/default-systemsettle-test |
Merge into: | lp:ubuntu-test-cases/touch |
Diff against target: | 149 lines (+128/-0) 3 files modified: systemsettle/systemsettle.sh (+117/-0), systemsettle/tc_control (+10/-0), tslist.run (+1/-0) |
To merge this branch: | bzr merge lp:~asac/ubuntu-test-cases/default-systemsettle-test |
Related bugs: |
Reviewer | Review Type | Date Requested | Status |
---|---|---|---|
Gema Gomez | Pending | ||
Review via email: mp+180004@code.launchpad.net |
This proposal supersedes a proposal from 2013-08-13.
Commit message
Description of the change
be aware that the tc_control part is untested, while the script itself is tested. it should be easy to adjust, so please do so during merge/commit.
addressed all previous comments by gema and doanac by:
16. By Alexander Sack 27 seconds ago
systemsettle: refactor pass/success exit code logic into trap handler
15. By Alexander Sack 4 minutes ago
systemsettle: improve tc_control action and expected_results wording
14. By Alexander Sack 6 minutes ago
systemsettle: add run-forever option for utah timeout support and improve toplog formatting
please merge :)
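Paul Larson (pwlars) wrote : Posted in a previous version of this proposal | # |
Seems to run ok on my device under utah. I'm guessing the intent of this is to catch if we have a runaway process, right? A couple of questions:
+timeout: 720
Any particular reason for 12 minutes timeout?
123 - test: vmstat
124 +- test: systemsettle
125 - test: netstat
Any preference as to where it runs? You seem to have put it somewhere in the middle, but I wasn't sure if there was a reason for that.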
Gema Gomez (gema) wrote : Posted in a previous version of this proposal | # |
The test case documentation needs to be somewhat explanatory of what the test case is trying to achieve, rather than talking about what script to run:
108 +action: |
109 + 1. run systemsettle.sh to wait for system to become idle
110 +expected_results: |
111 + 1. run systemsettle.sh succeeds
I was expecting something along the following lines:
action: |
1. Check the CPU load every minute for 10 minutes
expected_results: |
1. The load doesn't exceed X value
Whatever you are trying to actually do, I am not sure my description is accurate either, but you get the idea.
Alexander Sack (asac) wrote : Posted in a previous version of this proposal | # |
hi,
would be great if you could fix those nits while merging to your own needs.
On Tue, Aug 13, 2013 at 5:55 PM, Gema Gomez
<email address hidden> wrote:
> Review: Needs Fixing
>
> The test case documentation needs to be somewhat explanatory of what the test case is trying to achieve, rather than talking about what script to run:
> 108 +action: |
> 109 + 1. run systemsettle.sh to wait for system to become idle
> 110 +expected_results: |
> 111 + 1. run systemsettle.sh succeeds
>
> I was expecting something along the following lines:
> action: |
> 1. Check the CPU load every minute for 10 minutes
> expected_results: |
> 1. The load doesn't exceed X value
>
> Whatever you are trying to actually do, I am not sure my description is accurate either, but you get the idea.
>
> --
> https:/
> You are the owner of lp:~asac/ubuntu-test-cases/default-systemsettle-test.
Alexander Sack (asac) wrote : Posted in a previous version of this proposal | # |
the purpose of this is to have logic that will wait until the system
has calmed down (settled). It is supposed to be run a) as part of the
default suite and as discussed on IRC later also as a prereq before we
start individual test runs (autopilots, benchmarks, whatever).
the 12 minute timeout is tuned to be 2 minutes more than we expect the
run to take using the current defaults set in the script. we basically
give the system 10 minutes at max to settle for now. I guess that's far
too long, so we could reduce it by trial and error to something more
reasonable.
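The timeout reasoning above can be checked with a little arithmetic. This is only a rough sketch: settle_max, vmstat_wait and vmstat_repeat use the values from the merged diff, 720 is the tc_control timeout, and the names utah_timeout, worst_case and margin are made up for this illustration. It estimates sampling time only and ignores the time spent in top and awk.

```shell
#!/bin/bash
# Values copied from the merged systemsettle.sh and tc_control.
settle_max=10        # maximum number of settle runs
vmstat_wait=6        # seconds between vmstat samples
vmstat_repeat=10     # samples per settle run
utah_timeout=720     # tc_control "timeout:" in seconds (hypothetical variable name)

# Rough bound on total sampling time: runs * samples * seconds per sample.
worst_case=$((settle_max * vmstat_wait * vmstat_repeat))
margin=$((utah_timeout - worst_case))

echo "worst case: ${worst_case}s, margin before timeout: ${margin}s"
```

With these defaults the script gives up after about 10 minutes, which leaves roughly 2 minutes before the harness timeout.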
On Tue, Aug 13, 2013 at 5:42 PM, Paul Larson <email address hidden> wrote:
> Seems to run ok on my device under utah, I'm guessing the intent of this is to catch if we have a runaway process right? A couple of questions:
>
> +timeout: 720
> Any particular reason for 12 minutes timeout?
>
> 123 - test: vmstat
> 124 +- test: systemsettle
> 125 - test: netstat
> Any preference as to where it runs? You seem to have put it somewhere in the middle, but I wasn't sure if there was a reason for that.
> --
> https:/
> You are the owner of lp:~asac/ubuntu-test-cases/default-systemsettle-test.
Andy Doan (doanac) wrote : Posted in a previous version of this proposal | # |
Chris added this to his jenkins setup and it basically works:
http://
UTAH failed this test because it never settled (whoopsie was being bad). I see one issue I'd change:
57 +while test `calc $idle_avg '<' $idle_avg_min` = 1 -a "$settle_count" -lt "$settle_max"; do
We already run the test with a timeout of 12 minutes, so the "settle_max" check for the loop shouldn't be needed. However, it looks like settle_max got hit first instead of the timeout, and then the pass/fail logic ran. I think you should:
1) remove the settle_max logic
2) move the logic at the very end that determines pass/fail into your cleanup function
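The second suggestion, deciding pass/fail inside the cleanup handler, could look roughly like the sketch below. It is not the code from this branch: run_with_report, report and failed are hypothetical names, and the pattern shown is simply an EXIT trap that replays captured output unless a success flag was set.

```shell
#!/bin/bash
# Hypothetical helper (not part of systemsettle.sh): run a command in a
# subshell, capture its output, and replay that output only on failure.
run_with_report () (
    log=$(mktemp)
    failed=1                     # assume failure until proven otherwise

    report () {
        if [ "$failed" -ne 0 ]; then
            echo "FAIL: captured output follows"
            sed 's/^/    /' "$log"
        fi
        rm -f "$log"
    }
    trap report EXIT             # runs on every exit path of the subshell

    "$@" > "$log" 2>&1 || exit 1
    failed=0                     # only reached when the command succeeded
)

run_with_report sh -c 'echo runaway process; exit 3'
echo "exit status: $?"
```

Because the report is produced by the trap, it is emitted whether the subshell ends through the normal path or through an early exit.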
Alexander Sack (asac) wrote : Posted in a previous version of this proposal | # |
feel free to make the changes that need to happen to land it. I wrote this code to give folks a head start on getting insight into things like the whoopsie case and more...
I also don't really understand what you're saying, so I really think it would be cool to just change what you suggest while merging.
Alexander Sack (asac) wrote : Posted in a previous version of this proposal | # |
oh on the settle_max thing i have no opinion. I just made the script so it makes sense if run without utah.
Alexander Sack (asac) wrote : Posted in a previous version of this proposal | # |
In the test run the console output looks very garbled...
in reality it dumps a nice top so you see which process goes looping
Andy Doan (doanac) wrote : Posted in a previous version of this proposal | # |
On 08/13/2013 03:36 PM, Alexander Sack wrote:
> In the test run the console output looks very garbled...
>
> in reality it dumbs a nice top so you see which process goes looping
yeah. it also shows up fine in the UTAH yaml. don't worry about that
Alexander Sack (asac) wrote : Posted in a previous version of this proposal | # |
addressed stuff and resubmitted: https:/
16. By Alexander Sack
systemsettle: refactor pass/success exit code logic into trap handler
Alexander Sack (asac) wrote : | # |
fwiw, I repushed revision 16 a few times. I didn't see a new diff
coming through by mail, so please check the web when reviewing for the
real, latest code.
On Tue, Aug 13, 2013 at 11:05 PM, Alexander Sack <email address hidden> wrote:
> Alexander Sack has proposed merging lp:~asac/ubuntu-test-cases/default-systemsettle-test into lp:ubuntu-test-cases/touch.
>
> Requested reviews:
> Gema Gomez (gema)
>
> For more details, see:
> https:/
>
> be aware that the tc_control part is untested, while the script is. should be easy to adjust so please do during merge/commit at best.
>
> addressed all previous comments by gema and doanac by:
>
> 16. By Alexander Sack 27 seconds ago
>
> systemsettle: refactor pass/success exit code logic into trap handler
>
> 15. By Alexander Sack 4 minutes ago
>
> systemsettle: improve tc_control action and expected_results wording
>
> 14. By Alexander Sack 6 minutes ago
>
> systemsettle: add run-forever option for utah timeout support and improve toplog formatting
>
>
> please merge :)
> --
> https:/
> You are the owner of lp:~asac/ubuntu-test-cases/default-systemsettle-test.
>
> === added directory 'systemsettle'
> === added file 'systemsettle/
> --- systemsettle/
> +++ systemsettle/
> @@ -0,0 +1,108 @@
> +#!/bin/bash
> +
> +calc () { awk "BEGIN{ print $* }" ;}
> +
> +cleanup () { rm -f $top_log $vmstat_log $vmstat_
> +
> +if test -z "$1"; then
> + echo "ERROR: you need to provide the average idle value"
> + echo "Usage: systemsettle.sh <avg-idle> [run-forever]"
> + echo " - e.g. systemsettle.sh 99.25"
> + echo " - e.g. systemsettle.sh 99.25 run-forever"
> + exit 129
> +fi
> +
> +if test "$2" = "run-forever"; then
> + settle_prefix='-'
> +fi
> +
> +# minimum average idle level required to succeed
> +idle_avg_min=$1
> +
> +# how many total attempts to settle the system
> +settle_max=1
> +
> +# measurement details: vmstat $vmstat_wait $vmstat_repeat
> +vmstat_wait=1
> +vmstat_repeat=10
> +
> +# how many samples to ignore
> +vmstat_ignore=1
> +
> +# exit code storage
> +exit_code=2
> +
> +# tweak cut field by arch
> +if uname -m | grep -q armv7; then
> + idle_pos=16
> +elif uname -m | grep -q i.86; then
> + idle_pos=15
> +else
> + echo "machine \'`uname -m`\' not supported"
> + exit 128
> +fi
> +
> +# set and calc more runtime values
> +vmstat_tail=`calc $vmstat_repeat - $vmstat_ignore`
> +settle_count=0
> +idle_avg=0
> +
> +echo "System Settle run - quiesce the system"
> +echo "------
> +echo
> +echo " + cmd: \'vmstat $vmstat_wait $vmstat_repeat\' ignoring first $vmstat_ignore (tail: $vmstat_tail)"
> +echo
> +
> +trap cleanup EXIT INT QUIT ILL KILL SEGV TERM
> +vmstat_log=`mktemp -t`
> +top_log=`mktemp -t`
> +
> +while test `calc $idle_avg '<' $idle_avg_min` = 1 -a "$settle_
> + echo Starting settle run $settle...
Andy Doan (doanac) wrote : | # |
this is close, but not quite right. The problem I see is related to the signal handling somehow. Ctrl-C works well, but if I run "kill <pid>" from another terminal, it's really slow to respond, and when it does, it doesn't exit the process. The problem is that when we run this in practice, UTAH is going to send it a SIGTERM when it has timed out and then a SIGKILL. Given that it responds to the SIGTERM too slowly, the process will just exit with no proper cleanup.
However, that might be okay since it will still exit with a bad return code and show the test as failed?
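One possible explanation for the slow response, offered as an assumption rather than a confirmed diagnosis: bash does not run a trap while a foreground command is still executing, but the wait builtin returns immediately when a trapped signal arrives. The sketch below shows that pattern with "sleep 300" standing in for the long vmstat call; worker and on_term are hypothetical names.

```shell
#!/bin/bash
# Sketch only: "sleep 300" stands in for the long-running vmstat call.
worker () {
    on_term () {
        echo "TERM received, cleaning up"
        kill "$child" 2>/dev/null    # stop the measurement command too
        exit 143                     # 128 + 15 (SIGTERM)
    }
    trap on_term TERM

    sleep 300 &                      # run the command in the background ...
    child=$!
    wait "$child"                    # ... so the trap can interrupt the wait
}

SECONDS=0
worker &
pid=$!
sleep 1                              # give the worker time to install its trap
kill -TERM "$pid"
wait "$pid"
status=$?
elapsed=$SECONDS
echo "worker exited with $status after ${elapsed}s"
```

The worker stops within about a second of the signal instead of waiting for the 300 second command to finish.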
Alexander Sack (asac) wrote : | # |
no, it's not okay. we want the top report that it dumps in case of failure...
I believe my initial revision was on the spot :-P ...
SIGTERM takes a while because it doesn't propagate down to vmstat ... you should give SIGTERM more time to finish (you always should if you hope for graceful shutdown anyway) or not run it with "run-forever"
Alexander Sack (asac) wrote : | # |
OK, I looked up kill and found that in order to make SIGTERM behave like the SIGINT from ctrl-c (propagate to the whole process group) you would have to send kill with a negative PID (the process group ID): kill -TERM -1234
So yeah, you should fix it in utah and this is all good as it is ...
btw, ctrl-c sends SIGINT afaik...
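A small sketch of the negative-PID form of kill, under some assumptions: it relies on setsid from util-linux to give the stand-in job its own process group, and the sh command with two sleeps is a made-up substitute for a test script and its children. It is not what UTAH does.

```shell
#!/bin/bash
# Stand-in for a test script with children: one shell plus two sleeps.
# setsid puts them in a new session, so the shell's PID is also the
# process group ID of all three processes.
setsid sh -c 'sleep 30 & sleep 30 & wait' &
pgid=$!

sleep 1                          # give setsid time to create the group
kill -TERM -- "-$pgid"           # negative ID: signal every process in the group
wait "$pgid"
status=$?
echo "group leader exited with status $status"
```

A plain "kill -TERM $pgid" would signal only the shell and leave the sleeps running, which matches the behaviour described above.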
Alexander Sack (asac) wrote : | # |
btw, I checked utah, and in process.py you already try to kill all children manually as well ... so I'm not sure if that code is buggy or if you didn't try the test in the real utah code ...
in any case, I have pushed an inspirational branch that might work (not tested) that replaces that manual business with OS facilities ...
see: http://
Preview Diff
1 | === added directory 'systemsettle' |
2 | === added file 'systemsettle/systemsettle.sh' |
3 | --- systemsettle/systemsettle.sh 1970-01-01 00:00:00 +0000 |
4 | +++ systemsettle/systemsettle.sh 2013-08-13 21:47:11 +0000 |
5 | @@ -0,0 +1,117 @@ |
6 | +#!/bin/bash |
7 | + |
8 | +set -e |
9 | + |
10 | +# default exit code storage |
11 | +dump_error=1 |
12 | + |
13 | +calc () { awk "BEGIN{ print $* }" ;} |
14 | + |
15 | +cleanup () { |
16 | + if ! test "$dump_error" = 0; then |
17 | + echo "System failed to settle to target idle level ($idle_avg_min)" |
18 | + echo " + check out the following top log taken at each retry:" |
19 | + |
20 | + # dumb toplog indented |
21 | + while read line; do |
22 | + echo " $line" |
23 | + done < $top_log |
24 | + |
25 | + echo |
26 | + # dont rerun this logic in case we get multiple signals |
27 | + dump_error=0 |
28 | + fi |
29 | + rm -f $top_log $vmstat_log $vmstat_log.reduced |
30 | +} |
31 | + |
32 | +if test -z "$1"; then |
33 | + echo "ERROR: you need to provide the average idle value" |
34 | + echo "Usage: systemsettle.sh <avg-idle> [run-forever]" |
35 | + echo " - e.g. systemsettle.sh 99.25" |
36 | + echo " - e.g. systemsettle.sh 99.25 run-forever" |
37 | + exit 129 |
38 | +fi |
39 | + |
40 | +if test "$2" = "run-forever"; then |
41 | + settle_prefix='-' |
42 | +fi |
43 | + |
44 | +# minimum average idle level required to succeed |
45 | +idle_avg_min=$1 |
46 | + |
47 | +# how many total attempts to settle the system |
48 | +settle_max=10 |
49 | + |
50 | +# measurement details: vmstat $vmstat_wait $vmstat_repeat |
51 | +vmstat_wait=6 |
52 | +vmstat_repeat=10 |
53 | + |
54 | +# how many samples to ignore |
55 | +vmstat_ignore=1 |
56 | + |
57 | +# tweak cut field by arch |
58 | +if uname -m | grep -q armv7; then |
59 | + idle_pos=16 |
60 | +elif uname -m | grep -q i.86; then |
61 | + idle_pos=15 |
62 | +else |
63 | + echo "machine \'`uname -m`\' not supported" |
64 | + exit 128 |
65 | +fi |
66 | + |
67 | +# set and calc more runtime values |
68 | +vmstat_tail=`calc $vmstat_repeat - $vmstat_ignore` |
69 | +settle_count=0 |
70 | +idle_avg=0 |
71 | + |
72 | +echo "System Settle run - quiesce the system" |
73 | +echo "--------------------------------------" |
74 | +echo |
75 | +echo " + cmd: \'vmstat $vmstat_wait $vmstat_repeat\' ignoring first $vmstat_ignore (tail: $vmstat_tail)" |
76 | +echo |
77 | + |
78 | +trap cleanup EXIT INT QUIT ILL KILL SEGV TERM |
79 | +vmstat_log=`mktemp -t` |
80 | +top_log=`mktemp -t` |
81 | + |
82 | +while test `calc $idle_avg '<' $idle_avg_min` = 1 -a "$settle_prefix$settle_count" -lt "$settle_max"; do |
83 | + echo Starting settle run $settle_count: |
84 | + |
85 | + # get vmstat |
86 | + vmstat $vmstat_wait $vmstat_repeat | tee $vmstat_log |
87 | + cat $vmstat_log | tail -n $vmstat_tail > $vmstat_log.reduced |
88 | + |
89 | + # log top output for potential debugging |
90 | + echo "TOP DUMP (after settle run: $settle_count)" >> $top_log |
91 | + echo "========================" >> $top_log |
92 | + top -n 1 -b >> $top_log |
93 | + echo >> $top_log |
94 | + |
95 | + # calc average of idle field for this measurement |
96 | + sum=0 |
97 | + count=0 |
98 | + while read line; do |
99 | + idle=`echo $line | sed -e 's/\s\s*/ /g' | cut -d ' ' -f 15` |
100 | + sum=`calc $sum + $idle` |
101 | + count=`calc $count + 1` |
102 | + done < $vmstat_log.reduced |
103 | + |
104 | + idle_avg=`calc $sum.0 / $count.0` |
105 | + settle_count=`calc $settle_count + 1` |
106 | + |
107 | + echo |
108 | + echo "Measurement:" |
109 | + echo " + idle level: $idle_avg" |
110 | + echo " + idle sum: $sum / count: $count" |
111 | + echo |
112 | +done |
113 | + |
114 | +if test `calc $idle_avg '<' $idle_avg_min` = 1; then |
115 | + echo "system not settled. FAIL" |
116 | + exit 1 |
117 | +else |
118 | + echo "system settled. SUCCESS" |
119 | + dump_error=0 |
120 | + exit 0 |
121 | +fi |
122 | + |
123 | |
124 | === added file 'systemsettle/tc_control' |
125 | --- systemsettle/tc_control 1970-01-01 00:00:00 +0000 |
126 | +++ systemsettle/tc_control 2013-08-13 21:47:11 +0000 |
127 | @@ -0,0 +1,10 @@ |
128 | +description: check if system settles to idle average > 99.25% |
129 | +dependencies: none |
130 | +action: | |
131 | + 1. Take CPU load samples for 10 minutes and fail if average idle never goes above 99.25% percent |
132 | +expected_results: | |
133 | + 1. When doing nothing, system calms down to at least 99.25% idle level |
134 | +type: userland |
135 | +timeout: 720 |
136 | +command: ./systemsettle.sh 99.25 run-forever |
137 | +run_as: root |
138 | |
139 | === modified file 'tslist.run' |
140 | --- tslist.run 2013-06-17 20:59:34 +0000 |
141 | +++ tslist.run 2013-08-13 21:47:11 +0000 |
142 | @@ -1,6 +1,7 @@ |
143 | - test: pwd |
144 | - test: uname |
145 | - test: vmstat |
146 | +- test: systemsettle |
147 | - test: netstat |
148 | - test: ifconfig |
149 | - test: route |