Merge lp:~asac/ubuntu-test-cases/default-systemsettle-test into lp:ubuntu-test-cases/touch

Proposed by Alexander Sack
Status: Merged
Merged at revision: 10
Proposed branch: lp:~asac/ubuntu-test-cases/default-systemsettle-test
Merge into: lp:ubuntu-test-cases/touch
Diff against target: 149 lines (+128/-0)
3 files modified
systemsettle/systemsettle.sh (+117/-0)
systemsettle/tc_control (+10/-0)
tslist.run (+1/-0)
To merge this branch: bzr merge lp:~asac/ubuntu-test-cases/default-systemsettle-test
Reviewer Review Type Date Requested Status
Gema Gomez Pending
Review via email: mp+180004@code.launchpad.net

This proposal supersedes a proposal from 2013-08-13.

Description of the change

Be aware that the tc_control part is untested, while the script itself has been tested. It should be easy to adjust, so please do so during merge/commit.

addressed all previous comments by gema and doanac by:

16. By Alexander Sack 27 seconds ago

    systemsettle: refactor pass/success exit code logic into trap handler

15. By Alexander Sack 4 minutes ago

    systemsettle: improve tc_control action and expected_results wording

14. By Alexander Sack 6 minutes ago

    systemsettle: add run-forever option for utah timeout support and improve toplog formatting

please merge :)

Paul Larson (pwlars) wrote : Posted in a previous version of this proposal

Seems to run ok on my device under utah, I'm guessing the intent of this is to catch if we have a runaway process right? A couple of questions:

+timeout: 720
Any particular reason for 12 minutes timeout?

123 - test: vmstat
124 +- test: systemsettle
125 - test: netstat
Any preference as to where it runs? You seem to have put it somewhere in the middle, but I wasn't sure if there was a reason for that.

Gema Gomez (gema) wrote : Posted in a previous version of this proposal

The test case documentation needs to be somewhat explanatory of what the test case is trying to achieve, rather than talking about what script to run:
108 +action: |
109 + 1. run systemsettle.sh to wait for system to become idle
110 +expected_results: |
111 + 1. run systemsettle.sh succeeds

I was expecting something along the following lines:
action: |
1. Check the CPU load every minute for 10 minutes
expected_results: |
1. The load doesn't exceed X value

Whatever you are trying to actually do, I am not sure my description is accurate either, but you get the idea.

review: Needs Fixing
Alexander Sack (asac) wrote : Posted in a previous version of this proposal

hi,

would be great if you could fix those nits while merging to your own needs.

On Tue, Aug 13, 2013 at 5:55 PM, Gema Gomez
<email address hidden> wrote:
> Review: Needs Fixing
>
> The test case documentation needs to be somewhat explanatory of what the test case is trying to achieve, rather than talking about what script to run:
> 108 +action: |
> 109 + 1. run systemsettle.sh to wait for system to become idle
> 110 +expected_results: |
> 111 + 1. run systemsettle.sh succeeds
>
> I was expecting something along the following lines:
> action: |
> 1. Check the CPU load every minute for 10 minutes
> expected_results: |
> 1. The load doesn't exceed X value
>
> Whatever you are trying to actually do, I am not sure my description is accurate either, but you get the idea.
>
> --
> https://code.launchpad.net/~asac/ubuntu-test-cases/default-systemsettle-test/+merge/179916
> You are the owner of lp:~asac/ubuntu-test-cases/default-systemsettle-test.

Alexander Sack (asac) wrote : Posted in a previous version of this proposal

the purpose of this is to have logic that will wait until the system
has calmed down (settled). It is supposed to be run a) as part of the
default suite and b), as discussed on IRC, later also as a prerequisite
before we start individual test runs (autopilots, benchmarks, whatever).

the 12 minute timeout is tuned to be 2 minutes more than we expect the
run to take using the current defaults set in the script. we basically
give the system 10 minutes at most to settle for now. I guess that's far
too long, so we could reduce it by trial and error to something more
reasonable.
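
As a rough check of the 10 minute figure, here is a minimal sketch that multiplies the loop defaults shown in the preview diff of this proposal (settle_max=10, vmstat_wait=6, vmstat_repeat=10). The helper name settle_budget is made up for illustration and is not part of systemsettle.sh.

```shell
#!/bin/sh
# settle_budget SETTLE_MAX VMSTAT_WAIT VMSTAT_REPEAT
# Upper bound, in seconds, on how long the settle loop can run:
# at most SETTLE_MAX vmstat runs of VMSTAT_REPEAT samples, VMSTAT_WAIT apart.
# (Hypothetical helper, not part of the proposed script.)
settle_budget () {
    echo $(( $1 * $2 * $3 ))
}

settle_budget 10 6 10    # prints 600
```

That gives 600 seconds, so timeout: 720 leaves about 2 minutes of headroom. It is an upper bound: vmstat prints its first report immediately, so each run takes a little less than wait times repeat seconds.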

On Tue, Aug 13, 2013 at 5:42 PM, Paul Larson <email address hidden> wrote:
> Seems to run ok on my device under utah, I'm guessing the intent of this is to catch if we have a runaway process right? A couple of questions:
>
> +timeout: 720
> Any particular reason for 12 minutes timeout?
>
> 123 - test: vmstat
> 124 +- test: systemsettle
> 125 - test: netstat
> Any preference as to where it runs? You seem to have put it somewhere in the middle, but I wasn't sure if there was a reason for that.
> --
> https://code.launchpad.net/~asac/ubuntu-test-cases/default-systemsettle-test/+merge/179916
> You are the owner of lp:~asac/ubuntu-test-cases/default-systemsettle-test.

Andy Doan (doanac) wrote : Posted in a previous version of this proposal

Chris added this to his jenkins setup and it basically works:

 http://142.197.155.43:8080/view/settle/job/settle-saucy-touch-mako-smoke-default/2/console

UTAH failed this test because it never settled (whoopsie was being bad). I see one issue I'd change:

57 +while test `calc $idle_avg '<' $idle_avg_min` = 1 -a "$settle_count" -lt "$settle_max"; do

We already run the test with a timeout of 12 minutes, so the "settle_max" check for the loop shouldn't be needed. However, it looks like settle_max got hit first instead of the timeout, and then the pass/fail logic ran. I think you should:

1) remove settle_max logic
2) move the logic at the very end that determines pass/fail into your cleanup function
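
The second point could look roughly like the sketch below: the verdict is printed from an EXIT handler, so it is reported whichever exit path the body takes. The function name settle_verdict and the settled flag are assumptions made for this illustration; only the two messages are taken from the proposed script.

```shell
#!/bin/sh
# settle_verdict IDLE_AVG IDLE_MIN
# Runs in a subshell whose EXIT handler reports the verdict, so the message
# is printed on every `exit` path of the body. (Hypothetical helper.)
settle_verdict () (
    settled=0
    report () {
        if [ "$settled" -eq 1 ]; then
            echo "system settled. SUCCESS"
        else
            echo "system not settled. FAIL"
        fi
    }
    trap report EXIT

    # awk does the floating-point comparison, as calc() does in the script
    if awk "BEGIN { exit !($1 >= $2) }"; then
        settled=1
        exit 0
    fi
    exit 1
)
```

Because report does not call exit itself, the subshell keeps the exit status chosen by the body (0 or 1).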

Alexander Sack (asac) wrote : Posted in a previous version of this proposal

feel free to make whatever changes are needed to land it. I wrote this code to give folks a head start on getting insight into things like the whoopsie case and more...

Also, I don't really understand what you are suggesting, so I really think it would be best to just make the changes you suggest while merging.

Alexander Sack (asac) wrote : Posted in a previous version of this proposal

oh on the settle_max thing i have no opinion. I just made the script so it makes sense if run without utah.

Alexander Sack (asac) wrote : Posted in a previous version of this proposal

In the test run the console output looks very garbled...

in reality it dumps a nice top so you see which process goes looping

Andy Doan (doanac) wrote : Posted in a previous version of this proposal

On 08/13/2013 03:36 PM, Alexander Sack wrote:
> In the test run the console output looks very garbled...
>
> in reality it dumps a nice top so you see which process goes looping

yeah. it also shows up fine in the UTAH yaml. don't worry about that

Alexander Sack (asac) wrote : Posted in a previous version of this proposal
16. By Alexander Sack

systemsettle: refactor pass/success exit code logic into trap handler

Alexander Sack (asac) wrote :

fwiw, I repushed revision 16 a few times and didn't see a new diff
coming through mail, so please check the web when reviewing for the
real, latest code.

On Tue, Aug 13, 2013 at 11:05 PM, Alexander Sack <email address hidden> wrote:
> Alexander Sack has proposed merging lp:~asac/ubuntu-test-cases/default-systemsettle-test into lp:ubuntu-test-cases/touch.
>
> Requested reviews:
> Gema Gomez (gema)
>
> For more details, see:
> https://code.launchpad.net/~asac/ubuntu-test-cases/default-systemsettle-test/+merge/180004
>
> be aware that the tc_control part is untested, while the script is. should be easy to adjust so please do during merge/commit at best.
>
> addressed all previous comments by gema and doanac by:
>
> 16. By Alexander Sack 27 seconds ago
>
> systemsettle: refactor pass/success exit code logic into trap handler
>
> 15. By Alexander Sack 4 minutes ago
>
> systemsettle: improve tc_control action and expected_results wording
>
> 14. By Alexander Sack 6 minutes ago
>
> systemsettle: add run-forever option for utah timeout support and improve toplog formatting
>
>
> please merge :)
> --
> https://code.launchpad.net/~asac/ubuntu-test-cases/default-systemsettle-test/+merge/180004
> You are the owner of lp:~asac/ubuntu-test-cases/default-systemsettle-test.
>
> === added directory 'systemsettle'
> === added file 'systemsettle/systemsettle.sh'
> --- systemsettle/systemsettle.sh 1970-01-01 00:00:00 +0000
> +++ systemsettle/systemsettle.sh 2013-08-13 21:04:16 +0000
> @@ -0,0 +1,108 @@
> +#!/bin/bash
> +
> +calc () { awk "BEGIN{ print $* }" ;}
> +
> +cleanup () { rm -f $top_log $vmstat_log $vmstat_log.reduced; exit $exit_code;}
> +
> +if test -z "$1"; then
> + echo "ERROR: you need to provide the average idle value"
> + echo "Usage: systemsettle.sh <avg-idle> [run-forever]"
> + echo " - e.g. systemsettle.sh 99.25"
> + echo " - e.g. systemsettle.sh 99.25 run-forever"
> + exit 129
> +fi
> +
> +if test "$2" = "run-forever"; then
> + settle_prefix='-'
> +fi
> +
> +# minimum average idle level required to succeed
> +idle_avg_min=$1
> +
> +# how many total attempts to settle the system
> +settle_max=1
> +
> +# measurement details: vmstat $vmstat_wait $vmstat_repeat
> +vmstat_wait=1
> +vmstat_repeat=10
> +
> +# how many samples to ignore
> +vmstat_ignore=1
> +
> +# exit code storage
> +exit_code=2
> +
> +# tweak cut field by arch
> +if uname -m | grep -q armv7; then
> + idle_pos=16
> +elif uname -m | grep -q i.86; then
> + idle_pos=15
> +else
> + echo "machine \'`uname -m`\' not supported"
> + exit 128
> +fi
> +
> +# set and calc more runtime values
> +vmstat_tail=`calc $vmstat_repeat - $vmstat_ignore`
> +settle_count=0
> +idle_avg=0
> +
> +echo "System Settle run - quiesce the system"
> +echo "--------------------------------------"
> +echo
> +echo " + cmd: \'vmstat $vmstat_wait $vmstat_repeat\' ignoring first $vmstat_ignore (tail: $vmstat_tail)"
> +echo
> +
> +trap cleanup EXIT INT QUIT ILL KILL SEGV TERM
> +vmstat_log=`mktemp -t`
> +top_log=`mktemp -t`
> +
> +while test `calc $idle_avg '<' $idle_avg_min` = 1 -a "$settle_prefix$settle_count" -lt "$settle_max"; do
> + echo Starting settle run $settle...


Andy Doan (doanac) wrote :

This is close, but not quite right. The problem I see is related to the signal handling somehow. Ctrl-C works well, but if I run "kill <pid>" from another terminal, it's really slow to respond, and when it does, it doesn't exit the process. The problem is that when we run this in practice, UTAH is going to send it a SIGTERM when it has timed out and then a SIGKILL. Given that the response to SIGTERM is too slow, the process will just exit with no proper cleanup.

However, that might be okay since it will still exit with a bad return code and show the test as failed?

Alexander Sack (asac) wrote :

no its not okay. we want the top report that it dumps in case of failure...

I believe my initial revision was on the spot :-P ...

SIGTERM takes a while because it doesn't propagate down to vmstat ... you should give SIGTERM more time to finish (you always should if you hope for graceful shutdown anyway) or not run it with "run-forever"
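
One way to make the script itself react faster, sketched here as an assumption rather than as what the proposed script does: a shell only runs a trap handler after the current foreground command has finished, but the wait builtin returns as soon as a trapped signal arrives. Running the long command in the background and waiting on it therefore lets a TERM handler fire right away. The name run_interruptible and its variables are invented for this example.

```shell
#!/bin/sh
# run_interruptible CMD [ARGS...]
# Runs CMD in the background and waits for it. `wait` is interrupted by a
# trapped signal, so the TERM handler runs promptly and can stop CMD.
# (Hypothetical helper, not part of systemsettle.sh.)
run_interruptible () {
    interrupted=0
    child_pid=
    # install the handler before starting the child to avoid a race
    trap 'interrupted=1; [ -n "$child_pid" ] && kill "$child_pid" 2>/dev/null' TERM
    "$@" &
    child_pid=$!
    wait "$child_pid"
    rc=$?
    trap - TERM
    return "$rc"
}
```

In the settle loop this would wrap the vmstat call, for example run_interruptible vmstat 6 10, with the output redirected to the log file as before.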

Alexander Sack (asac) wrote :

OK, I looked into kill and found that to make SIGTERM behave like the SIGINT from ctrl-c (propagate to the whole process group), you have to call kill with a negative PID, which addresses the process group: kill -TERM -1234

So yeah, you should fix it in utah and this is all good as it is ...

btw, ctrl-c sends SIGINT afaik...
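
A minimal sketch of the negative-PID form, assuming a ps that supports the POSIX -o pgid= and -p options. The helper names pgid_of and signal_group are made up for this example and do not come from utah.

```shell
#!/bin/sh
# pgid_of PID: print the process group ID of PID, or nothing if PID is gone.
pgid_of () {
    ps -o pgid= -p "$1" 2>/dev/null | tr -d ' '
}

# signal_group SIGNAL PID: send SIGNAL to every process in PID's group.
# The "--" keeps the negative number from being parsed as an option.
# (Both helpers are hypothetical.)
signal_group () {
    pgid=$(pgid_of "$2")
    [ -n "$pgid" ] || return 1
    kill -s "$1" -- "-$pgid"
}
```

Usage would be something like signal_group TERM "$test_pid". For this to reach only the test and its children, the test has to be the leader of its own process group (for example, started via setsid); otherwise the group also contains whoever launched it.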

Alexander Sack (asac) wrote :

btw, i checked utah and in process.py you already try to kill all children manually as well ... so not sure if that code is buggy or if you didn't try the test in the real utah code ...

in any case, I have pushed an inspirational branch that might work (not tested) that replaces that manual business with OS facilities ...

see: http://bazaar.launchpad.net/~asac/utah/use-os_killpg/revision/996

Preview Diff

=== added directory 'systemsettle'
=== added file 'systemsettle/systemsettle.sh'
--- systemsettle/systemsettle.sh 1970-01-01 00:00:00 +0000
+++ systemsettle/systemsettle.sh 2013-08-13 21:47:11 +0000
@@ -0,0 +1,117 @@
+#!/bin/bash
+
+set -e
+
+# default exit code storage
+dump_error=1
+
+calc () { awk "BEGIN{ print $* }" ;}
+
+cleanup () {
+    if ! test "$dump_error" = 0; then
+        echo "System failed to settle to target idle level ($idle_avg_min)"
+        echo " + check out the following top log taken at each retry:"
+
+        # dumb toplog indented
+        while read line; do
+            echo " $line"
+        done < $top_log
+
+        echo
+        # dont rerun this logic in case we get multiple signals
+        dump_error=0
+    fi
+    rm -f $top_log $vmstat_log $vmstat_log.reduced
+}
+
+if test -z "$1"; then
+    echo "ERROR: you need to provide the average idle value"
+    echo "Usage: systemsettle.sh <avg-idle> [run-forever]"
+    echo " - e.g. systemsettle.sh 99.25"
+    echo " - e.g. systemsettle.sh 99.25 run-forever"
+    exit 129
+fi
+
+if test "$2" = "run-forever"; then
+    settle_prefix='-'
+fi
+
+# minimum average idle level required to succeed
+idle_avg_min=$1
+
+# how many total attempts to settle the system
+settle_max=10
+
+# measurement details: vmstat $vmstat_wait $vmstat_repeat
+vmstat_wait=6
+vmstat_repeat=10
+
+# how many samples to ignore
+vmstat_ignore=1
+
+# tweak cut field by arch
+if uname -m | grep -q armv7; then
+    idle_pos=16
+elif uname -m | grep -q i.86; then
+    idle_pos=15
+else
+    echo "machine \'`uname -m`\' not supported"
+    exit 128
+fi
+
+# set and calc more runtime values
+vmstat_tail=`calc $vmstat_repeat - $vmstat_ignore`
+settle_count=0
+idle_avg=0
+
+echo "System Settle run - quiesce the system"
+echo "--------------------------------------"
+echo
+echo " + cmd: \'vmstat $vmstat_wait $vmstat_repeat\' ignoring first $vmstat_ignore (tail: $vmstat_tail)"
+echo
+
+trap cleanup EXIT INT QUIT ILL KILL SEGV TERM
+vmstat_log=`mktemp -t`
+top_log=`mktemp -t`
+
+while test `calc $idle_avg '<' $idle_avg_min` = 1 -a "$settle_prefix$settle_count" -lt "$settle_max"; do
+    echo Starting settle run $settle_count:
+
+    # get vmstat
+    vmstat $vmstat_wait $vmstat_repeat | tee $vmstat_log
+    cat $vmstat_log | tail -n $vmstat_tail > $vmstat_log.reduced
+
+    # log top output for potential debugging
+    echo "TOP DUMP (after settle run: $settle_count)" >> $top_log
+    echo "========================" >> $top_log
+    top -n 1 -b >> $top_log
+    echo >> $top_log
+
+    # calc average of idle field for this measurement
+    sum=0
+    count=0
+    while read line; do
+        idle=`echo $line | sed -e 's/\s\s*/ /g' | cut -d ' ' -f 15`
+        sum=`calc $sum + $idle`
+        count=`calc $count + 1`
+    done < $vmstat_log.reduced
+
+    idle_avg=`calc $sum.0 / $count.0`
+    settle_count=`calc $settle_count + 1`
+
+    echo
+    echo "Measurement:"
+    echo " + idle level: $idle_avg"
+    echo " + idle sum: $sum / count: $count"
+    echo
+done
+
+if test `calc $idle_avg '<' $idle_avg_min` = 1; then
+    echo "system not settled. FAIL"
+    exit 1
+else
+    echo "system settled. SUCCESS"
+    dump_error=0
+    exit 0
+fi
+

=== added file 'systemsettle/tc_control'
--- systemsettle/tc_control 1970-01-01 00:00:00 +0000
+++ systemsettle/tc_control 2013-08-13 21:47:11 +0000
@@ -0,0 +1,10 @@
+description: check if system settles to idle average > 99.25%
+dependencies: none
+action: |
+    1. Take CPU load samples for 10 minutes and fail if average idle never goes above 99.25% percent
+expected_results: |
+    1. When doing nothing, system calms down to at least 99.25% idle level
+type: userland
+timeout: 720
+command: ./systemsettle.sh 99.25 run-forever
+run_as: root

=== modified file 'tslist.run'
--- tslist.run 2013-06-17 20:59:34 +0000
+++ tslist.run 2013-08-13 21:47:11 +0000
@@ -1,6 +1,7 @@
 - test: pwd
 - test: uname
 - test: vmstat
+- test: systemsettle
 - test: netstat
 - test: ifconfig
 - test: route
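
One note on the averaging loop in systemsettle.sh: the diff sets idle_pos per architecture, but the loop cuts field 15 regardless. The sketch below is one possible alternative, not the merged code: it looks up the "id" column in vmstat's header row and averages that column with awk, which splits on runs of whitespace by itself. The function names idle_field and avg_field are hypothetical.

```shell
#!/bin/sh
# idle_field FILE: print the index of the "id" column in vmstat's header row.
idle_field () {
    awk '{ for (i = 1; i <= NF; i++) if ($i == "id") { print i; exit } }' "$1"
}

# avg_field FIELD FILE SKIP: average column FIELD over the numeric rows of
# FILE, ignoring the first SKIP of them (vmstat's first sample is the
# average since boot). Header rows are skipped because their first field
# is not a number. (Both helpers are hypothetical.)
avg_field () {
    awk -v col="$1" -v skip="$3" '
        $1 ~ /^[0-9]+$/ && ++row > skip { sum += $col; n++ }
        END { if (n) printf "%.2f\n", sum / n; else exit 1 }
    ' "$2"
}
```

With the names used in the script, the call would be along the lines of avg_field "$(idle_field "$vmstat_log")" "$vmstat_log" "$vmstat_ignore", which would also make the separate .reduced file unnecessary.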
