Merge lp:~martin-lp/hipl/n900-build-fix into lp:hipl

Proposed by David Martin
Status: Merged
Approved by: René Hummen
Approved revision: 6083
Merged at revision: 6084
Proposed branch: lp:~martin-lp/hipl/n900-build-fix
Merge into: lp:hipl
Diff against target: 14 lines (+4/-0)
1 file modified
test/lib/core/hostid.c (+4/-0)
To merge this branch: bzr merge lp:~martin-lp/hipl/n900-build-fix
Reviewer Review Type Date Requested Status
René Hummen Approve
Diego Biurrun Approve
Christof Mroz Approve
Review via email: mp+78551@code.launchpad.net

This proposal supersedes a proposal from 2011-09-22.

Description of the change

Our check framework has a default timeout of 4 seconds for tests
which is not sufficient for key generation in the scratchbox
environment or on the N900. As a result the unit test and package
build fails there.
This branch sets the timeout to 120 seconds which seems to be
sufficient for current tests.

Failures while building the N900 package or running the unit tests
on the respective environment look like this:

debug(test/lib/core/hostid.c:74@test_serialize_deserialize_rsa_: Trying to serialize and deserialize some RSA keys.
debug(lib/core/crypto.c:768@create_rsa_key): *****************Creating RSA of 1024 bits
debug(lib/core/crypto.c:768@create_rsa_key): *****************Creating RSA of 2048 bits
Killed

With the changes in this branch it should not happen anymore.

Purpose of the review:
All good now? Everybody happy with how it's solved?
It's cleaner than a #define for sure. :)

To post a comment you must log in.
Revision history for this message
Miika Komu (miika-iki) wrote : Posted in a previous version of this proposal

Please show the "backtrace" of gdb.

Revision history for this message
Stefan Götz (stefan.goetz-deactivatedaccount) wrote : Posted in a previous version of this proposal
Revision history for this message
René Hummen (rene-hummen) wrote : Posted in a previous version of this proposal

I also get abort with 1024 bit keys. David can you please post some more details on what you already checked.

Revision history for this message
David Martin (martin-lp) wrote : Posted in a previous version of this proposal

Hi,

On Thu, Sep 22, 2011 at 5:35 PM, Diego Biurrun <email address hidden> wrote:
> review reject
>
> I think you should really give this another try and dig deeper. Find out
> which function is failing with some printfs, debug some more.

like I said it's RSA_generate_key() from the OpenSSL library which fails.

> You could also try to update the check framework in scratchbox and/or OpenSSL.

This may be worth a try, I'll have a look.

>> --- test/lib/core/hostid.c 2011-09-05 08:46:53 +0000
>> +++ test/lib/core/hostid.c 2011-09-22 11:59:39 +0000
>> @@ -64,10 +64,20 @@
>> +/*
>> + * create_rsa_key() fails for key sizes greater than 1024 on both the
>> + * N900 and the respective scratchbox environment.
>> + * Use HAVE_TCASE_ADD_EXIT_TEST as a kludge as it is not defined for
>> + * the version of check on the scratchbox environment.
>> + */
>> +#ifdef HAVE_TCASE_ADD_EXIT_TEST
>> + int bits[3] = { 1024, 2048, 3072 };
>> +#else
>> + int bits[1] = { 1024 };
>> +#endif
>> + RSA *key = NULL, *key_deserialized = NULL;
>> + EVP_PKEY *key_a = NULL, *key_b = NULL;
>> unsigned char *keyrr;
>> struct hip_host_id_priv hostid;
>
> Not a pretty sight ;)

It wasn't supposed to win any beauty contests and that's why I ask
for a better way to deal with it. :p

On Thu, Sep 22, 2011 at 7:14 PM, Miika Komu <email address hidden> wrote:
> Please show the "backtrace" of gdb.

There's no real backtrace as it is openssl's function that fails.
Nevertheless here is the output:

> debug(test/lib/core/crypto.c:242@test_invalid_impl_ecdsa_sign_v: Successfully passed test on lowlevel sign, verify operations with invalid inputs.
> lib/core/hostid
> debug(test/lib/core/hostid.c:74@test_serialize_deserialize_rsa_: Trying to serialize and deserialize some RSA keys.
> debug(lib/core/crypto.c:768@create_rsa_key): *****************Creating RSA of 1024 bits
>
>
> debug(lib/core/crypto.c:769@create_rsa_key): before generating key
> debug(lib/core/crypto.c:771@create_rsa_key): after generating key
> debug(lib/core/crypto.c:768@create_rsa_key): *****************Creating RSA of 2048 bits
>
>
> debug(lib/core/crypto.c:769@create_rsa_key): before generating key
>
> Program terminated with signal SIGKILL, Killed.
> The program no longer exists.
> (gdb) bt
> No stack.

(before and after generating key are debug prints added by me in the lines before
and after the RSA_generate_key() call in /lib/core/crypto.c create_rsa_key())

On Fri, Sep 23, 2011 at 7:52 PM, Stefan Götz <email address hidden> wrote:
> Maybe https://bugs.launchpad.net/hipl/+bug/788799 plays into this?

I doubt it. :) It does not seem to be a problem with our code but with
OpenSSL or maybe our use of OpenSSL.

Revision history for this message
David Martin (martin-lp) wrote : Posted in a previous version of this proposal

Hi,

On Mon, Sep 26, 2011 at 4:32 PM, Christof Mroz <email address hidden> wrote:
> On Mon, 26 Sep 2011 15:56:24 +0200, David Martin
> <email address hidden> wrote:
>
>>> Program terminated with signal SIGKILL, Killed.
>>> The program no longer exists.
>>> (gdb) bt
>>> No stack.
>
> Can you dump the rlimits inside the test? See getrlimit(2).

Not sure but I'll have a look at it.

> If all else fails, you may try building OpenSSL yourself with debug symbols
> so you can circumscript the culprit even further (or learn ASM :) ). Also,
> did you try strace yet?

Nope, didn't try strace yet. Thanks for the hint. Building OpenSSL is my current
plan as our version of OpenSSL already is the most recent from the Maemo repos.

Revision history for this message
Diego Biurrun (diego-biurrun) wrote : Posted in a previous version of this proposal

On Mon, Sep 26, 2011 at 01:56:24PM +0000, David Martin wrote:
>
> On Thu, Sep 22, 2011 at 5:35 PM, Diego Biurrun <email address hidden> wrote:
> > review reject
> >
> > I think you should really give this another try and dig deeper. Find out
> > which function is failing with some printfs, debug some more.
>
> like I said it's RSA_generate_key() from the OpenSSL library which fails.

As a next step, print out the values that we pass to that function, maybe
some are invalid, NULL or whatever. If pointers are passed, dereference
them and print their values, maybe they point at invalid values.

Diego

Revision history for this message
David Martin (martin-lp) wrote : Posted in a previous version of this proposal

Hi,
did some more research on this and did actually find something.

On Tue, Sep 27, 2011 at 9:51 AM, Diego Biurrun <email address hidden> wrote:
> On Mon, Sep 26, 2011 at 01:56:24PM +0000, David Martin wrote:
>>
>> On Thu, Sep 22, 2011 at 5:35 PM, Diego Biurrun <email address hidden> wrote:
>> > review reject
>> >
>> > I think you should really give this another try and dig deeper. Find out
>> > which function is failing with some printfs, debug some more.
>>
>> like I said it's RSA_generate_key() from the OpenSSL library which fails.
>
> As a next step, print out the values that we pass to that function, maybe
> some are invalid, NULL or whatever. If pointers are passed, dereference
> them and print their values, maybe they point at invalid values.

We don't pass any pointers (well, NULL) and our values are fine. :)

On Mon, Sep 26, 2011 at 4:32 PM, Christof Mroz <email address hidden> wrote:
> Can you dump the rlimits inside the test? See getrlimit(2).

Any specific rlimit you had in mind? RLIMIT_CPU for cpu time does not seem to be
set and limiting it by hand (eg. to a second and choosing a stronger key) results
in a graceful exit with the respective signal (cpu time exceeded).

On Mon, Sep 26, 2011 at 4:32 PM, Christof Mroz <email address hidden> wrote:
> If all else fails, you may try building OpenSSL yourself with debug symbols
> so you can circumscript the culprit even further (or learn ASM :) ). Also,
> did you try strace yet?

Did not try the newer OpenSSL yet but strace is neat and helped to narrow down
the problem.

Both the N900 and the version compiled on passion set a SIGALRM with rt_sigaction
(which I think is responsible for killing it).

The N900 binary in addition sets a timer to 3 seconds with setitimer. This one
sends the SIGALRM when the timer reaches zero and this is what happens before
it breaks.

I'm not sure yet who is responsible for setting the timer. Is this platform
specific? I think Android does something like this where a process gets killed
if it's unresponsive for more than a few seconds.

Should I upload the traces? I don't see a button for this in launchpad (I'm pretty
sure to have done that before, though).

Revision history for this message
David Martin (martin-lp) wrote : Posted in a previous version of this proposal

Hi,
seems like our problem was with the check framework which sets a default timeout
of 4 seconds for tests. I'll test which limit works and request another review
of the merge request.

On Wed, Oct 5, 2011 at 6:06 PM, Christof Mroz <email address hidden> wrote:
> On Wed, 05 Oct 2011 16:44:24 +0200, David Martin
> <email address hidden> wrote:
>
>> I'm not sure yet who is responsible for setting the timer.
>
> Then set a BP on setitimer() in gdb, instruct it to BT and CONT on each hit.
> This could give you a rough idea who set the timer (even more useful with a
> debug build of openssl, of course).
>
> You can also try to ignore the timer using LD_PRELOAD and a crafted SO.
> IIRC, just create a C file with your fake setitimer() implementation
> (signature must match, adding a debug print statement is a good idea),
> compile with -fpic and -shared and preload the resulting binary to see if it
> keeps getting killed.

Dang, you come up with the craziest debugging ideas! ;) Think that's not
necessary anymore.

>> Is this platform specific? I think Android does something like this where
>> a process gets killed if it's unresponsive for more than a few seconds.
>
> That's a very good idea, did you try adding a sleep(INT_MAX) to confirm
> this?

Using sleep was a good idea. It's not platform specific but the check framework
which has timeouts for the tests. If you force it to run longer on passion it
exits gracefully with a proper message:

> test/lib/core/hostid.c:39:E:Core:test_serialize_deserialize_rsa_keys:0: (after this point) Test timeout expired

Seems like the check version on the scratchbox environment is not that convenient
and simply breaks.

Revision history for this message
Diego Biurrun (diego-biurrun) wrote :

 review needs-info

On Fri, Oct 07, 2011 at 09:20:31AM +0000, David Martin wrote:
>
> --- test/lib/core/hostid.c 2011-09-05 08:46:53 +0000
> +++ test/lib/core/hostid.c 2011-10-07 09:19:23 +0000
> @@ -65,7 +65,7 @@
> START_TEST(test_serialize_deserialize_rsa_keys)
> {
> unsigned int i, keyrr_len = 0;
> - int bits[3] = {1024, 2048, 3072};
> + int bits[3] = { 1024, 2048, 3072 };
> RSA *key = NULL, *key_deserialized = NULL;

This you can push anytime.

> @@ -126,6 +126,11 @@
>
> TCase *tc_core = tcase_create("Core");
> + // the default test timeout of 4 seconds is too short,
> + // generating keys in scratchbox or on the N900 takes
> + // a while

nit: no need to break the second line and /* */ comments are preferred

> + tcase_set_timeout(tc_core, 120);

I wonder if we should not set this in a slightly more global place.

Diego

review: Needs Information
Revision history for this message
David Martin (martin-lp) wrote :

Hi,

On Fri, Oct 7, 2011 at 11:28 AM, Diego Biurrun <email address hidden> wrote:

> On Fri, Oct 07, 2011 at 09:20:31AM +0000, David Martin wrote:
>>
>> --- test/lib/core/hostid.c 2011-09-05 08:46:53 +0000
>> +++ test/lib/core/hostid.c 2011-10-07 09:19:23 +0000
>> @@ -65,7 +65,7 @@
>> START_TEST(test_serialize_deserialize_rsa_keys)
>> {
>> unsigned int i, keyrr_len = 0;
>> - int bits[3] = {1024, 2048, 3072};
>> + int bits[3] = { 1024, 2048, 3072 };
>> RSA *key = NULL, *key_deserialized = NULL;
>
> This you can push anytime.

kk, will do so later.

>> @@ -126,6 +126,11 @@
>>
>> TCase *tc_core = tcase_create("Core");
>> + // the default test timeout of 4 seconds is too short,
>> + // generating keys in scratchbox or on the N900 takes
>> + // a while
>
> nit: no need to break the second line and /* */ comments are preferred

will fix that.

>> + tcase_set_timeout(tc_core, 120);
>
> I wonder if we should not set this in a slightly more global place.

Well, we can set it in a global variable (see the manual
http://check.sourceforge.net/doc/check_html/check_4.html#SEC19 )

The key generation seems to be the exception though with taking such
a long time.

Revision history for this message
Christof Mroz (christof-mroz) wrote :

I wouldn't set a global time limit: this way we can still catch poorly performing functions in other tests. So I'm fine with the new approach.

review: Approve
Revision history for this message
Christof Mroz (christof-mroz) wrote :

Semi-related: is there a trivial way to change the "killed" message, as printed in case of a timeout, to something more helpful? Without modifying the check framework, I mean.

lp:~martin-lp/hipl/n900-build-fix updated
6083. By David Martin

Cosmetics: Use /* */ for multiline comment in hostid unit test.

Revision history for this message
David Martin (martin-lp) wrote :

On Fri, Oct 7, 2011 at 1:21 PM, Christof Mroz <email address hidden> wrote:
> I wouldn't set a global time limit: this way we can still catch poorly performing functions in other tests. So I'm fine with the new approach.

Sounds reasonable.

> Semi-related: is there a trivial way to change the "killed" message, as printed in case of a timeout, to something more helpful? Without modifying the check framework, I mean.

To be honest, no idea. Should I set up a bug report for it and attach the strace logs?

On Fri, Oct 7, 2011 at 11:28 AM, Diego Biurrun <email address hidden> wrote:
> On Fri, Oct 07, 2011 at 09:20:31AM +0000, David Martin wrote:

>> + tcase_set_timeout(tc_core, 120);
>
> I wonder if we should not set this in a slightly more global place.

Any more thoughts on this? I've fixed the cosmetics and it would be nice to get
this fix into trunk.

Revision history for this message
Diego Biurrun (diego-biurrun) wrote :

 review approve

On Mon, Oct 10, 2011 at 12:20:38PM +0000, David Martin wrote:
> On Fri, Oct 7, 2011 at 1:21 PM, Christof Mroz <email address hidden> wrote:
> > I wouldn't set a global time limit: this way we can still catch poorly performing functions in other tests. So I'm fine with the new approach.
>
> Sounds reasonable.
>
> On Fri, Oct 7, 2011 at 11:28 AM, Diego Biurrun <email address hidden> wrote:
> > On Fri, Oct 07, 2011 at 09:20:31AM +0000, David Martin wrote:
>
> >> + tcase_set_timeout(tc_core, 120);
> >
> > I wonder if we should not set this in a slightly more global place.
>
> Any more thoughts on this? I've fixed the cosmetics and it would be nice to get
> this fix into trunk.

I tend to suspect that we may run into this problem in other places in
the future, so I'm inclined to set this (more) globally, but I don't
much mind either way. Push in whichever way you prefer.

Diego

review: Approve
Revision history for this message
René Hummen (rene-hummen) wrote :

Please add a bug to the bugtracker for future reference. Thanks to everyone for solving this.

review: Approve
Revision history for this message
Christof Mroz (christof-mroz) wrote :

On Mon, 10 Oct 2011 14:35:06 +0200, Diego Biurrun <email address hidden> wrote:

>>>> + tcase_set_timeout(tc_core, 120);
>>>
>>> I wonder if we should not set this in a slightly more global place.
>>
>> Any more thoughts on this? I've fixed the cosmetics and it would be
>> nice to get
>> this fix into trunk.
>
> I tend to suspect that we may run into this problem in other places in
> the future, so I'm inclined to set this (more) globally, but I don't
> much mind either way.

As a compromise, we could do that the next time we encounter a timeout on
the n900. But with a lower timeout, since 120 seconds are excessive even
on a phone for most operations. In this case (if I see it correctly),
multiple tests were bundled into one which stretched the runtime even
further.

Revision history for this message
David Martin (martin-lp) wrote :

Hi,

On Mon, Oct 10, 2011 at 3:52 PM, Christof Mroz <email address hidden> wrote:
> On Mon, 10 Oct 2011 14:35:06 +0200, Diego Biurrun <email address hidden> wrote:
>
>>>>> + tcase_set_timeout(tc_core, 120);
>>>>
>>>> I wonder if we should not set this in a slightly more global place.
>>>
>>> Any more thoughts on this? I've fixed the cosmetics and it would be
>>> nice to get
>>> this fix into trunk.
>>
>> I tend to suspect that we may run into this problem in other places in
>> the future, so I'm inclined to set this (more) globally, but I don't
>> much mind either way.
>
> As a compromise, we could do that the next time we encounter a timeout on
> the n900.

Think so too.

> But with a lower timeout, since 120 seconds are excessive even
> on a phone for most operations. In this case (if I see it correctly),
> multiple tests were bundled into one which stretched the runtime even
> further.

It's just the key generation that takes so long, the rest is more or less
done instantly. The 1024 bit key takes a moment, 2048 bit takes a while
and 3072 bit already feels like ages on the N900 (long enough that you start
to wonder if it crapped out on you ;)). Not sure if it's the limited CPU (600MHz?)
or the OpenSSL implementation though, maybe a bit of both.

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'test/lib/core/hostid.c'
2--- test/lib/core/hostid.c 2011-10-10 11:37:16 +0000
3+++ test/lib/core/hostid.c 2011-10-10 12:08:13 +0000
4@@ -126,6 +126,10 @@
5 Suite *s = suite_create("lib/core/hostid");
6
7 TCase *tc_core = tcase_create("Core");
8+ /* the default test timeout of 4 seconds is too short,
9+ * generating keys in scratchbox or on the N900 takes a while */
10+ tcase_set_timeout(tc_core, 120);
11+
12 tcase_add_test(tc_core, test_serialize_deserialize_rsa_keys);
13 tcase_add_test(tc_core, test_serialize_deserialize_keys);
14 tcase_add_test(tc_core, test_invalid_serialize_deserialize);

Subscribers

People subscribed via source and target branches

to all changes: