Merge lp:~mbp/bzr/220464-stale-locks into lp:bzr

Proposed by Martin Pool
Status: Merged
Approved by: Martin Pool
Approved revision: no longer in the source branch.
Merged at revision: 5970
Proposed branch: lp:~mbp/bzr/220464-stale-locks
Merge into: lp:bzr
Diff against target: 1034 lines (+488/-135)
10 files modified
bzrlib/commit.py (+0/-1)
bzrlib/config.py (+7/-4)
bzrlib/help_topics/en/configuration.txt (+8/-0)
bzrlib/lockdir.py (+246/-97)
bzrlib/osutils.py (+27/-0)
bzrlib/tests/__init__.py (+7/-3)
bzrlib/tests/test_lockdir.py (+139/-27)
bzrlib/ui/__init__.py (+5/-3)
bzrlib/win32utils.py (+38/-0)
doc/en/release-notes/bzr-2.4.txt (+11/-0)
To merge this branch: bzr merge lp:~mbp/bzr/220464-stale-locks
Reviewer Review Type Date Requested Status
John A Meinel Approve
Martin Packman (community) Approve
Andrew Bennetts Approve
Review via email: mp+62582@code.launchpad.net

Commit message

optionally detect and steal dead locks from the same machine and user (bug 220464)

Description of the change

bzr automatically detects stale locks originating from other processes on the same machine.

This is a somewhat old branch, but I think it's actually all finished, aside from not handling detection of live/dead processes on Windows. There was a list thread about that but it looks a bit nontrivial, so I'd like to handle that separately.

I think the config variable needs to be renamed to match our current convention.

Revision history for this message
Martin Pool (mbp) wrote :

A few more points here:

* In general, this won't apply to hpss smart server locks, because they are supposed to be held by the server process. (Doing so should make it more likely they'll be cleaned up if the client abruptly disconnects.) In that case you see something like this:

mbp@grace% ./bzr push
Using saved push location: lp:~mbp/bzr/220464-stale-locks
Unable to obtain lock held by <email address hidden>
at crowberry [process #27519], acquired 0 seconds ago.
See "bzr help break-lock" for more.
bzr: ERROR: Could not acquire lock "(remote lock)": bzr+ssh://bazaar.launchpad.net/~mbp/bzr/220464-stale-locks/

* We could do an interactive "do you want to break this" ui but that's out of scope.

* In related bug 257217 I'm quite inclined to say a crash-only design will be safer and simpler: if something tries to kill bzr just let it happen and we'll break the lock next time.

* I should test this interactively too....

Revision history for this message
Martin Pool (mbp) wrote :

This is a little tricky to hit interactively from the command line, but you can get it pretty well from the Python shell:

>>> b.lock_write()
Unable to obtain lock file:///home/mbp/bzr/work/ held by Martin Pool <email address hidden>
at grace [process #7658], acquired 42 seconds ago.
Will continue to try until 10:35:52, unless you press Ctrl-C.
See "bzr help break-lock" for more.
^CTraceback (most recent call last):
....
KeyboardInterrupt

now release lock from another window

>>> b.lock_write()
Stole lock file:///home/mbp/bzr/work/.bzr/branches/220464-stale-locks/.bzr/branch/lock from dead process LockHeldInfo({'nonce': u'gxkq1gpt3t36aketelkt', 'start_time': u'1306456480', 'hostname': u'grace', 'pid': u'7658', 'user': u'Martin Pool <email address hidden>'}).
BranchWriteLockResult(0uqc649u5zeziqwzni06, <bound method BzrBranch7.unlock of BzrBranch7(file:///home/mbp/bzr/work/.bzr/branches/220464-stale-locks/)>)

One thing that demonstrates is that it should probably print the dead process info in a somewhat more baked form...

Revision history for this message
Andrew Bennetts (spiv) wrote :

I like this a lot. You seem to have taken a lot of care with the various possible cases, and you've done some nice refactoring and tidying too.

My only real question is whether it's appropriate for is_lock_holder_known_dead to raise if os.kill raises something other than ESRCH or EPERM? On one hand other errnos shouldn't happen, at least on Linux, but even if they do, aborting the bzr command entirely is probably a bit unfriendly to the user. Emitting a warning and then returning False might be better?
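For reference, the os.kill-based probe under discussion can be sketched like this (a sketch only; the helper name is illustrative, and the fail-safe handling of unexpected errnos follows Andrew's suggestion rather than any particular bzrlib revision):

```python
import errno
import os


def is_local_pid_dead(pid):
    """Return True only when we are certain the process is gone.

    Sending signal 0 checks existence and permissions without actually
    delivering a signal to the target process.
    """
    try:
        os.kill(pid, 0)
    except OSError as e:
        if e.errno == errno.ESRCH:
            # No such process: certainly dead.
            return True
        # EPERM means the pid exists but belongs to someone else; any
        # other errno is unexpected, so fail safe and treat it as alive.
        return False
    else:
        # Signal 0 was delivered: the process is alive.
        return False
```

Note that returning False on an unexpected errno means bzr merely declines to steal the lock, which is the safe direction to fail in.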

Apart from that I just have unimportant pedantry:

In test_auto_break_stale_lock:

646 + self.assertRaises(errors.LockBroken,
647 + l1.unlock)

That would fit comfortably on just one line :)

Ditto in test_auto_break_stale_lock_configured_off.

713 +* Information about held lockdir locks returned from eg `LockDir.peek` is
714 + now represented as a `LockHeldInfo` object, rather than a plain Python
715 + dict.

I think to be pedantically correct ReST you want double-backticks for an inline literal. Single backticks means interpreted text with the default role, which I think is unset in our sphinx conf. Hmm, I suppose it would be reasonable to set it to something that automatically links to our API docs, I think pydoctor provides an “api” role for this…

review: Approve
Revision history for this message
Andrew Bennetts (spiv) wrote :

I agree btw that the output isn't particularly pretty. Something formatted more like the usual LockContention output would be nicer. But I'd much rather have this with ugly output than not have it at all.

Revision history for this message
Martin Pool (mbp) wrote :

> I like this a lot. You seem to have taken a lot of care with the various
> possible cases, and you've done some nice refactoring and tidying too.

Thanks!

> My only real question is if it's appropriate for is_lock_holder_known_dead to
> raise if os.kill raises something other than ESRCH or EPERM? On one hand
> other errnos shouldn't happen, at least on Linux, but even if they do aborting
> the bzr command entirely is probably a bit unfriendly to the user. Emitting a
> warning and then returning False might be better?

Historically we have not tended to do that; I think Rob argued that these warnings are not normally going to be checked by tests and so could either cause false passes, or could have latent problems where the warning is emitted all the time.

However in this case I think users would almost certainly not thank us for having bzr stop there unnecessarily. If we do want to test it, there are ways.

> I think to be pedantically correct ReST you want double-backticks for an
> inline literal. Single backticks means interpreted text with the default
> role, which I think is unset in our sphinx conf. Hmm, I suppose it would be
> reasonable to set it to something that automatically links to our API docs, I
> think pydoctor provides an “api” role for this…

I think we should do that. Using double backticks for shell or computer output, and single for API references seems like the best way to spend our backtick budget. (Unless we want to get all docbooky and specifically mark roles on computer input and output, but that seems too longwinded.)

Revision history for this message
Martin Packman (gz) wrote :

See <lp:~gz/bzr/220464-stale-locks> for some changes built on this branch. It includes fixes for some trivial things, and an implementation (well, two really, the pywin32 one needs testing) of process deadness detection for windows.

The current os.kill detection gets broken pretty thoroughly by Python 2.7 which includes a function of that name on windows that doesn't behave anything like the posix one. Currently my branch doesn't have any fallback (`lambda pid: False` would be reasonable) as posix+windows should cover everything.

review: Needs Fixing
Revision history for this message
John A Meinel (jameinel) wrote :


On 05/27/2011 02:06 AM, Martin Pool wrote:
> Martin Pool has proposed merging lp:~mbp/bzr/220464-stale-locks into lp:bzr.
>
> Requested reviews:
> bzr-core (bzr-core)
> Related bugs:
> Bug #220464 in Bazaar: "Bazaar doesn't detect its own stale locks"
> https://bugs.launchpad.net/bzr/+bug/220464
>
> For more details, see:
> https://code.launchpad.net/~mbp/bzr/220464-stale-locks/+merge/62582
>
> bzr automatically detects stale locks originating from other processes on the same machine.
>
> This is a somewhat old branch, but I think it's actually all finished, aside from not handling detection of live/dead processes on Windows. There was a list thread about that but it looks a bit nontrivial, so I'd like to handle that separately.
>
> I think the config variable needs to be renamed to match our current convention.

Unfortunately on newer Pythons os.kill *is* available on windows, but
doesn't work like you want it to. So we probably need a "sys.platform ==
'windows'" check as well.
+ if getattr(os, 'kill', None) is None:
+ # Probably not available on Windows.
+ # XXX: How should we check for process liveness there?
+ return False
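A platform check along the lines John suggests might look like the following sketch (note that sys.platform on Windows is actually 'win32', not 'windows'; the Windows branch here is just the conservative `lambda pid: False` fallback gz mentions, not bzrlib's eventual implementation):

```python
import errno
import os
import sys


def _posix_is_local_pid_dead(pid):
    # os.kill(pid, 0) probes for existence without sending a signal.
    try:
        os.kill(pid, 0)
    except OSError as e:
        return e.errno == errno.ESRCH
    return False


if sys.platform == 'win32':
    # Python 2.7 grew an os.kill on Windows with very different
    # semantics, so checking getattr(os, 'kill', None) is not a
    # reliable guard.  Until a real OpenProcess-based check exists,
    # fall back to "never certain it's dead".
    is_local_pid_dead = lambda pid: False
else:
    is_local_pid_dead = _posix_is_local_pid_dead
```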

 review: needsfixing

Otherwise it looks good to me.

John
=:->

review: Needs Fixing
Revision history for this message
John A Meinel (jameinel) wrote :


On 05/27/2011 04:29 AM, Martin [gz] wrote:
> Review: Needs Fixing
> See <lp:~gz/bzr/220464-stale-locks> for some changes built on this branch. It includes fixes for some trivial things, and an implementation (well, two really, the pywin32 ones needs testing) of process deadness detection for windows.
>
> The current os.kill detection gets broken pretty thoroughly by Python 2.7 which includes a function of that name on windows that doesn't behave anything like the posix one. Currently my branch doesn't have any fallback (`lambda pid: False` would be reasonable) as posix+windows should cover everything.
>
>

I thought that was also in python2.6. Regardless, it needs to be done
differently.

John
=:->

Revision history for this message
Martin Pool (mbp) wrote :

@jam: gz kindly provided what looks like a working Windows version of a factored-out is_local_pid_dead, so I think we're now ok there. Please let me know if you spot anything else.

Revision history for this message
Martin Pool (mbp) wrote :

I'd like some rereview on this, at least just so John can look at gz's windows code.

Revision history for this message
Robert Collins (lifeless) wrote :

A small question - without reading the diff (sorry, but it seemed a
little large).

Will this result in Launchpad smart servers thinking a lock is held by
a dead process, when there are two smart server *machines* with one
backend store mounted over (NFS|OCFS) and the active process is on a
different machine?

-Rob

Revision history for this message
Martin Pool (mbp) wrote :

On 3 June 2011 18:56, Robert Collins <email address hidden> wrote:
> A small question - without reading the diff (sorry, but it seemed a
> little large).
>
> Will this result in Launchpad smart servers thinking a lock is held by
> a dead, when there are two smart server *machines* with one backend
> store mounted over (NFS|OCFS) and the active process is on a different
> machine?

Only if they have the same hostname, which they shouldn't.

More of a worry is if someone manages to have a client machine with the
same hostname as the ssh server, and the names are stored
not-fully-qualified, and therefore they get confused about identity.

Perhaps we need something stronger than that. We could put a random
per-user key into their home directory and into the lock file, though
that would be a larger change.
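To illustrate, the per-user key idea might amount to something like this (purely hypothetical; no such lock-nonce file or helper exists in bzr today):

```python
import os


def get_user_lock_nonce(config_dir):
    """Return a stable random per-user key, creating it on first use.

    Hypothetical sketch: the key would also be written into lock info
    files, so that a matching hostname alone is never trusted when
    deciding a lock holder is local.
    """
    path = os.path.join(config_dir, 'lock-nonce')
    try:
        with open(path) as f:
            return f.read().strip()
    except IOError:
        # First use: generate and persist a random key.
        nonce = os.urandom(16).hex()
        with open(path, 'w') as f:
            f.write(nonce)
        return nonce
```

As noted, this is imperfect when the home directory itself is on NFS and shared between machines.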

Revision history for this message
Robert Collins (lifeless) wrote :

The impact of a mistake could be pretty bad, so perhaps we should look
at something stronger before doing this?

Revision history for this message
Martin Packman (gz) wrote :

As windows and posix should cover everything, I think this is fine now.

> I'd like some rereview on this, at least just so John can look at gz's windows
> code.

I'd like that too, was coded a little too late at night.

A few notes on it:

* I did a ctypes and pywin32 version out of habit. Now we're 2.6 only, there's not much point in having pywin32 code around (apart from where it's easier to spell).
* As Martin changed the posix version to never raise an exception, should the windows versions change too? Really, I think I'd prefer the caller to catch and log rather than the function, but pywin32 makes that annoying by using a different exception type.
* OpenProcess requests the PROCESS_TERMINATE process access right. It might make more sense to ask for PROCESS_QUERY_INFORMATION or something else.
* Could add code later to get the process executable name too with GetProcessImageFileName, as we have the handle open anyway. Being able to say there's a running bzr.exe holding the lock rather than some other process that's inherited the pid may be useful.

review: Approve
Revision history for this message
Vincent Ladeuil (vila) wrote :

sent to pqm by email

Revision history for this message
Martin Pool (mbp) wrote :

On 3 June 2011 19:34, Robert Collins <email address hidden> wrote:
> The impact of a mistake could be pretty bad, so perhaps we should look
> at something stronger before doing this?

I wrote some background about other types of locking improvements we
can do, but we probably all know that. This patch is just about
trying to detect locks belonging to processes that are sure to be dead.
As I said above, the main risk is that we'll see a lock that actually
comes from another machine and assume it's local. It is a little hard
because on unix there is no obvious guaranteed-unique machine
identifier.

Things we could do to guard against that:

0- don't use this approach at all
1- put in the host default ip (also not guaranteed unique; likely to
often be seen as 127.0.0.1; also arguably a privacy problem to send
this to lp)
2- disable this on the smart server; punt back to the client to decide
if the process is live or not (maybe a good idea anyhow)
3- make up a unique per-user nonce; store it in .bazaar; put it into
lock files (not a perfect solution for home on NFS for instance)
4- don't automatically break it; just ask the user if they want to do
so, or point out that it probably is dead
5- grab some other random data that might both be stable across
multiple processes on a single machine, and across reboots, and also
unique to machines (this does exist on windows, and might be available
in Python)

Revision history for this message
John A Meinel (jameinel) wrote :


On 6/6/2011 12:02 PM, Vincent Ladeuil wrote:
> The proposal to merge lp:~mbp/bzr/220464-stale-locks into lp:bzr has been updated.
>
> Status: Approved => Needs review
>
> For more details, see:
> https://code.launchpad.net/~mbp/bzr/220464-stale-locks/+merge/62582

Vila, you changed this to needs-review after approving it and sending it
to pqm. Did it bounce?

I'll try to take a look at it tomorrow, but my WinAPI isn't all that strong.

I do think gz's comment about access to kill vs access to get info is
relevant. Specifically, if User A does the push, User B cannot kill
their processes. Would that lead to us thinking the process is already
dead, and unlocking things underneath them?

John
=:->
 review: needsinfo

review: Needs Information
Revision history for this message
John A Meinel (jameinel) wrote :


On 6/3/2011 11:34 AM, Robert Collins wrote:
> The impact of a mistake could be pretty bad, so perhaps we should look
> at something stronger before doing this?
>

I think the potential for trouble is moderate here, mostly because of
the lifetimes of locks. If two people race on the *repository* lock, you
could get strange corruption. (Both feel they uploaded their content,
but only one of them ends up with their file mentioned in pack-names.)

That, however is meant to be a very fast lock, and shouldn't be opened
by VFS anymore. (I'm sure it could still happen, but we probably would
be more productive fixing those cases rather than spending too much time
on the locking code.)

If a Branch/WT lock gets out of sync, it should be a lot easier to
recover. The locks there seem a bit more advisory. (I'm pushing to this
branch, so don't try to push to it until I'm done.) Because the actual
mutations should be pretty close to atomic. (overwrite
.bzr/branch/last_revision, etc.)

gz saying that we may not be able to detect other people's processes as
being alive is worrying.

Having the shared host be named "ubuntu" and your local host be named
"ubuntu" could be a problem. Would it be reasonable as a first step to
only support local host transports? (So we auto-unlock
file://path/to/branch, but not bzr+ssh://remote-host/path/to/branch.)

I specifically mention "ubuntu" because I believe all ubuntu installs
default to naming the machine "ubuntu" and someone has to actively
change the machine name. (May not be as true anymore.) I think some of
this stems from using a live-install where the machine needs a name
before actually installing.

What about a compromise, and auto-unlocking local objects only, until we
get something like the unique random id in .bazaar, etc. (Note that you
could still get bitten if someone copies their .bazaar folder...)

John
=:->

Revision history for this message
Vincent Ladeuil (vila) wrote :

>>>>> John Arbash Meinel <email address hidden> writes:

<snip/>

    > Vila, you changed this to needs-review after approving it and sending it
    > to pqm. Did it bounce?

Yes (mail forwarded) and poolie said on IRC he wanted more feedback on
it before landing anyway.

Revision history for this message
Robert Collins (lifeless) wrote :

On Mon, Jun 6, 2011 at 9:52 PM, Martin Pool <email address hidden> wrote:
> On 3 June 2011 19:34, Robert Collins <email address hidden> wrote:
>> The impact of a mistake could be pretty bad, so perhaps we should look
>> at something stronger before doing this?
>
> I wrote some background about other types of locking improvements we
> can do, but we probably all know that.  This patch is just about
> trying to detect locks belong to processes that are sure to be dead.
> As I said above, the main risk is that we'll see a lock actually from
> another machine and assume it's actually local.  It is a little hard
> because there is on unix no obvious guaranteed-unique machine
> identifier.
>
> Things we could do to guard against that:
>
> 0- don't use this approach at all
> 1- put in the host default ip (also not guaranteed unique; likely to
> often be seen as 127.0.0.1; also arguably a privacy problem to send
> this to lp)
> 2- disable this on the smart server; punt back to the client to decide
> if the process is live or not (maybe a good idea anyhow)
> 3- make up a unique per-user nonce; store it in .bazaar; put it into
> lock files (not a perfect solution for home on NFS for instance)
> 4- don't automatically break it; just ask the user if they want to do
> so, or point out that it probably is dead
> 5- grab some other random data that might both be stable across
> multiple processes on a single machine, and across reboots, and also
> unique to machines (this does exist on windows, and might be available
> in Python)

Right. So I'm worried about one of these cases incorrectly happening
to Launchpad:
 - Launchpad smart server 1 thinking smart server 2 is dead
 - Launchpad smart server 1 thinking sftp client 2 is dead
 - sftp client 1 thinking Launchpad smart server 2 is dead
 - sftp client 1 thinking sftp client 2 is dead

However, these are sorted in what I think is most likely frequency of
use - that is, we use the smart server a lot.

Having this go wrong on a repository would be pretty bad. Having it go
wrong on a branch (which has a greater window of opportunity per
John's mail) would probably be less destructive.

For the first case, where Launchpad is running both smart servers, we
can configure a smart server nonce so the servers can detect each
other, if thats how you'd like to prevent misdetected dead processes.

Or we can use a network aware lock inside the data centre (e.g. in
memcache/a db/ ...).

-Rob

Revision history for this message
Martin Pool (mbp) wrote :

On 7 June 2011 06:03, Robert Collins <email address hidden> wrote:
> Right. So I'm worried about one of these cases incorrectly happening
> to Launchpad:
>  - Launchpad smart server 1 thinking smart server 2 is dead
>  - Launchpad smart server 1 thinking sftp client 2 is dead
>  - sftp client 1 thinking Launchpad smart server 2 is dead
>  - sftp client 1 thinking sftp client 2 is dead
>
> However, These are sorted in what I think is most likely frequency of
> use - that is we use the smart server a lot.
>
> Having this go wrong on a repository would be pretty bad. Having it go
> wrong on a branch (which has a greater window of opportunity per
> John's mail) would probably be less destructive.
>
> For the first case, where Launchpad is running both smart servers, we
> can configure a smart server nonce so the servers can detect each
> other, if thats how you'd like to prevent misdetected dead processes.
>
> Or we can use a network aware lock inside the data centre (e.g. in
> memcache/a db/ ...).

When the smart server takes a lock, it puts its own (ie the server
side) hostname and pid into the lock file. (I checked in the source
and by taking a lock on lp.) So the first case should never happen
unless Canonical IS starts giving multiple servers the same name, or
there is some other kind of bizarre split-brain case.

hostname: crowberry
nonce: x4bgo6vp5iixo3o8sl4o
pid: 16838
start_time: 1307421845
user: <email address hidden>

I don't think using a network aware lock is especially useful; backing
the whole repository storage on something other than a disk might be
but I don't see much reason to move just the locks.

The second two cases can potentially happen if someone configures
their machine to have the same hostname as one of the Launchpad
servers, and they access the repo over sftp (or nosmart ssh). This is
probably unlikely considering how obscure Canonical's host names are
but it's possible.

I'd like to use the fqdn but I can't see an easy reliable way to get
it in Python. (It really wants to say localhost.localdomain and
that's just making things worse.) We could put this in configuration
but that will take some work to roll out to lp.

The last case should only happen when someone has configured two
machines with the same hostname and they're both accessing the same
branch at the same time, which is kind of asking for trouble.

We need to bear in mind that at the moment people are encouraged to
break locks by hand and they may well make worse decisions than bzr
can.

One easy step would be to check the user name too; I'll try that now.
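Putting the pieces together, the staleness test sketched in this thread amounts to something like the following (illustrative only; the real logic is LockHeldInfo.is_lock_holder_known_dead, and the explicit parameters here are just for testability):

```python
def lock_holder_known_dead(info, local_host, local_user, pid_is_dead):
    """Decide whether the lock holder described by ``info`` is surely dead.

    ``info`` is a dict shaped like the lock file shown above
    ({'hostname': ..., 'pid': ..., 'user': ...}); ``pid_is_dead`` is a
    callable such as a platform-specific liveness probe.
    """
    if info.get('hostname') != local_host:
        # Lock taken on a different machine: we can't judge its pid.
        return False
    if info.get('hostname') == 'localhost':
        # An unconfigured hostname proves nothing about locality.
        return False
    if info.get('user') != local_user:
        # Matching the user guards against colliding host names.
        return False
    try:
        pid = int(info.get('pid'))
    except (TypeError, ValueError):
        return False
    return pid_is_dead(pid)
```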

Revision history for this message
Andrew Bennetts (spiv) wrote :

Martin Pool wrote:
[...]
> We need to bear in mind that at the moment people are encouraged to
> break locks by hand and they may well make worse decisions than bzr
> can.

Especially in the presence of a smart server, where it's not clear to
users whether the server or the client “owns” the lock. Confusingly,
the client does (via the nonce handed back from the server) but it's the
server's details that appear in the lock info. I've certainly seen
users confused about this on IRC.

I suppose ideally the lock info generated by the server would indicate
“on behalf of bzr smart client from 1.2.3.4:9876” or something like
that. I think LP's server already tweaks BZR_EMAIL to at least use the
LP username rather than codehosting system user.

Revision history for this message
Martin Pool (mbp) wrote :

It does seem to use the user's email address which I guess comes from
the lp account.

Also, I think lock breaking at the moment is done over vfs and it
would be good to abstract that.

Martin

Revision history for this message
Martin Packman (gz) wrote :

> I do think gz's comment about access to kill vs access to get info is
> relevant. Specifically, if User A does the push, User B cannot kill
> their processes. Would that lead to us thinking the process is already
> dead, and unlocking things underneath them?

No, it should fail safe. In this circumstance, you get ERROR_ACCESS_DENIED which we take to mean the process may exist, so don't break the lock. That's just slightly worse information than if we can get a process handle.
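That fail-safe interpretation can be isolated as a pure decision function (a sketch; the constant values come from the Windows SDK, and the function name is illustrative rather than bzrlib's):

```python
# Windows SDK constant values, written out by hand as named constants.
ERROR_ACCESS_DENIED = 5
ERROR_INVALID_PARAMETER = 87


def interpret_open_process(handle, last_error):
    """Map an OpenProcess() result to "is this pid known dead?".

    Only ERROR_INVALID_PARAMETER proves there is no such process; an
    access-denied failure means some process (perhaps another user's)
    owns the pid, so the lock must not be broken.
    """
    if handle:
        return False  # got a handle: the process certainly exists
    if last_error == ERROR_ACCESS_DENIED:
        return False  # exists, but we may not touch it: fail safe
    if last_error == ERROR_INVALID_PARAMETER:
        return True   # no process with that pid
    return False      # any other error: assume it may be alive
```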

Revision history for this message
John A Meinel (jameinel) wrote :

I would like to see the magic numbers in the win32 functions pre-defined as named constants, so

PROCESS_TERMINATE = 1
ERROR_ACCESS_DENIED = 5

etc.

But that is a pretty trivial cleanup.
if self.get('hostname') == 'localhost':

I wonder if we should handle "get()" returning None? What about "get('hostname').startswith('localhost')", in case it returns "localhost.localdomain"?

Anyway, overall I think I've weighed this enough to be okay with it. The checks Martin added are pretty strict, so they should be enough.

review: Approve
Revision history for this message
Martin Pool (mbp) wrote :

Based on the irc discussion, we might start this by just warning in these cases.

Revision history for this message
Martin Pool (mbp) wrote :

sent to pqm by email

Revision history for this message
Martin Pool (mbp) wrote :

sent to pqm by email

Revision history for this message
Martin Pool (mbp) wrote :

sent to pqm by email

Revision history for this message
Martin Pool (mbp) wrote :

sent to pqm by email

Preview Diff

1=== modified file 'bzrlib/commit.py'
2--- bzrlib/commit.py 2011-05-17 14:27:30 +0000
3+++ bzrlib/commit.py 2011-06-13 21:24:39 +0000
4@@ -52,7 +52,6 @@
5 from bzrlib import (
6 debug,
7 errors,
8- revision,
9 trace,
10 tree,
11 ui,
12
13=== modified file 'bzrlib/config.py'
14--- bzrlib/config.py 2011-06-01 12:31:04 +0000
15+++ bzrlib/config.py 2011-06-13 21:24:39 +0000
16@@ -75,7 +75,6 @@
17 import re
18 from cStringIO import StringIO
19
20-import bzrlib
21 from bzrlib import (
22 atomicfile,
23 bzrdir,
24@@ -372,16 +371,18 @@
25 value = self._expand_options_in_string(value)
26 return value
27
28- def get_user_option_as_bool(self, option_name, expand=None):
29- """Get a generic option as a boolean - no special process, no default.
30+ def get_user_option_as_bool(self, option_name, expand=None, default=None):
31+ """Get a generic option as a boolean.
32
33+ :param expand: Allow expanding references to other config values.
34+ :param default: Default value if nothing is configured
35 :return None if the option doesn't exist or its value can't be
36 interpreted as a boolean. Returns True or False otherwise.
37 """
38 s = self.get_user_option(option_name, expand=expand)
39 if s is None:
40 # The option doesn't exist
41- return None
42+ return default
43 val = ui.bool_from_string(s)
44 if val is None:
45 # The value can't be interpreted as a boolean
46@@ -2701,6 +2702,8 @@
47 ' the configuration file'),
48 ]
49
50+ _see_also = ['configuration']
51+
52 @commands.display_command
53 def run(self, name=None, all=False, directory=None, scope=None,
54 remove=False):
55
56=== modified file 'bzrlib/help_topics/en/configuration.txt'
57--- bzrlib/help_topics/en/configuration.txt 2011-05-24 14:40:29 +0000
58+++ bzrlib/help_topics/en/configuration.txt 2011-06-13 21:24:39 +0000
59@@ -462,6 +462,14 @@
60 These settings are only needed if the SMTP server requires authentication
61 to send mail.
62
63+locks.steal_dead
64+~~~~~~~~~~~~~~~~
65+
66+If set to true, bzr will automatically break locks held by processes from
67+the same machine and user that are no longer alive. Otherwise, it will
68+print a message and you can break the lock manually, if you are satisfied
69+the object is no longer in use.
70+
71 mail_client
72 ~~~~~~~~~~~
73
74
75=== modified file 'bzrlib/lockdir.py'
76--- bzrlib/lockdir.py 2011-05-16 13:39:39 +0000
77+++ bzrlib/lockdir.py 2011-06-13 21:24:39 +0000
78@@ -92,6 +92,10 @@
79 >>> # do something here
80 >>> l.unlock()
81
82+Some classes of stale locks can be predicted by checking: the host name is the
83+same as the local host name; the user name is the same as the local user; the
84+process id no longer exists. The check on user name is not strictly necessary
85+but helps protect against colliding host names.
86 """
87
88
89@@ -103,16 +107,19 @@
90 # the existing locking code and needs a new format of the containing object.
91 # -- robertc, mbp 20070628
92
93+import errno
94 import os
95 import time
96
97 from bzrlib import (
98+ config,
99 debug,
100 errors,
101 lock,
102 osutils,
103+ ui,
104+ urlutils,
105 )
106-import bzrlib.config
107 from bzrlib.decorators import only_raises
108 from bzrlib.errors import (
109 DirectoryNotEmpty,
110@@ -130,7 +137,6 @@
111 )
112 from bzrlib.trace import mutter, note
113 from bzrlib.osutils import format_delta, rand_chars, get_host_name
114-import bzrlib.ui
115
116 from bzrlib.lazy_import import lazy_import
117 lazy_import(globals(), """
118@@ -162,7 +168,8 @@
119
120 __INFO_NAME = '/info'
121
122- def __init__(self, transport, path, file_modebits=0644, dir_modebits=0755):
123+ def __init__(self, transport, path, file_modebits=0644, dir_modebits=0755,
124+ extra_holder_info=None):
125 """Create a new LockDir object.
126
127 The LockDir is initially unlocked - this just creates the object.
128@@ -171,6 +178,10 @@
129
130 :param path: Path to the lock within the base directory of the
131 transport.
132+
133+ :param extra_holder_info: If passed, {str:str} dict of extra or
134+ updated information to insert into the info file when the lock is
135+ taken.
136 """
137 self.transport = transport
138 self.path = path
139@@ -181,8 +192,9 @@
140 self._held_info_path = self._held_dir + self.__INFO_NAME
141 self._file_modebits = file_modebits
142 self._dir_modebits = dir_modebits
143-
144 self._report_function = note
145+ self.extra_holder_info = extra_holder_info
146+ self._warned_about_lock_holder = None
147
148 def __repr__(self):
149 return '%s(%s%s)' % (self.__class__.__name__,
150@@ -203,7 +215,6 @@
151 except (TransportError, PathError), e:
152 raise LockFailed(self, e)
153
154-
155 def _attempt_lock(self):
156 """Make the pending directory and attempt to rename into place.
157
158@@ -216,8 +227,8 @@
159
160 :returns: The nonce of the lock, if it was successfully acquired.
161
162- :raises LockContention: If the lock is held by someone else. The exception
163- contains the info of the current holder of the lock.
164+ :raises LockContention: If the lock is held by someone else. The
165+ exception contains the info of the current holder of the lock.
166 """
167 self._trace("lock_write...")
168 start_time = time.time()
169@@ -226,17 +237,24 @@
170 except (errors.TransportError, PathError), e:
171 self._trace("... failed to create pending dir, %s", e)
172 raise LockFailed(self, e)
173- try:
174- self.transport.rename(tmpname, self._held_dir)
175- except (errors.TransportError, PathError, DirectoryNotEmpty,
176- FileExists, ResourceBusy), e:
177- self._trace("... contention, %s", e)
178- self._remove_pending_dir(tmpname)
179- raise LockContention(self)
180- except Exception, e:
181- self._trace("... lock failed, %s", e)
182- self._remove_pending_dir(tmpname)
183- raise
184+ while True:
185+ try:
186+ self.transport.rename(tmpname, self._held_dir)
187+ break
188+ except (errors.TransportError, PathError, DirectoryNotEmpty,
189+ FileExists, ResourceBusy), e:
190+ self._trace("... contention, %s", e)
191+ other_holder = self.peek()
192+ self._trace("other holder is %r" % other_holder)
193+ try:
194+ self._handle_lock_contention(other_holder)
195+ except:
196+ self._remove_pending_dir(tmpname)
197+ raise
198+ except Exception, e:
199+ self._trace("... lock failed, %s", e)
200+ self._remove_pending_dir(tmpname)
201+ raise
202 # We must check we really got the lock, because Launchpad's sftp
203 # server at one time had a bug where the rename would successfully
204 # move the new directory into the existing directory, which was
205@@ -262,6 +280,33 @@
206 (time.time() - start_time) * 1000)
207 return self.nonce
208
209+ def _handle_lock_contention(self, other_holder):
210+ """A lock we want to take is held by someone else.
211+
212+ This function can tell the user about it, detect that it's safe or
213+ appropriate to steal the lock, or just raise an exception.
214+
215+ If this function returns (without raising an exception) the lock will
216+ be attempted again.
217+
218+ :param other_holder: A LockHeldInfo for the current holder; note that
219+ it might be None if the lock can be seen to be held but the info
220+ can't be read.
221+ """
222+ if (other_holder is not None):
223+ if other_holder.is_lock_holder_known_dead():
224+ if self.get_config().get_user_option_as_bool(
225+ 'locks.steal_dead',
226+ default=False):
227+ ui.ui_factory.show_user_warning(
228+ 'locks_steal_dead',
229+ lock_url=urlutils.join(self.transport.base, self.path),
230+ other_holder_info=unicode(other_holder))
231+ self.force_break(other_holder)
232+ self._trace("stole lock from dead holder")
233+ return
234+ raise LockContention(self)
235+
236 def _remove_pending_dir(self, tmpname):
237 """Remove the pending directory
238
239@@ -287,14 +332,14 @@
240 self.create(mode=self._dir_modebits)
241 # After creating the lock directory, try again
242 self.transport.mkdir(tmpname)
243- self.nonce = rand_chars(20)
244- info_bytes = self._prepare_info()
245+ info = LockHeldInfo.for_this_process(self.extra_holder_info)
246+ self.nonce = info.get('nonce')
247 # We use put_file_non_atomic because we just created a new unique
248 # directory so we don't have to worry about files existing there.
249 # We'll rename the whole directory into place to get atomic
250 # properties
251 self.transport.put_bytes_non_atomic(tmpname + self.__INFO_NAME,
252- info_bytes)
253+ info.to_bytes())
254 return tmpname
255
256 @only_raises(LockNotHeld, LockBroken)
257@@ -344,9 +389,10 @@
258 def break_lock(self):
259 """Break a lock not held by this instance of LockDir.
260
261- This is a UI centric function: it uses the bzrlib.ui.ui_factory to
262+ This is a UI centric function: it uses the ui.ui_factory to
263 prompt for input if a lock is detected and there is any doubt about
264- it possibly being still active.
265+ it possibly being still active. force_break is the non-interactive
266+ version.
267
268 :returns: LockResult for the broken lock.
269 """
270@@ -355,16 +401,16 @@
271 holder_info = self.peek()
272 except LockCorrupt, e:
273 # The lock info is corrupt.
274- if bzrlib.ui.ui_factory.get_boolean(u"Break (corrupt %r)" % (self,)):
275+ if ui.ui_factory.get_boolean(u"Break (corrupt %r)" % (self,)):
276 self.force_break_corrupt(e.file_data)
277 return
278 if holder_info is not None:
279- lock_info = '\n'.join(self._format_lock_info(holder_info))
280- if bzrlib.ui.ui_factory.confirm_action(
281- "Break %(lock_info)s", 'bzrlib.lockdir.break',
282- dict(lock_info=lock_info)):
283+ if ui.ui_factory.confirm_action(
284+ u"Break %(lock_info)s",
285+ 'bzrlib.lockdir.break',
286+ dict(lock_info=unicode(holder_info))):
287 result = self.force_break(holder_info)
288- bzrlib.ui.ui_factory.show_message(
289+ ui.ui_factory.show_message(
290 "Broke lock %s" % result.lock_url)
291
292 def force_break(self, dead_holder_info):
293@@ -374,18 +420,19 @@
294 it still thinks it has the lock there will be two concurrent writers.
295 In general the user's approval should be sought for lock breaks.
296
297- dead_holder_info must be the result of a previous LockDir.peek() call;
298- this is used to check that it's still held by the same process that
299- the user decided was dead. If this is not the current holder,
300- LockBreakMismatch is raised.
301-
302 After the lock is broken it will not be held by any process.
303 It is possible that another process may sneak in and take the
304 lock before the breaking process acquires it.
305
306+ :param dead_holder_info:
307+ Must be the result of a previous LockDir.peek() call; this is used
308+ to check that it's still held by the same process that the user
309+ decided was dead. If this is not the current holder,
310+ LockBreakMismatch is raised.
311+
312 :returns: LockResult for the broken lock.
313 """
314- if not isinstance(dead_holder_info, dict):
315+ if not isinstance(dead_holder_info, LockHeldInfo):
316 raise ValueError("dead_holder_info: %r" % dead_holder_info)
317 self._check_not_locked()
318 current_info = self.peek()
319@@ -413,10 +460,10 @@
320
321 def force_break_corrupt(self, corrupt_info_lines):
322 """Release a lock that has been corrupted.
323-
324+
325 This is very similar to force_break, except that it doesn't assume that
326 self.peek() can work.
327-
328+
329 :param corrupt_info_lines: the lines of the corrupted info file, used
330 to check that the lock hasn't changed between reading the (corrupt)
331 info file and calling force_break_corrupt.
332@@ -470,7 +517,8 @@
333
334 peek() reads the info file of the lock holder, if any.
335 """
336- return self._parse_info(self.transport.get_bytes(path))
337+ return LockHeldInfo.from_info_file_bytes(
338+ self.transport.get_bytes(path))
339
340 def peek(self):
341 """Check if the lock is held by anyone.
342@@ -489,35 +537,6 @@
343 def _prepare_info(self):
344 """Write information about a pending lock to a temporary file.
345 """
346- # XXX: is creating this here inefficient?
347- config = bzrlib.config.GlobalConfig()
348- try:
349- user = config.username()
350- except errors.NoWhoami:
351- user = osutils.getuser_unicode()
352- s = rio.Stanza(hostname=get_host_name(),
353- pid=str(os.getpid()),
354- start_time=str(int(time.time())),
355- nonce=self.nonce,
356- user=user,
357- )
358- return s.to_string()
359-
360- def _parse_info(self, info_bytes):
361- lines = osutils.split_lines(info_bytes)
362- try:
363- stanza = rio.read_stanza(lines)
364- except ValueError, e:
365- mutter('Corrupt lock info file: %r', lines)
366- raise LockCorrupt("could not parse lock info file: " + str(e),
367- lines)
368- if stanza is None:
369- # see bug 185013; we fairly often end up with the info file being
370- # empty after an interruption; we could log a message here but
371- # there may not be much we can say
372- return {}
373- else:
374- return stanza.as_dict()
375
376 def attempt_lock(self):
377 """Take the lock; fail if it's already held.
378@@ -598,24 +617,16 @@
379 else:
380 start = 'Lock owner changed for'
381 last_info = new_info
382- formatted_info = self._format_lock_info(new_info)
383+ msg = u'%s lock %s %s.' % (start, lock_url, new_info)
384 if deadline_str is None:
385 deadline_str = time.strftime('%H:%M:%S',
386- time.localtime(deadline))
387- user, hostname, pid, time_ago = formatted_info
388- msg = ('%s lock %s ' # lock_url
389- 'held by ' # start
390- '%s\n' # user
391- 'at %s ' # hostname
392- '[process #%s], ' # pid
393- 'acquired %s.') # time ago
394- msg_args = [start, lock_url, user, hostname, pid, time_ago]
395+ time.localtime(deadline))
396 if timeout > 0:
397 msg += ('\nWill continue to try until %s, unless '
398- 'you press Ctrl-C.')
399- msg_args.append(deadline_str)
400+ 'you press Ctrl-C.'
401+ % deadline_str)
402 msg += '\nSee "bzr help break-lock" for more.'
403- self._report_function(msg, *msg_args)
404+ self._report_function(msg)
405 if (max_attempts is not None) and (attempt_count >= max_attempts):
406 self._trace("exceeded %d attempts", attempt_count)
407 raise LockContention(self)
408@@ -676,23 +687,6 @@
409 raise LockContention(self)
410 self._fake_read_lock = True
411
412- def _format_lock_info(self, info):
413- """Turn the contents of peek() into something for the user"""
414- start_time = info.get('start_time')
415- if start_time is None:
416- time_ago = '(unknown)'
417- else:
418- time_ago = format_delta(time.time() - int(info['start_time']))
419- user = info.get('user', '<unknown>')
420- hostname = info.get('hostname', '<unknown>')
421- pid = info.get('pid', '<unknown>')
422- return [
423- user,
424- hostname,
425- pid,
426- time_ago,
427- ]
428-
429 def validate_token(self, token):
430 if token is not None:
431 info = self.peek()
432@@ -710,3 +704,158 @@
433 if 'lock' not in debug.debug_flags:
434 return
435 mutter(str(self) + ": " + (format % args))
436+
437+ def get_config(self):
438+ """Get the configuration that governs this lockdir."""
439+ # XXX: This really should also use the locationconfig at least, but
440+ # that seems a bit hard to hook up at the moment. -- mbp 20110329
441+ return config.GlobalConfig()
442+
443+
444+class LockHeldInfo(object):
445+ """The information recorded about a held lock.
446+
447+ This information is recorded into the lock when it's taken, and it can be
448+ read back by any process with access to the lockdir. It can be used, for
449+ example, to tell the user who holds the lock, or to try to detect whether
450+ the lock holder is still alive.
451+
452+ Prior to bzr 2.4 a simple dict was used instead of an object.
453+ """
454+
455+ def __init__(self, info_dict):
456+ self.info_dict = info_dict
457+
458+ def __repr__(self):
459+ """Return a debugging representation of this object."""
460+ return "%s(%r)" % (self.__class__.__name__, self.info_dict)
461+
462+ def __unicode__(self):
463+ """Return a user-oriented description of this object."""
464+ d = self.to_readable_dict()
465+ return (
466+ u'held by %(user)s on %(hostname)s (process #%(pid)s), '
467+ u'acquired %(time_ago)s' % d)
468+
469+ def to_readable_dict(self):
470+ """Turn the holder info into a dict of human-readable attributes.
471+
472+ For example, the start time is presented relative to the current time,
473+ rather than as seconds since the epoch.
474+
475+ Returns a dict containing user, hostname, pid and time_ago, all as
476+ readable strings.
477+ """
478+ start_time = self.info_dict.get('start_time')
479+ if start_time is None:
480+ time_ago = '(unknown)'
481+ else:
482+ time_ago = format_delta(
483+ time.time() - int(self.info_dict['start_time']))
484+ user = self.info_dict.get('user', '<unknown>')
485+ hostname = self.info_dict.get('hostname', '<unknown>')
486+ pid = self.info_dict.get('pid', '<unknown>')
487+ return dict(
488+ user=user,
489+ hostname=hostname,
490+ pid=pid,
491+ time_ago=time_ago)
492+
493+ def get(self, field_name):
494+ """Return the contents of a field from the lock info, or None."""
495+ return self.info_dict.get(field_name)
496+
497+ @classmethod
498+ def for_this_process(cls, extra_holder_info):
499+ """Return a new LockHeldInfo for a lock taken by this process.
500+ """
501+ info = dict(
502+ hostname=get_host_name(),
503+ pid=str(os.getpid()),
504+ nonce=rand_chars(20),
505+ start_time=str(int(time.time())),
506+ user=get_username_for_lock_info(),
507+ )
508+ if extra_holder_info is not None:
509+ info.update(extra_holder_info)
510+ return cls(info)
511+
512+ def to_bytes(self):
513+ s = rio.Stanza(**self.info_dict)
514+ return s.to_string()
515+
516+ @classmethod
517+ def from_info_file_bytes(cls, info_file_bytes):
518+ """Construct from the contents of the held file."""
519+ lines = osutils.split_lines(info_file_bytes)
520+ try:
521+ stanza = rio.read_stanza(lines)
522+ except ValueError, e:
523+ mutter('Corrupt lock info file: %r', lines)
524+ raise LockCorrupt("could not parse lock info file: " + str(e),
525+ lines)
526+ if stanza is None:
527+ # see bug 185013; we fairly often end up with the info file being
528+ # empty after an interruption; we could log a message here but
529+ # there may not be much we can say
530+ return cls({})
531+ else:
532+ return cls(stanza.as_dict())
533+
534+ def __cmp__(self, other):
535+ """Value comparison of lock holders."""
536+ return (
537+ cmp(type(self), type(other))
538+ or cmp(self.info_dict, other.info_dict))
539+
540+ def is_locked_by_this_process(self):
541+ """True if this process seems to be the current lock holder."""
542+ return (
543+ self.get('hostname') == get_host_name()
544+ and self.get('pid') == str(os.getpid())
545+ and self.get('user') == get_username_for_lock_info())
546+
547+ def is_lock_holder_known_dead(self):
548+ """True if the lock holder process is known to be dead.
549+
550+ False if it's either known to be still alive, or if we just can't tell.
551+
552+ We can be fairly sure the lock holder is dead if it declared the same
553+ hostname and there is no process with the given pid alive. If people
554+ have multiple machines with the same hostname this may cause trouble.
555+
556+ This doesn't check whether the lock holder is in fact the caller;
557+ if it is, the caller is clearly alive, so this returns False.
558+ """
559+ if self.get('hostname') != get_host_name():
560+ return False
561+ if self.get('hostname') == 'localhost':
562+ # Too ambiguous.
563+ return False
564+ if self.get('user') != get_username_for_lock_info():
565+ # Could well be another local process by a different user, but
566+ # just to be safe we won't conclude about this either.
567+ return False
568+ pid_str = self.info_dict.get('pid', None)
569+ if not pid_str:
570+ mutter("no pid recorded in %r" % (self, ))
571+ return False
572+ try:
573+ pid = int(pid_str)
574+ except ValueError:
575+ mutter("can't parse pid %r from %r"
576+ % (pid_str, self))
577+ return False
578+ return osutils.is_local_pid_dead(pid)
579+
580+
581+def get_username_for_lock_info():
582+ """Get a username suitable for putting into a lock.
583+
584+ It's ok if what's written here is not a proper email address as long
585+ as it gives some clue who the user is.
586+ """
587+ try:
588+ return config.GlobalConfig().username()
589+ except errors.NoWhoami:
590+ return osutils.getuser_unicode()
591
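The stale-lock policy that `is_lock_holder_known_dead` implements above is a chain of deliberately conservative checks. A standalone sketch of that policy (the function name and the `pid_is_dead` probe are hypothetical stand-ins, not bzrlib API):

```python
def is_holder_known_dead(info, local_hostname, local_user, pid_is_dead):
    """True only when the recorded holder is provably a dead local process.

    info: dict with 'hostname', 'user' and 'pid' as recorded in the lock.
    pid_is_dead: callable(pid) -> True if no such process exists locally.
    """
    if info.get('hostname') != local_hostname:
        return False              # other machine: we can't probe its pids
    if info.get('hostname') == 'localhost':
        return False              # too ambiguous to trust
    if info.get('user') != local_user:
        return False              # never auto-break another user's lock
    try:
        pid = int(info.get('pid') or '')
    except ValueError:
        return False              # unparseable pid: play safe
    return pid_is_dead(pid)

record = {'hostname': 'grace', 'user': 'mbp@example.com', 'pid': '4242'}
# Same host and user, process gone: the only case where stealing is safe.
print(is_holder_known_dead(record, 'grace', 'mbp@example.com', lambda p: True))
# Different user: never concluded dead, even though the pid probe says gone.
print(is_holder_known_dead(record, 'grace', 'other@example.com', lambda p: True))
```

In the real code, even a True result only leads to stealing when the `locks.steal_dead` option is also enabled.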
592=== modified file 'bzrlib/osutils.py'
593--- bzrlib/osutils.py 2011-05-26 08:05:45 +0000
594+++ bzrlib/osutils.py 2011-06-13 21:24:39 +0000
595@@ -2448,3 +2448,30 @@
596 if os.access(f, os.X_OK):
597 return f
598 return None
599+
600+
601+def _posix_is_local_pid_dead(pid):
602+ """True if pid doesn't correspond to live process on this machine"""
603+ try:
604+ # Special meaning of unix kill: just check if it's there.
605+ os.kill(pid, 0)
606+ except OSError, e:
607+ if e.errno == errno.ESRCH:
608+ # On this machine, and really not found: as sure as we can be
609+ # that it's dead.
610+ return True
611+ elif e.errno == errno.EPERM:
612+ # exists, though not ours
613+ return False
614+ else:
615+ mutter("os.kill(%d, 0) failed: %s" % (pid, e))
616+ # Don't really know.
617+ return False
618+ else:
619+ # Exists and our process: not dead.
620+ return False
621+
622+if sys.platform == "win32":
623+ is_local_pid_dead = win32utils.is_local_pid_dead
624+else:
625+ is_local_pid_dead = _posix_is_local_pid_dead
626
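The POSIX liveness probe in `_posix_is_local_pid_dead` relies on the signal-0 convention: `kill(pid, 0)` performs the permission and existence checks but delivers nothing. A self-contained version of the same errno handling (Python 3 syntax):

```python
import errno
import os

def is_local_pid_dead(pid):
    """Return True only when we are sure no such local process exists."""
    try:
        os.kill(pid, 0)           # signal 0: existence check, sends nothing
    except OSError as e:
        if e.errno == errno.ESRCH:
            return True           # no such process: as dead as we can tell
        if e.errno == errno.EPERM:
            return False          # exists, just not ours to signal
        return False              # unknown failure: don't claim it's dead
    return False                  # exists and is signalable: alive

print(is_local_pid_dead(os.getpid()))  # → False: we are clearly alive
```

Note the asymmetry: every uncertain outcome maps to False, so callers only steal a lock on a definite ESRCH.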
627=== modified file 'bzrlib/tests/__init__.py'
628--- bzrlib/tests/__init__.py 2011-06-09 15:13:54 +0000
629+++ bzrlib/tests/__init__.py 2011-06-13 21:24:39 +0000
630@@ -1070,9 +1070,13 @@
631 # break some locks on purpose and should be taken into account by
632 # considering that breaking a lock is just a dirty way of releasing it.
633 if len(acquired_locks) != (len(released_locks) + len(broken_locks)):
634- message = ('Different number of acquired and '
635- 'released or broken locks. (%s, %s + %s)' %
636- (acquired_locks, released_locks, broken_locks))
637+ message = (
638+ 'Different number of acquired and '
639+ 'released or broken locks.\n'
640+ 'acquired=%s\n'
641+ 'released=%s\n'
642+ 'broken=%s\n' %
643+ (acquired_locks, released_locks, broken_locks))
644 if not self._lock_check_thorough:
645 # Rather than fail, just warn
646 print "Broken test %s: %s" % (self, message)
647
648=== modified file 'bzrlib/tests/test_lockdir.py'
649--- bzrlib/tests/test_lockdir.py 2011-05-13 12:51:05 +0000
650+++ bzrlib/tests/test_lockdir.py 2011-06-13 21:24:39 +0000
651@@ -17,6 +17,7 @@
652 """Tests for LockDir"""
653
654 import os
655+import sys
656 import time
657
658 import bzrlib
659@@ -24,6 +25,7 @@
660 config,
661 errors,
662 lock,
663+ lockdir,
664 osutils,
665 tests,
666 transport,
667@@ -35,9 +37,13 @@
668 LockFailed,
669 LockNotHeld,
670 )
671-from bzrlib.lockdir import LockDir
672+from bzrlib.lockdir import (
673+ LockDir,
674+ LockHeldInfo,
675+ )
676 from bzrlib.tests import (
677 features,
678+ TestCase,
679 TestCaseWithTransport,
680 )
681 from bzrlib.trace import note
682@@ -48,6 +54,7 @@
683 # implementation are tested separately. (The main requirement is just that
684 # they don't allow overwriting nonempty directories.)
685
686+
687 class TestLockDir(TestCaseWithTransport):
688 """Test LockDir operations"""
689
690@@ -141,8 +148,8 @@
691 self.addCleanup(lf1.unlock)
692 # lock is held, should get some info on it
693 info1 = lf1.peek()
694- self.assertEqual(set(info1.keys()),
695- set(['user', 'nonce', 'hostname', 'pid', 'start_time']))
696+ self.assertEqual(set(info1.info_dict.keys()),
697+ set(['user', 'nonce', 'hostname', 'pid', 'start_time']))
698 # should get the same info if we look at it through a different
699 # instance
700 info2 = LockDir(t, 'test_lock').peek()
701@@ -161,7 +168,7 @@
702 self.addCleanup(lf1.unlock)
703 info2 = lf2.peek()
704 self.assertTrue(info2)
705- self.assertEqual(info2['nonce'], lf1.nonce)
706+ self.assertEqual(info2.get('nonce'), lf1.nonce)
707
708 def test_30_lock_wait_fail(self):
709 """Wait on a lock, then fail
710@@ -184,24 +191,16 @@
711 # it should only take about 0.4 seconds, but we allow more time in
712 # case the machine is heavily loaded
713 self.assertTrue(after - before <= 8.0,
714- "took %f seconds to detect lock contention" % (after - before))
715+ "took %f seconds to detect lock contention" % (after - before))
716 finally:
717 lf1.unlock()
718 self.assertEqual(1, len(self._logged_reports))
719- self.assertEqual(self._logged_reports[0][0],
720- '%s lock %s held by %s\n'
721- 'at %s [process #%s], acquired %s.\n'
722- 'Will continue to try until %s, unless '
723- 'you press Ctrl-C.\n'
724- 'See "bzr help break-lock" for more.')
725- start, lock_url, user, hostname, pid, time_ago, deadline_str = \
726- self._logged_reports[0][1]
727- self.assertEqual(start, u'Unable to obtain')
728- self.assertEqual(user, u'jrandom@example.com')
729- # skip hostname
730- self.assertContainsRe(pid, r'\d+')
731- self.assertContainsRe(time_ago, r'.* ago')
732- self.assertContainsRe(deadline_str, r'\d{2}:\d{2}:\d{2}')
733+ self.assertContainsRe(self._logged_reports[0][0],
734+ r'Unable to obtain lock .* held by jrandom@example\.com on .*'
735+ r' \(process #\d+\), acquired .* ago\.\n'
736+ r'Will continue to try until \d{2}:\d{2}:\d{2}, unless '
737+ r'you press Ctrl-C.\n'
738+ r'See "bzr help break-lock" for more.')
739
740 def test_31_lock_wait_easy(self):
741 """Succeed when waiting on a lock with no contention.
742@@ -349,12 +348,15 @@
743 ld.create()
744 ld.lock_write()
745 ld.transport.put_bytes_non_atomic('test_lock/held/info', '\0')
746+
747 class LoggingUIFactory(bzrlib.ui.SilentUIFactory):
748 def __init__(self):
749 self.prompts = []
750+
751 def get_boolean(self, prompt):
752 self.prompts.append(('boolean', prompt))
753 return True
754+
755 ui = LoggingUIFactory()
756 self.overrideAttr(bzrlib.ui, 'ui_factory', ui)
757 ld2.break_lock()
758@@ -372,12 +374,15 @@
759 ld.create()
760 ld.lock_write()
761 ld.transport.delete('test_lock/held/info')
762+
763 class LoggingUIFactory(bzrlib.ui.SilentUIFactory):
764 def __init__(self):
765 self.prompts = []
766+
767 def get_boolean(self, prompt):
768 self.prompts.append(('boolean', prompt))
769 return True
770+
771 ui = LoggingUIFactory()
772 orig_factory = bzrlib.ui.ui_factory
773 bzrlib.ui.ui_factory = ui
774@@ -416,18 +421,17 @@
775 lf1.unlock()
776 self.assertFalse(t.has('test_lock/held/info'))
777
778- def test__format_lock_info(self):
779+ def test_display_form(self):
780 ld1 = self.get_lock()
781 ld1.create()
782 ld1.lock_write()
783 try:
784- info_list = ld1._format_lock_info(ld1.peek())
785+ info_list = ld1.peek().to_readable_dict()
786 finally:
787 ld1.unlock()
788- self.assertEqual(info_list[0], u'jrandom@example.com')
789- # info_list[1] is hostname. we skip this.
790- self.assertContainsRe(info_list[2], '^\d+$') # pid
791- self.assertContainsRe(info_list[3], r'^\d+ seconds? ago$') # time_ago
792+ self.assertEqual(info_list['user'], u'jrandom@example.com')
793+ self.assertContainsRe(info_list['pid'], '^\d+$')
794+ self.assertContainsRe(info_list['time_ago'], r'^\d+ seconds? ago$')
795
796 def test_lock_without_email(self):
797 global_config = config.GlobalConfig()
798@@ -481,8 +485,10 @@
799 # should be nothing before we start
800 ld1.create()
801 t = self.get_transport().clone('test_lock')
802+
803 def check_dir(a):
804 self.assertEquals(a, t.list_dir('.'))
805+
806 check_dir([])
807 # when held, that's all we see
808 ld1.attempt_lock()
809@@ -505,9 +511,10 @@
810 t.put_bytes('test_lock/held/info', '')
811 lf = LockDir(t, 'test_lock')
812 info = lf.peek()
813- formatted_info = lf._format_lock_info(info)
814+ formatted_info = info.to_readable_dict()
815 self.assertEquals(
816- ['<unknown>', '<unknown>', '<unknown>', '(unknown)'],
817+ dict(user='<unknown>', hostname='<unknown>', pid='<unknown>',
818+ time_ago='(unknown)'),
819 formatted_info)
820
821 def test_corrupt_lockdir_info(self):
822@@ -646,3 +653,108 @@
823 ld2.force_break(holder_info)
824 lock_path = ld.transport.abspath(ld.path)
825 self.assertEqual([], self._calls)
826+
827+
828+class TestLockHeldInfo(TestCase):
829+ """Can get information about the lock holder, and detect whether they're
830+ still alive."""
831+
832+ def test_repr(self):
833+ info = LockHeldInfo.for_this_process(None)
834+ self.assertContainsRe(repr(info), r"LockHeldInfo\(.*\)")
835+
836+ def test_unicode(self):
837+ info = LockHeldInfo.for_this_process(None)
838+ self.assertContainsRe(unicode(info),
839+ r'held by .* on .* \(process #\d+\), acquired .* ago')
840+
841+ def test_is_locked_by_this_process(self):
842+ info = LockHeldInfo.for_this_process(None)
843+ self.assertTrue(info.is_locked_by_this_process())
844+
845+ def test_is_not_locked_by_this_process(self):
846+ info = LockHeldInfo.for_this_process(None)
847+ info.info_dict['pid'] = '123123123123123'
848+ self.assertFalse(info.is_locked_by_this_process())
849+
850+ def test_lock_holder_live_process(self):
851+ """Detect that the holder (this process) is still running."""
852+ info = LockHeldInfo.for_this_process(None)
853+ self.assertFalse(info.is_lock_holder_known_dead())
854+
855+ def test_lock_holder_dead_process(self):
856+ """Detect that a holder whose recorded pid no longer exists is dead."""
857+ info = LockHeldInfo.for_this_process(None)
858+ info.info_dict['pid'] = '123123123'
859+ if sys.platform == 'win32':
860+ self.knownFailure(
861+ 'live lock holder detection not implemented yet on win32')
862+ self.assertTrue(info.is_lock_holder_known_dead())
863+
864+ def test_lock_holder_other_machine(self):
865+ """The lock holder isn't here so we don't know if they're alive."""
866+ info = LockHeldInfo.for_this_process(None)
867+ info.info_dict['hostname'] = 'egg.example.com'
868+ info.info_dict['pid'] = '123123123'
869+ self.assertFalse(info.is_lock_holder_known_dead())
870+
871+ def test_lock_holder_other_user(self):
872+ """Only auto-break locks held by this user."""
873+ info = LockHeldInfo.for_this_process(None)
874+ info.info_dict['user'] = 'notme@example.com'
875+ info.info_dict['pid'] = '123123123'
876+ self.assertFalse(info.is_lock_holder_known_dead())
877+
878+ def test_no_good_hostname(self):
879+ """Correctly handle ambiguous hostnames.
880+
881+ If the lock's recorded with just 'localhost' we can't really trust
882+ it's the same 'localhost'. (There are quite a few of them. :-)
883+ So even if the process is known not to be alive, we can't say that's
884+ known for sure.
885+ """
886+ self.overrideAttr(lockdir, 'get_host_name',
887+ lambda: 'localhost')
888+ info = LockHeldInfo.for_this_process(None)
889+ info.info_dict['pid'] = '123123123'
890+ self.assertFalse(info.is_lock_holder_known_dead())
891+
892+
893+class TestStaleLockDir(TestCaseWithTransport):
894+ """Can automatically break stale locks.
895+
896+ :see: https://bugs.launchpad.net/bzr/+bug/220464
897+ """
898+
899+ def test_auto_break_stale_lock(self):
900+ """Locks safely known to be stale are just cleaned up.
901+
902+ This generates a warning but no other user interaction.
903+ """
904+ # This is off by default at present; see the discussion in the bug.
905+ # If you change the default, don't forget to update the docs.
906+ config.GlobalConfig().set_user_option('locks.steal_dead', True)
907+ # Create a lock pretending to come from a different nonexistent
908+ # process on the same machine.
909+ l1 = LockDir(self.get_transport(), 'a',
910+ extra_holder_info={'pid': '12312313'})
911+ token_1 = l1.attempt_lock()
912+ l2 = LockDir(self.get_transport(), 'a')
913+ token_2 = l2.attempt_lock()
914+ # l1 will notice its lock was stolen.
915+ self.assertRaises(errors.LockBroken,
916+ l1.unlock)
917+ l2.unlock()
918+
919+ def test_auto_break_stale_lock_configured_off(self):
920+ """Automatic breaking can be turned off"""
921+ l1 = LockDir(self.get_transport(), 'a',
922+ extra_holder_info={'pid': '12312313'})
923+ token_1 = l1.attempt_lock()
924+ self.addCleanup(l1.unlock)
925+ l2 = LockDir(self.get_transport(), 'a')
926+ # This fails now, because dead lock breaking is off by default.
927+ self.assertRaises(LockContention,
928+ l2.attempt_lock)
929+ # and it's in fact not broken
930+ l1.confirm()
931
932=== modified file 'bzrlib/ui/__init__.py'
933--- bzrlib/ui/__init__.py 2011-05-27 05:16:48 +0000
934+++ bzrlib/ui/__init__.py 2011-06-13 21:24:39 +0000
935@@ -154,6 +154,8 @@
936 "It is recommended that you upgrade by "
937 "running the command\n"
938 " bzr upgrade %(basedir)s"),
939+ locks_steal_dead=(
940+ u"Stole dead lock %(lock_url)s %(other_holder_info)s."),
941 )
942
943 def __init__(self):
944@@ -304,13 +306,13 @@
945 try:
946 template = self._user_warning_templates[warning_id]
947 except KeyError:
948- fail = "failed to format warning %r, %r" % (warning_id, message_args)
949- warnings.warn(fail) # so tests will fail etc
950+ fail = "bzr warning: %r, %r" % (warning_id, message_args)
951+ warnings.warn("no template for warning: " + fail) # so tests will fail etc
952 return fail
953 try:
954 return template % message_args
955 except ValueError, e:
956- fail = "failed to format warning %r, %r: %s" % (
957+ fail = "bzr unprintable warning: %r, %r, %s" % (
958 warning_id, message_args, e)
959 warnings.warn(fail) # so tests will fail etc
960 return fail
961
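The template lookup changed above degrades gracefully: a missing or unprintable template produces a fallback string and a `warnings.warn` so tests fail, rather than crashing the UI. A reduced sketch of the same pattern (names hypothetical, and this sketch also catches `KeyError` from missing interpolation keys):

```python
import warnings

templates = {
    'locks_steal_dead': u"Stole dead lock %(lock_url)s %(other_holder_info)s.",
}

def format_user_warning(warning_id, message_args):
    """Format a warning from its template, falling back on any error."""
    try:
        template = templates[warning_id]
    except KeyError:
        fail = "bzr warning: %r, %r" % (warning_id, message_args)
        warnings.warn("no template for warning: " + fail)
        return fail
    try:
        return template % message_args
    except (ValueError, KeyError) as e:
        fail = "bzr unprintable warning: %r, %r, %s" % (
            warning_id, message_args, e)
        warnings.warn(fail)
        return fail

print(format_user_warning(
    'locks_steal_dead',
    dict(lock_url='http://example.com/lock', other_holder_info='held by x')))
```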
962=== modified file 'bzrlib/win32utils.py'
963--- bzrlib/win32utils.py 2011-05-18 16:42:48 +0000
964+++ bzrlib/win32utils.py 2011-06-13 21:24:39 +0000
965@@ -576,3 +576,41 @@
966 return argv
967 else:
968 get_unicode_argv = None
969+
970+
971+if has_win32api:
972+ def _pywin32_is_local_pid_dead(pid):
973+ """True if pid doesn't correspond to live process on this machine"""
974+ try:
975+ handle = win32api.OpenProcess(1, False, pid) # PROCESS_TERMINATE
976+ except pywintypes.error, e:
977+ if e[0] == 5: # ERROR_ACCESS_DENIED
978+ # Probably something alive we're not allowed to kill
979+ return False
980+ elif e[0] == 87: # ERROR_INVALID_PARAMETER
981+ return True
982+ raise
983+ handle.close()
984+ return False
985+ is_local_pid_dead = _pywin32_is_local_pid_dead
986+elif has_ctypes and sys.platform == 'win32':
987+ from ctypes.wintypes import BOOL, DWORD, HANDLE
988+ _kernel32 = ctypes.windll.kernel32
989+ _CloseHandle = ctypes.WINFUNCTYPE(BOOL, HANDLE)(
990+ ("CloseHandle", _kernel32))
991+ _OpenProcess = ctypes.WINFUNCTYPE(HANDLE, DWORD, BOOL, DWORD)(
992+ ("OpenProcess", _kernel32))
993+ def _ctypes_is_local_pid_dead(pid):
994+ """True if pid doesn't correspond to live process on this machine"""
995+ handle = _OpenProcess(1, False, pid) # PROCESS_TERMINATE
996+ if not handle:
997+ errorcode = ctypes.GetLastError()
998+ if errorcode == 5: # ERROR_ACCESS_DENIED
999+ # Probably something alive we're not allowed to kill
1000+ return False
1001+ elif errorcode == 87: # ERROR_INVALID_PARAMETER
1002+ return True
1003+ raise ctypes.WinError(errorcode)
1004+ _CloseHandle(handle)
1005+ return False
1006+ is_local_pid_dead = _ctypes_is_local_pid_dead
1007
1008=== modified file 'doc/en/release-notes/bzr-2.4.txt'
1009--- doc/en/release-notes/bzr-2.4.txt 2011-06-13 17:18:13 +0000
1010+++ doc/en/release-notes/bzr-2.4.txt 2011-06-13 21:24:39 +0000
1011@@ -43,6 +43,13 @@
1012 .. Fixes for situations where bzr would previously crash or give incorrect
1013 or undesirable results.
1014
1015+* Bazaar can now detect when a lock file is held by a dead process
1016+ originating from the same machine, and steal the lock after printing a
1017+ message to the user. This is off by default, for safety, but can be
1018+ turned on by setting the configuration variable ``locks.steal_dead`` to
1019+ ``True``.
1020+ (Martin Pool, #220464)
1021+
1022 * Credentials in the log output produced by ``-Dhttp`` are masked so users
1023 can more freely post them in bug reports. (Vincent Ladeuil, #723074)
1024
1025@@ -89,6 +96,10 @@
1026 .. Changes that may require updates in plugins or other code that uses
1027 bzrlib.
1028
1029+* Information about held lockdir locks returned from eg `LockDir.peek` is
1030+ now represented as a `LockHeldInfo` object, rather than a plain Python
1031+ dict.
1032+
1033 Internals
1034 *********
1035