> Well, that's unexpected since all bootstack instances seem to have
> unique MACs [1]

I've checked that before proposing this MP with
'juju run --service ci-airline-tr-rabbit-worker "python -c 'import uuid; print uuid.uuid1().get_node()'"'
(edited for readability):

- MachineId: "4"   Stdout: 274973446818410   UnitId: ci-airline-tr-rabbit-worker/0
- MachineId: "5"   Stdout: 274973443101009   UnitId: ci-airline-tr-rabbit-worker/1
- MachineId: "6"   Stdout: 274973449401742   UnitId: ci-airline-tr-rabbit-worker/2
- MachineId: "7"   Stdout: 274973436426391   UnitId: ci-airline-tr-rabbit-worker/3
- MachineId: "8"   Stdout: 274973440939877   UnitId: ci-airline-tr-rabbit-worker/4
- MachineId: "9"   Stdout: 274973446307917   UnitId: ci-airline-tr-rabbit-worker/5
- MachineId: "10"  Stdout: 274973450245143   UnitId: ci-airline-tr-rabbit-worker/6
- MachineId: "11"  Stdout: 274973446513838   UnitId: ci-airline-tr-rabbit-worker/7
- MachineId: "12"  Stdout: 274973437559701   UnitId: ci-airline-tr-rabbit-worker/8
- MachineId: "13"  Stdout: 274973448825124   UnitId: ci-airline-tr-rabbit-worker/9
- MachineId: "14"  Stdout: 274973436691689   UnitId: ci-airline-tr-rabbit-worker/10
- MachineId: "15"  Stdout: 274973452322400   UnitId: ci-airline-tr-rabbit-worker/11
- MachineId: "16"  Stdout: 274973444235950   UnitId: ci-airline-tr-rabbit-worker/12
- MachineId: "17"  Stdout: 274973449976901   UnitId: ci-airline-tr-rabbit-worker/13
- MachineId: "18"  Stdout: 274973447132163   UnitId: ci-airline-tr-rabbit-worker/14
- MachineId: "19"  Stdout: 274973440732073   UnitId: ci-airline-tr-rabbit-worker/15
- MachineId: "20"  Stdout: 274973446679760   UnitId: ci-airline-tr-rabbit-worker/16
- MachineId: "21"  Stdout: 274973442815065   UnitId: ci-airline-tr-rabbit-worker/17
- MachineId: "22"  Stdout: 274973443908206   UnitId: ci-airline-tr-rabbit-worker/18
- MachineId: "23"  Stdout: 274973442948691   UnitId: ci-airline-tr-rabbit-worker/19

Ok, host IDs are unique among all the workers.

> and RTC are synced [2]

Well, 'juju run --all date' can only reveal big differences, but the
timestamps in the logs are more precise anyway and they do tell that
the clocks are roughly synced (no significant divergence).

So despite Wikipedia, your intuition and mine about how this is
supposed to work, real life says it doesn't work that way.

> UUIDv1 are composed from the generating unit's MAC and a .1 ms
> precision timestamp [3], so there is a theoretical, very narrow window
> for collision; if the MACs are unique, there must be something else
> going on.

Yup, so what can that something else be, then? Looking at the code:

    def uuid1(node=None, clock_seq=None):
        """Generate a UUID from a host ID, sequence number, and the current time.
        If 'node' is not given, getnode() is used to obtain the hardware address.
        If 'clock_seq' is given, it is used as the sequence number; otherwise a
        random 14-bit sequence number is chosen."""

getnode() is used, they say, but:

        if _uuid_generate_time and node is clock_seq is None:
            _buffer = ctypes.create_string_buffer(16)
            _uuid_generate_time(_buffer)
            return UUID(bytes=_buffer.raw)

No getnode() call there, and:

        if node is None:
            node = getnode()
        return UUID(fields=(time_low, time_mid, time_hi_version,
                            clock_seq_hi_variant, clock_seq_low, node), version=1)

But that's too late: when we call uuid1() with no arguments, the
_uuid_generate_time branch above has already returned.

So at first glance, I'd say this is a genuine bug in uuid.uuid1():
getnode() is never reached despite node=None being passed, which means
we're only relying on time here, and our clocks are synced enough to
collide.
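For what it's worth, if the analysis above holds, there is an easy way
to force the pure-Python branch: passing the node explicitly makes
'node is clock_seq is None' false, so the _uuid_generate_time shortcut
is skipped and the host ID really ends up in the UUID. A minimal
sketch (untested against our charms, the helper name is mine):

    import uuid

    def uuid1_with_node():
        # Passing node explicitly defeats the 'node is clock_seq is None'
        # test quoted above, so uuid1() falls through to the pure-Python
        # branch that actually embeds the 48-bit host ID.
        return uuid.uuid1(node=uuid.getnode())

That still leaves the 100 ns timestamp plus the random 14-bit clock_seq
as the only protection within a single host, but at least distinct MACs
would matter again across units.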
And *that* (colliding because we're effectively down to just the
clock) I could accept ;)

> In a pseudo-random scenario (VMs w/o any true-random HW) I'd say there
> is more chance for system-wide collision using UUIDv4 than v1.

That's quite extreme indeed, but it is not what we're running into:
we're only relying on time, we happen to have good sync between our
units, and that's enough to run into collisions.

> I don't really see collisions happening in the TS/DB domain, even if
> the ticket populating rate was as high as .1 ms (it's far from it),
> postgres enforces uniqueness, so the ticket creation would fail.

As long as you have a single worker creating the uuids, yes, you're
safe. If you start having more, you may run into the same issue: the
other worker will fail, so you'll have to deal with that failure by
retrying (or just switch to uuid4() to avoid collisions), right? See
the retry sketch at the end of this message.

> From that time on, we can assume the TS and the GK would be always
> operated with unique UUIDs

Only if you have a single worker relying on uuid1().

> or fail normally with 404.

Not sure I follow here: do you mean you let the error propagate to the
user as a 404? I for one would probably be confused by getting a 404
when trying to *create* a new object...

> Moreover, soon we will need to order/recover-timestamp from UUID-named
> swift containers and that is only possible with v1.

Then the sooner the better, but let's find a working uuid scheme first
;) (There is a timestamp-recovery sketch at the end of this message
too.)

> So, IMO, TS/DB/GK don't have to be patched, since they do not have
> problems with v1, then you can save yourself from the south migration
> embarrassment (never edit existing migrations, it defeats the whole
> purpose of having them).

Sure, the patch was mentioned only to highlight where we're using
uuid1(), I had no intention to modify a generated file ;)

> That restricts us to the rabbit/queues environment, where you recently
> found that timestamp collision.

I don't think we should restrict the area that much. I've tested
uuid4() for my issue and it fixes it, so the issue really seems to be
that uuid1() is not good enough despite what you and I may think. Paul
was right, no matter how hard this is to believe: uuid1() collides with
synced clocks, probably because it ignores host IDs.

> I am happy to help you investigate it further and find an appropriate
> solution for this.

Thanks, highly appreciated! Can you have a look at the uuid code and
tell me if you agree or disagree with the analysis above?
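Re: retrying on the TS/DB side, here is a hypothetical sketch of what
"deal with that failure" could look like (the table and column names
are made up, I'm assuming psycopg2, and this is not project code):

    import uuid
    import psycopg2

    def create_ticket(conn, max_attempts=3):
        # Retry on the uniqueness violation postgres raises when two
        # workers happen to generate the same uuid1() for a ticket.
        for _ in range(max_attempts):
            ticket_id = str(uuid.uuid1())
            cur = conn.cursor()
            try:
                cur.execute("INSERT INTO tickets (id) VALUES (%s)",
                            (ticket_id,))
                conn.commit()
                return ticket_id
            except psycopg2.IntegrityError:
                conn.rollback()  # another worker won the race, try again
        raise RuntimeError("gave up generating a unique ticket id")

Or, as said, just switching those call sites to uuid4() makes the race
vanish without the retry dance.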
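And re: ordering/recovering timestamps from UUID-named swift
containers, that part is plain RFC 4122 arithmetic, nothing specific to
our code: UUID.time counts 100 ns intervals since 1582-10-15, and
0x01b21dd213814000 is the Unix epoch expressed in those units (the same
constant uuid.py itself uses). A sketch:

    import datetime
    import uuid

    # Offset of 1970-01-01 from the UUID epoch (1582-10-15), in 100 ns units.
    _UNIX_EPOCH_OFFSET = 0x01b21dd213814000

    def uuid1_to_datetime(u):
        # u.time is the 60-bit timestamp embedded in a version 1 UUID.
        return datetime.datetime.utcfromtimestamp(
            (u.time - _UNIX_EPOCH_OFFSET) / 1e7)

    # Sorting UUID-named containers by creation time then boils down to:
    #   sorted(names, key=lambda n: uuid.UUID(n).time)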