ec2 instance IDs are broken after folsom upgrade

Bug #1061166 reported by Adam Gandelman
This bug affects 2 people
| Affects | Status | Importance | Assigned to | Milestone |
| OpenStack Compute (nova) | Fix Released | Critical | Adam Gandelman | |
| Folsom | Fix Released | Critical | Adam Gandelman | |
| nova (Ubuntu) | Fix Released | High | Unassigned | |

Bug Description

After upgrading a running cluster from Essex to Folsom (2012.2-0ubuntu2~cloud0), the EC2 API can no longer find instances by EC2 ID, i.e.:

ubuntu@ip-10-252-38-15:~$ euca-describe-instances
RESERVATION r-90qer9xs 7b58c0e219a948ef942071c871d05b8b default
INSTANCE i-00000006 ami-00000001 192.168.25.4 192.168.25.4 running adam (7b58c0e219a948ef942071c871d05b8b, ip-10-252-40-134) 0 m1.tiny 2012-10-03T18:35:03.000Z novamonitoring-disabled 192.168.25.4 192.168.25.4 instance-store
RESERVATION r-k190bhzm 7b58c0e219a948ef942071c871d05b8b default
INSTANCE i-00000004 ami-00000001 192.168.25.5 192.168.25.5 running adam (7b58c0e219a948ef942071c871d05b8b, ip-10-252-40-134) 0 m1.tiny 2012-10-03T18:22:49.000Z novamonitoring-disabled 192.168.25.5 192.168.25.5 instance-store
ubuntu@ip-10-252-38-15:~$ euca-get-console-output i-00000004
InstanceNotFound: Instance i-00000004 could not be found.

A quick test of other calls shows that this affects many actions:

euca-get-console-output
euca-terminate-instances
euca-attach-volume
euca-associate-address

Note that describing an individual instance (i.e., euca-describe-instances i-00000004) still works as expected. A quick glance at the code shows this call converts the EC2 instance ID to a UUID with ec2utils.ec2_inst_id_to_uuid() rather than ec2utils.ec2_id_to_id() like the other calls.

This does not seem to affect clusters that were initially installed with Folsom. I suspect something is wrong in the data migration from Essex to Folsom that causes the ec2_id_to_id() mapping to be bogus.

tags: added: openstack-ubuntu-upgrade
Dave Walker (davewalker)
Changed in nova (Ubuntu):
importance: Undecided → High
Adam Gandelman (gandelman-a) wrote :

After taking another look at this, EC2 instance IDs appear to change during an upgrade.

Did an upgrade with two running instances, i-00000003 and i-00000004. After the upgrade, they appear changed to i-00000003 and i-00000004, respectively. Any EC2 action (terminate, get-console-output, etc.) on these new IDs fails with InstanceNotFound.

AFAICS:

In Essex, these IDs appear mapped to IDs and UUIDs in the instances table:

mysql> select id, uuid from instances where id=3 or id=4;
+----+--------------------------------------+
| id | uuid |
+----+--------------------------------------+
| 3 | 7ca30bdc-46f3-4a2d-a0fb-f7657d60e8a8 |
| 4 | bf7ecf97-0814-4b5e-b5f1-4a7cecd8a43f |
+----+--------------------------------------+
2 rows in set (0.00 sec)

In Folsom, much has been converted to use UUIDs instead. The EC2 describe instances call gets a list of instances and queries the instance_id_mappings table for the corresponding instance ID (which later gets converted to an EC2 ID). However, after the upgrade this table has duplicate entries for these instances, causing later queries on these instances to yield a new, incorrect ID.

mysql> select * from instance_id_mappings where uuid='bf7ecf97-0814-4b5e-b5f1-4a7cecd8a43f' or uuid='7ca30bdc-46f3-4a2d-a0fb-f7657d60e8a8';
+---------------------+------------+------------+---------+----+--------------------------------------+
| created_at | updated_at | deleted_at | deleted | id | uuid |
+---------------------+------------+------------+---------+----+--------------------------------------+
| NULL | NULL | NULL | NULL | 3 | 7ca30bdc-46f3-4a2d-a0fb-f7657d60e8a8 |
| NULL | NULL | NULL | NULL | 4 | bf7ecf97-0814-4b5e-b5f1-4a7cecd8a43f |
| 2012-10-04 01:50:23 | NULL | NULL | 0 | 5 | 7ca30bdc-46f3-4a2d-a0fb-f7657d60e8a8 |
| 2012-10-04 01:50:23 | NULL | NULL | 0 | 6 | bf7ecf97-0814-4b5e-b5f1-4a7cecd8a43f |
+---------------------+------------+------------+---------+----+--------------------------------------+

This can be worked around by changing the EC2 layer to convert all incoming IDs to UUIDs, but the issue of instance IDs changing during a migration to Folsom is critical and needs to be resolved as an upgrade issue.
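
For anyone checking whether their own table is affected, here is a minimal sketch of a duplicate check. It uses Python's sqlite3 module as a stand-in for the MySQL database, a trimmed-down schema, and the four rows shown above; the helper name find_duplicate_mappings is hypothetical and not part of nova.

```python
import sqlite3

# In-memory stand-in for the nova database; only the relevant columns are kept.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE instance_id_mappings ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " uuid TEXT NOT NULL,"
    " deleted INTEGER)"
)
# Rows 3 and 4 were copied by the migration; rows 5 and 6 were created afterwards.
conn.executemany(
    "INSERT INTO instance_id_mappings (id, uuid, deleted) VALUES (?, ?, ?)",
    [
        (3, "7ca30bdc-46f3-4a2d-a0fb-f7657d60e8a8", None),
        (4, "bf7ecf97-0814-4b5e-b5f1-4a7cecd8a43f", None),
        (5, "7ca30bdc-46f3-4a2d-a0fb-f7657d60e8a8", 0),
        (6, "bf7ecf97-0814-4b5e-b5f1-4a7cecd8a43f", 0),
    ],
)


def find_duplicate_mappings(conn):
    """Return (uuid, row_count, lowest_id, highest_id) for every duplicated UUID."""
    return conn.execute(
        "SELECT uuid, COUNT(*), MIN(id), MAX(id)"
        " FROM instance_id_mappings"
        " GROUP BY uuid HAVING COUNT(*) > 1"
        " ORDER BY MIN(id)"
    ).fetchall()


duplicates = find_duplicate_mappings(conn)
for uuid, count, lowest, highest in duplicates:
    print(uuid, count, lowest, highest)
```

The same GROUP BY ... HAVING query should also run unchanged from the mysql prompt against the real table.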

Adam Gandelman (gandelman-a) wrote :

Oops: should have read that instances i-00000003 and i-00000004 changed post-upgrade to i-00000005 and i-00000006

Chuck Short (zulcss)
tags: added: ec2
Andrew Glen-Young (aglenyoung) wrote :

I upgraded today and found that I had this issue.

Not all of my instances had duplicate entries in the `instance_id_mappings` table; however, I did have more entries in the `instance_id_mappings` table than in the `instances` table.

It seems that the `id` column in the `instances` table is still being used somewhere.

In order to work around the problem I needed to set the auto_increment integer to be the same for each table.

Example:

    -- grab the auto_increment integer for the `instances` table
    SELECT Auto_increment FROM information_schema.tables WHERE table_name='instances' AND table_schema='nova';

    -- grab the auto_increment integer for the `instance_id_mappings` table
    SELECT Auto_increment FROM information_schema.tables WHERE table_name='instance_id_mappings' AND table_schema='nova';

    -- raise the lower of the two values to match the higher one, on whichever table returned the lower value
    -- (here `instances` was the lower; 1769801923 is the value from my database, substitute your own)
    ALTER TABLE instances AUTO_INCREMENT = 1769801923;

Chuck Short (zulcss) wrote :

It looks like the id column should have been set to auto-increment after the data copy was done

Adam Gandelman (gandelman-a) wrote :

From what I can gather the 107 migration only copies the instance's ID and UUID to the new mapping table, but leaves the other columns NULL, specifically 'deleted'.

In Folsom, an EC2 describe instances call will check that the instances.id <-> uuid mapping exists in the instance_id_mappings table and create one if it does not find one.

The query that checks the instance_id_mappings table has a 'deleted=0' constraint. In this case, all of the copied mappings have a NULL value here. Finding no mapping, it creates a new one and throws the whole thing off.

It makes sense that this works fine on fresh Folsom installs, because new instances get a new mapping created in this empty table with the appropriate columns filled.
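
To make the failure mode concrete, here is a minimal sketch of the lookup described above. It uses Python's sqlite3 module in place of MySQL and a cut-down schema; get_or_create_mapping_id and its boolean read_deleted flag are hypothetical simplifications for illustration, not nova's actual function or signature.

```python
import sqlite3

UUID = "bf7ecf97-0814-4b5e-b5f1-4a7cecd8a43f"


def make_upgraded_db():
    """Build an in-memory table holding one row as the migration left it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE instance_id_mappings ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " uuid TEXT NOT NULL,"
        " deleted INTEGER)"
    )
    # The migration copied id and uuid but left 'deleted' as NULL.
    conn.execute(
        "INSERT INTO instance_id_mappings (id, uuid, deleted) VALUES (4, ?, NULL)",
        (UUID,),
    )
    return conn


def get_or_create_mapping_id(conn, uuid, read_deleted):
    """Hypothetical lookup: return the mapping id, creating a row on a miss."""
    sql = "SELECT id FROM instance_id_mappings WHERE uuid = ?"
    if not read_deleted:
        # 'NULL = 0' is never true in SQL, so migrated rows are skipped here.
        sql += " AND deleted = 0"
    row = conn.execute(sql + " ORDER BY id LIMIT 1", (uuid,)).fetchone()
    if row is not None:
        return row[0]
    cur = conn.execute(
        "INSERT INTO instance_id_mappings (uuid, deleted) VALUES (?, 0)", (uuid,)
    )
    return cur.lastrowid


filtered_id = get_or_create_mapping_id(make_upgraded_db(), UUID, read_deleted=False)
unfiltered_id = get_or_create_mapping_id(make_upgraded_db(), UUID, read_deleted=True)
print("i-%08x" % filtered_id)    # EC2-style ID when filtering on deleted=0
print("i-%08x" % unfiltered_id)  # EC2-style ID when all rows are read
```

With the filter in place the migrated row is never matched, so a second row is inserted and the instance surfaces under a new ID (5 in this one-row sketch); without the filter the original ID 4 comes back.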

There is still a relationship between instances.id and instance_id_mappings.id, and the duplicate entries being created in the mapping table throw that relationship off (after the table has duplicate entries created, new instances' EC2 IDs are also off). I'm not sure if there needs to be a FK constraint between instances.id and instance_id_mappings.id to ensure that relationship stays intact, even if the mappings table is being polluted with duplicates.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/14063

Changed in nova:
assignee: nobody → Adam Gandelman (gandelman-a)
status: New → In Progress
Andrew Glen-Young (aglenyoung) wrote :

The loose relationship between two auto_incrementing `id` columns of different tables is insanely brittle. I am not convinced that the intention was for the relation to work this way. It may simply be a problem with part of the code referencing `id` instead of `uuid`.

The UUIDs in the `instance_id_mappings` table should, at a minimum, have a unique constraint.
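
As a rough illustration of what such a constraint would buy, the sketch below declares uuid as UNIQUE on a trimmed-down copy of the table, using Python's sqlite3 module rather than MySQL. The schema is an assumption for demonstration only; it is not the schema nova ships.

```python
import sqlite3

UUID = "bf7ecf97-0814-4b5e-b5f1-4a7cecd8a43f"

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE instance_id_mappings ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " uuid TEXT NOT NULL UNIQUE,"  # the proposed constraint
    " deleted INTEGER)"
)
# Row as copied by the migration, with 'deleted' left NULL.
conn.execute(
    "INSERT INTO instance_id_mappings (id, uuid, deleted) VALUES (4, ?, NULL)",
    (UUID,),
)

try:
    # A second mapping for the same UUID is what polluted the table.
    conn.execute(
        "INSERT INTO instance_id_mappings (uuid, deleted) VALUES (?, 0)", (UUID,)
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM instance_id_mappings").fetchone()[0]
print(duplicate_rejected, row_count)
```

With the constraint, the duplicate insert fails loudly at write time instead of silently shifting EC2 IDs.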

Thierry Carrez (ttx)
Changed in nova:
importance: Undecided → Critical
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in nova (Ubuntu):
status: New → Confirmed
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/14063
Committed: http://github.com/openstack/nova/commit/1d0402cf65458c941639f01334a996c11e592018
Submitter: Jenkins
Branch: master

commit 1d0402cf65458c941639f01334a996c11e592018
Author: Adam Gandelman <email address hidden>
Date: Thu Oct 4 14:10:32 2012 -0700

    Set read_deleted='yes' for instance_id_mappings.

    Since the migration that creates the instance_id_mappings does
    not populated the 'deleted' column, queries to this table should
    not limit results to 'deleted=0'. Limiting to non-deleted rows
    results in duplicate mappings being created for existing instance
    mappings after an upgrade, and throws off the entire EC2 instance
    ID to UUID mapping.

    Fixes LP: #1061166

    Change-Id: I8893954fcae94a71dcc284c1b3b23b53901437eb

Changed in nova:
status: In Progress → Fix Committed
Jon Proulx (jproulx) wrote :

This doesn't seem to be working for me. Perhaps I implemented it incorrectly, or need to take further remedial action?

(using the Ubuntu 12.04 cloud archive version of nova-api as a base)

I copied the patched version of nova/db/sqlalchemy/api.py to

/usr/lib/python2.7/dist-packages/nova/db/sqlalchemy/api.py
and
/usr/local/nova/nova/db/sqlalchemy/api.py

then pycompiled /usr/lib/python2.7/dist-packages/nova/db/sqlalchemy/api.py (as there was an existing file of that name). I've restarted the api service and even rebooted the system, and I still get the InstanceNotFound errors when attempting to delete (euca-terminate-instances) or assign floating IPs (euca-associate-address).

Adam Gandelman (gandelman-a) wrote :

Jon-

Unfortunately, if you've hit this bug, your instance_id_mappings table is already polluted with duplicate entries and the mapping between EC2 IDs and UUIDs is skewed. Someone may know a better workaround, but in the meantime I'd take a look at the instance_id_mappings table and clean it of duplicate entries, ensuring the original rows remain (those copied during the original migration) if you wish to get back the original EC2 IDs.
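
One way the cleanup could look, sketched with Python's sqlite3 module on a trimmed-down copy of the table. It assumes the original (migrated) row is always the one with the lowest id for a given UUID, which matches the rows shown earlier in this report but should be verified on a real database first; remove_duplicate_mappings is a hypothetical helper, not something nova provides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE instance_id_mappings ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " uuid TEXT NOT NULL,"
    " deleted INTEGER)"
)
conn.executemany(
    "INSERT INTO instance_id_mappings (id, uuid, deleted) VALUES (?, ?, ?)",
    [
        (3, "7ca30bdc-46f3-4a2d-a0fb-f7657d60e8a8", None),  # original
        (4, "bf7ecf97-0814-4b5e-b5f1-4a7cecd8a43f", None),  # original
        (5, "7ca30bdc-46f3-4a2d-a0fb-f7657d60e8a8", 0),     # duplicate
        (6, "bf7ecf97-0814-4b5e-b5f1-4a7cecd8a43f", 0),     # duplicate
    ],
)


def remove_duplicate_mappings(conn):
    """Delete every row except the lowest-id row per UUID; return rows deleted."""
    cur = conn.execute(
        "DELETE FROM instance_id_mappings"
        " WHERE id NOT IN ("
        "   SELECT MIN(id) FROM instance_id_mappings GROUP BY uuid)"
    )
    conn.commit()
    return cur.rowcount


removed = remove_duplicate_mappings(conn)
remaining_ids = [
    row[0] for row in conn.execute("SELECT id FROM instance_id_mappings ORDER BY id")
]
print(removed, remaining_ids)
```

MySQL rejects a subquery that reads from the table being deleted from, so on the real database the inner SELECT would need to be wrapped in a derived table (or the surviving ids staged in a temporary table). Take a backup before running anything like this.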

Jon Proulx (jproulx) wrote :

I see that for instances that were running during the upgrade, but I'm getting this error with new instances that do not have duplicate entries:

euca-describe-instances i-00004fb6
RESERVATION r-ttje080y 98333a1a28e746fa8c629c83a818ad57 open
INSTANCE i-00004fb6 ami-00000027 flood-x26 flood-x26 running 0 m1.tiny 2012-10-09T14:42:16.000Z nova

euca-terminate-instances i-00004fb6
InstanceNotFound: Instance i-00004fb6 could not be found.

nova show shows this as having UUID 3f5d9fbf-eb02-4531-bacd-90df69f1233a and converting the ec2 id to decimal gives me ID 20406

mysql> select * from instance_id_mappings where uuid='3f5d9fbf-eb02-4531-bacd-90df69f1233a' or id=20406 ;
+---------------------+------------+------------+---------+-------+--------------------------------------+
| created_at | updated_at | deleted_at | deleted | id | uuid |
+---------------------+------------+------------+---------+-------+--------------------------------------+
| 2012-10-09 14:42:16 | NULL | NULL | 0 | 20406 | 3f5d9fbf-eb02-4531-bacd-90df69f1233a |
+---------------------+------------+------------+---------+-------+--------------------------------------+
1 row in set (0.00 sec)

mysql> select created_at,updated_at,deleted_at,id,uuid,hostname from instances where uuid='3f5d9fbf-eb02-4531-bacd-90df69f1233a' or id=20406 ;
+---------------------+---------------------+------------+-------+--------------------------------------+-----------+
| created_at | updated_at | deleted_at | id | uuid | hostname |
+---------------------+---------------------+------------+-------+--------------------------------------+-----------+
| 2012-10-09 14:42:16 | 2012-10-09 14:42:47 | NULL | 20297 | 3f5d9fbf-eb02-4531-bacd-90df69f1233a | flood-x26 |
+---------------------+---------------------+------------+-------+--------------------------------------+-----------+
1 row in set (0.00 sec)

Do I need to ensure instance_id_mappings.id == instances.id? That's a little ugly, but possible to fix up on my end.
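
The hex-to-decimal conversion used above can be sketched as follows. The helper names ec2_id_to_int and int_to_ec2_id are hypothetical and written only for this illustration; they are not nova's ec2utils functions.

```python
def ec2_id_to_int(ec2_id):
    """Parse the hex suffix of an EC2-style ID such as 'i-00004fb6'."""
    _prefix, _sep, hex_part = ec2_id.rpartition("-")
    return int(hex_part, 16)


def int_to_ec2_id(number, prefix="i"):
    """Format an integer as an EC2-style ID with eight hex digits."""
    return "%s-%08x" % (prefix, number)


mapping_id = ec2_id_to_int("i-00004fb6")  # the id in instance_id_mappings
instances_row_id = 20297                  # the id in the instances table above
print(mapping_id, int_to_ec2_id(instances_row_id))
```

Here i-00004fb6 decodes to 20406, the id in instance_id_mappings, while the instances row has id 20297 (which would format as i-00004f49). Any call that resolves the EC2 ID to an integer and looks it up against instances.id would therefore presumably miss this row.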

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/folsom)

Fix proposed to branch: stable/folsom
Review: https://review.openstack.org/14240

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package nova - 2012.2-0ubuntu4

---------------
nova (2012.2-0ubuntu4) quantal; urgency=low

  * debian/patches/ubuntu/ubuntu-fix-ec2-instance-id-mappings.patch:
    Backport from trunk, Set read_deleted='yes' for instance_id_mappings.
    (LP: #1061166)
 -- Chuck Short <email address hidden> Tue, 09 Oct 2012 11:51:15 -0500

Changed in nova (Ubuntu):
status: Confirmed → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/folsom)

Reviewed: https://review.openstack.org/14240
Committed: http://github.com/openstack/nova/commit/b7e509af5d8bc3f9e26dea5cf5121c0f62348dc0
Submitter: Jenkins
Branch: stable/folsom

commit b7e509af5d8bc3f9e26dea5cf5121c0f62348dc0
Author: Adam Gandelman <email address hidden>
Date: Thu Oct 4 14:10:32 2012 -0700

    Set read_deleted='yes' for instance_id_mappings.

    Since the migration that creates the instance_id_mappings does
    not populated the 'deleted' column, queries to this table should
    not limit results to 'deleted=0'. Limiting to non-deleted rows
    results in duplicate mappings being created for existing instance
    mappings after an upgrade, and throws off the entire EC2 instance
    ID to UUID mapping.

    Fixes LP: #1061166

    Change-Id: I8893954fcae94a71dcc284c1b3b23b53901437eb
    (cherry picked from commit 1d0402cf65458c941639f01334a996c11e592018)

tags: added: in-stable-folsom
Jon Proulx (jproulx) wrote :

To answer my previous question for posterity: yes. After removing duplicates and resetting the auto increment as described above, any mismatch between the ids in instances and instance_id_mappings must be rectified:

UPDATE instance_id_mappings,instances SET instance_id_mappings.id=instances.id WHERE instances.uuid=instance_id_mappings.uuid and instance_id_mappings.id<>instances.id;

It seems to me that instance_id_mappings only contains duplicated information, which can only lead to suffering like this. I guess I should go read the code and revision history if I want to see how this came to be.

Changed in nova (Ubuntu):
assignee: nobody → sushanta mishra (sushanta099)
assignee: sushanta mishra (sushanta099) → nobody
Thierry Carrez (ttx)
Changed in nova:
milestone: none → grizzly-1
status: Fix Committed → Fix Released
Mark McLoughlin (markmc)
tags: removed: in-stable-folsom
Thierry Carrez (ttx)
Changed in nova:
milestone: grizzly-1 → 2013.1