Resource tracker regressed reporting negative memory

Bug #1698383 reported by Dan Smith
This bug report is a duplicate of: Bug #1635367: Ram filter is broken since Mitaka.
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Undecided
Assigned to: Dan Smith

Bug Description

Nova's resource tracker is expected to publish negative values to the scheduler when resources are overcommitted. Nova's scheduler expects this:

https://github.com/openstack/nova/blob/a43dbba2b8feea063ed2d0c79780b4c3507cf89b/nova/scheduler/host_manager.py#L215

In change https://review.openstack.org/#/c/306670, these values were clamped so that they never drop below zero, which is incorrect. That change made a complex alteration for ironic and cells, specifically to keep resources from ironic nodes from showing up as negative when the nodes were unavailable. That was a cosmetic fix, and I believe the ironic case has since been corrected, for ironic only, in this patch:

https://review.openstack.org/#/c/230487/

Regardless, the scheduler does the same calculation to determine available resources on the node. If the node reports 0 when the scheduler would calculate -100 for a given resource, the scheduler will assume the node still has room (due to oversubscription) and will send builds there that are destined to fail.
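To make the mismatch concrete, here is a minimal sketch of the arithmetic. It is a simplified model, not Nova's code: the function names and the numbers are made up for illustration, and the scheduler side is assumed to derive used RAM from the reported free RAM and compare it against an overcommit limit.

```python
# Simplified model of the compute/scheduler RAM math. All names here are
# hypothetical; this is not the actual Nova implementation.

def reported_free_ram_mb(total_mb, used_mb, clamp_to_zero):
    """Free RAM as the compute node would report it."""
    free = total_mb - used_mb
    # The regression clamped this value so it never went below zero.
    return max(free, 0) if clamp_to_zero else free


def scheduler_usable_ram_mb(total_mb, free_mb, allocation_ratio):
    """Room the scheduler believes is left, given the reported free RAM."""
    limit = total_mb * allocation_ratio   # overcommit ceiling
    used = total_mb - free_mb             # usage inferred from the report
    return limit - used


TOTAL_MB = 1024
USED_MB = 1536        # node is already at its 1.5x overcommit ceiling
RATIO = 1.5

honest_free = reported_free_ram_mb(TOTAL_MB, USED_MB, clamp_to_zero=False)
clamped_free = reported_free_ram_mb(TOTAL_MB, USED_MB, clamp_to_zero=True)

honest_room = scheduler_usable_ram_mb(TOTAL_MB, honest_free, RATIO)
clamped_room = scheduler_usable_ram_mb(TOTAL_MB, clamped_free, RATIO)

print(honest_free, honest_room)    # negative free RAM, no room left
print(clamped_free, clamped_room)  # zero free RAM, scheduler sees spare room
```

With the negative value reported, the inferred usage matches the real usage and the node is correctly seen as full. With the clamped value, the inferred usage is capped at the physical total, so the scheduler still sees headroom up to the overcommit limit and keeps sending builds to the node.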

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/474994

Changed in nova:
assignee: nobody → Dan Smith (danms)
status: New → In Progress
Revision history for this message
Matt Riedemann (mriedem) wrote :

This is probably a duplicate of bug 1635367, and someone has been working on a fix for that for a while:

https://review.openstack.org/#/c/390984/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/474994
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=0ddf3ce01149d78ee0cf8f7497f8a9074c6f167d
Submitter: Jenkins
Branch: master

commit 0ddf3ce01149d78ee0cf8f7497f8a9074c6f167d
Author: Dan Smith <email address hidden>
Date: Fri Jun 16 07:25:40 2017 -0700

    Fix regression preventing reporting negative resources for overcommit

    In Nova prior to Ocata, the scheduler computes available resources for
    a compute node, attempting to mirror the same calculation that happens
    locally. It does this to determine if a new instance should fit on the
    node. If overcommit is being used, some of these numbers can be negative.

    In change 016b810f675b20e8ce78f4c82dc9c679c0162b7a we changed the
    compute side to never report negative resources, which was an ironic-
    specific fix for nodes that are offline. That, however, has been
    corrected for ironic nodes in 047da6498dbb3af71bcb9e6d0e2c38aa23b06615.
    Since the base change to the resource tracker has caused the scheduler
    and compute to do different math, we need to revert it to avoid the
    scheduler sending instances to nodes where it believes -NNN is the
    lower limit (with overcommit), but the node is reporting zero.

    This doesn't actually affect Ocata because of our use of the placement
    engine. However, this code is still in master and needs to be backported.
    This part of the change actually didn't even have a unit test, so
    this patch adds one to validate that the resource tracker will
    calculate and report negative resources.

    Change-Id: I25ba6f7f4e4fab6db223368427d889d6b06a77e8
    Closes-Bug: #1698383
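The kind of unit test the commit message describes might look roughly like the sketch below. It is an assumption-laden illustration, not the test that was merged: FakeComputeNode and apply_usage are hypothetical stand-ins for the compute node record and the resource tracker's usage update, and the field names are chosen only for readability.

```python
import unittest


class FakeComputeNode:
    """Hypothetical stand-in for a compute node record."""

    def __init__(self, memory_mb, local_gb):
        self.memory_mb = memory_mb
        self.local_gb = local_gb
        self.memory_mb_used = 0
        self.local_gb_used = 0
        self.free_ram_mb = memory_mb
        self.free_disk_gb = local_gb


def apply_usage(node, memory_mb, local_gb):
    """Hypothetical usage update: free values are NOT clamped at zero."""
    node.memory_mb_used += memory_mb
    node.local_gb_used += local_gb
    node.free_ram_mb = node.memory_mb - node.memory_mb_used
    node.free_disk_gb = node.local_gb - node.local_gb_used


class TestNegativeResources(unittest.TestCase):
    def test_overcommit_reports_negative_values(self):
        node = FakeComputeNode(memory_mb=512, local_gb=10)
        # Consume more than the node physically has (overcommit).
        apply_usage(node, memory_mb=1024, local_gb=20)
        self.assertEqual(-512, node.free_ram_mb)
        self.assertEqual(-10, node.free_disk_gb)
```

Saved to a file, this can be run with `python -m unittest`. The point of such a test is to fail loudly if a later change reintroduces clamping on the compute side.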

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/475044

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/475057

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/475067

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/ocata)

Reviewed: https://review.openstack.org/475044
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=3284851437e24250d46edba20789a2e5f1f435a0
Submitter: Jenkins
Branch: stable/ocata

commit 3284851437e24250d46edba20789a2e5f1f435a0
Author: Dan Smith <email address hidden>
Date: Fri Jun 16 07:25:40 2017 -0700

    Fix regression preventing reporting negative resources for overcommit

    In Nova prior to Ocata, the scheduler computes available resources for
    a compute node, attempting to mirror the same calculation that happens
    locally. It does this to determine if a new instance should fit on the
    node. If overcommit is being used, some of these numbers can be negative.

    In change 016b810f675b20e8ce78f4c82dc9c679c0162b7a we changed the
    compute side to never report negative resources, which was an ironic-
    specific fix for nodes that are offline. That, however, has been
    corrected for ironic nodes in 047da6498dbb3af71bcb9e6d0e2c38aa23b06615.
    Since the base change to the resource tracker has caused the scheduler
    and compute to do different math, we need to revert it to avoid the
    scheduler sending instances to nodes where it believes -NNN is the
    lower limit (with overcommit), but the node is reporting zero.

    This doesn't actually affect Ocata because of our use of the placement
    engine. However, this code is still in master and needs to be backported.
    This part of the change actually didn't even have a unit test, so
    this patch adds one to validate that the resource tracker will
    calculate and report negative resources.

    Conflicts:
          nova/compute/resource_tracker.py
          nova/tests/unit/compute/test_resource_tracker.py

    NOTE(mriedem): The conflict is due to change
    I80ba844a6e0fcea89f80aa253d57ac73092773ae not being in Ocata.

    Change-Id: I25ba6f7f4e4fab6db223368427d889d6b06a77e8
    Closes-Bug: #1698383
    (cherry picked from commit 0ddf3ce01149d78ee0cf8f7497f8a9074c6f167d)

tags: added: in-stable-ocata
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/newton)

Reviewed: https://review.openstack.org/475057
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=d8b30c3772dae32ac4cbedb659f6d08eb795425a
Submitter: Jenkins
Branch: stable/newton

commit d8b30c3772dae32ac4cbedb659f6d08eb795425a
Author: Dan Smith <email address hidden>
Date: Fri Jun 16 07:25:40 2017 -0700

    Fix regression preventing reporting negative resources for overcommit

    In Nova prior to Ocata, the scheduler computes available resources for
    a compute node, attempting to mirror the same calculation that happens
    locally. It does this to determine if a new instance should fit on the
    node. If overcommit is being used, some of these numbers can be negative.

    In change 016b810f675b20e8ce78f4c82dc9c679c0162b7a we changed the
    compute side to never report negative resources, which was an ironic-
    specific fix for nodes that are offline. That, however, has been
    corrected for ironic nodes in 047da6498dbb3af71bcb9e6d0e2c38aa23b06615.
    Since the base change to the resource tracker has caused the scheduler
    and compute to do different math, we need to revert it to avoid the
    scheduler sending instances to nodes where it believes -NNN is the
    lower limit (with overcommit), but the node is reporting zero.

    This doesn't actually affect Ocata because of our use of the placement
    engine. However, this code is still in master and needs to be backported.
    This part of the change actually didn't even have a unit test, so
    this patch adds one to validate that the resource tracker will
    calculate and report negative resources.

    Conflicts:
          nova/compute/resource_tracker.py

    NOTE(mriedem): The conflict is due to change
    I6827137f35c0cb4f9fc4c6f753d9a035326ed01b not being in Newton.

    Change-Id: I25ba6f7f4e4fab6db223368427d889d6b06a77e8
    Closes-Bug: #1698383
    (cherry picked from commit 0ddf3ce01149d78ee0cf8f7497f8a9074c6f167d)
    (cherry picked from commit 3284851437e24250d46edba20789a2e5f1f435a0)

tags: added: in-stable-newton
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/mitaka)

Change abandoned by Matt Riedemann (<email address hidden>) on branch: stable/mitaka
Review: https://review.openstack.org/475067
Reason: Identity tests in Tempest are borked on Mitaka, so I'm just going to abandon this.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 16.0.0.0b3

This issue was fixed in the openstack/nova 16.0.0.0b3 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 15.0.7

This issue was fixed in the openstack/nova 15.0.7 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 14.0.8

This issue was fixed in the openstack/nova 14.0.8 release.
