Upgrade to Jewel impacts cluster by taking entire node offline

Bug #1662591 reported by Billy Olsen
This bug affects 2 people
Affects | Status | Importance | Assigned to | Milestone
Ceph OSD Charm | Fix Released | High | Billy Olsen |
OpenStack Ceph Charm (Retired) | Won't Fix | High | Unassigned |
charms.ceph | Fix Released | High | Unassigned |
ceph (Juju Charms Collection) | Invalid | Undecided | Unassigned |
ceph-osd (Juju Charms Collection) | Invalid | Critical | Billy Olsen |

Bug Description

When upgrading from Hammer to Jewel, all of the OSDs on a node are stopped for the duration of the recursive chown of /var/lib/ceph. In production environments this recursive chown is non-trivial and can take hours to complete. While all of the node's OSDs are down, the Ceph cluster is effectively running with one node missing.

If the user has set noout on the cluster, the OSDs will not be marked out, but they may have a large amount of backfilling to do when restarted. During this period the cluster is at greater risk from the outage of another OSD or node. Nodes are becoming denser and disks larger (10-20 OSDs per node at ~4-8 TB per OSD is becoming common), so this approach does not scale to production.

The upgrade process should make use of the `setuser match path /var/lib/ceph/$type/$cluster-$id` option and change the ownership of one OSD at a time when performing a rolling upgrade across the cluster. With `setuser match path` set, an OSD that is restarted before its ownership change has happened runs as the user and group that own the OSD's root directory, so it can still start.
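
The ownership check that `setuser match path` implies can be modelled in a few lines of Python. This is only a sketch of the behaviour described above, not Ceph's actual implementation; the function name `identity_for` and its return strings are hypothetical.

```python
import os


def identity_for(data_dir, target_uid, target_gid):
    """Model a `setuser match path`-style decision (hypothetical helper).

    If the data directory is already owned by the target user and group,
    the daemon may drop privileges to that user. Otherwise it keeps the
    identity it was started with, so it can still read its own data.
    """
    st = os.stat(data_dir)
    if st.st_uid == target_uid and st.st_gid == target_gid:
        return "drop-privileges"
    return "keep-current-user"
```

Under this model an OSD whose directory is still root:root keeps running as root after a restart, while one whose directory has already been changed to ceph:ceph runs as ceph.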

Changed in ceph-osd (Juju Charms Collection):
milestone: none → 17.01
importance: Undecided → Critical
James Page (james-page)
Changed in ceph (Juju Charms Collection):
status: New → Invalid
Changed in charm-ceph-osd:
assignee: nobody → Billy Olsen (billy-olsen)
importance: Undecided → Critical
status: New → In Progress
Changed in ceph-osd (Juju Charms Collection):
status: In Progress → Invalid
James Page (james-page)
Changed in charm-ceph-osd:
importance: Critical → High
milestone: none → 17.05
OpenStack Infra (hudson-openstack) wrote : Fix merged to charms.ceph (master)

Reviewed: https://review.openstack.org/430062
Committed: https://git.openstack.org/cgit/openstack/charms.ceph/commit/?id=c421aa742909f78c8b7c9a4548874795b70dad87
Submitter: Jenkins
Branch: master

commit c421aa742909f78c8b7c9a4548874795b70dad87
Author: Billy Olsen <email address hidden>
Date: Mon Feb 6 21:59:51 2017 -0700

    Roll osd ownership changes through node

    Change the OSD upgrade path so that the file ownership change
    for the OSD directories are run one OSD at a time rather than
    all of the OSDs at once.

    Partial-Bug: #1662591

    Change-Id: I3a1cf05207c070a8699e7ba749a0587b619d4679

OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-osd (master)

Fix proposed to branch: master
Review: https://review.openstack.org/445265

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to charms.ceph (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/448760

OpenStack Infra (hudson-openstack) wrote : Related fix merged to charms.ceph (master)

Reviewed: https://review.openstack.org/448760
Committed: https://git.openstack.org/cgit/openstack/charms.ceph/commit/?id=cd75b59b3381b2f6ac64f41fa90586d6366ddd60
Submitter: Jenkins
Branch: master

commit cd75b59b3381b2f6ac64f41fa90586d6366ddd60
Author: Billy Olsen <email address hidden>
Date: Wed Mar 22 12:24:39 2017 -0700

    Fix typo when listing dirs

    The directory listing incorrectly uses os.path.listdir but the
    correct call is os.listdir.

    Change-Id: I2fe87c8a007cfdb0d54395e78597c98647cb661c
    Related-Bug: #1662591
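
For context on the typo fixed above: `listdir` lives in the `os` module, not in `os.path`. The sketch below lists the per-OSD directories under a parent directory using `os.listdir`; the helper name `list_osd_dirs` is hypothetical and is not taken from charms.ceph.

```python
import os


def list_osd_dirs(osd_root):
    """Return the sorted full paths of the subdirectories of osd_root."""
    return sorted(
        os.path.join(osd_root, name)
        for name in os.listdir(osd_root)  # os.path.listdir does not exist
        if os.path.isdir(os.path.join(osd_root, name))
    )
```

Calling `os.path.listdir(...)` raises `AttributeError`, which only shows up when that code path actually runs.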

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to charms.ceph (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/449439

OpenStack Infra (hudson-openstack) wrote : Related fix merged to charms.ceph (master)

Reviewed: https://review.openstack.org/449439
Committed: https://git.openstack.org/cgit/openstack/charms.ceph/commit/?id=80c8256fb632e5f783b32ef662efeeffa7c51a53
Submitter: Jenkins
Branch: master

commit 80c8256fb632e5f783b32ef662efeeffa7c51a53
Author: Billy Olsen <email address hidden>
Date: Thu Mar 23 23:28:09 2017 -0700

    Add unit tests and fix related bugs

    This change adds more unit tests to the rolling OSD upgrade
    scenario to ensure more complete coverage.

    Change-Id: Ibd7c8d6c46520957a3298446efc6b5fff210a51a
    Related-Bug: #1662591

James Page (james-page)
Changed in charm-ceph:
importance: Undecided → Medium
importance: Medium → High
status: New → Triaged
Changed in charms.ceph:
status: New → Triaged
importance: Undecided → High
Changed in charm-ceph:
milestone: none → 17.05
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-osd (master)

Reviewed: https://review.openstack.org/445265
Committed: https://git.openstack.org/cgit/openstack/charm-ceph-osd/commit/?id=2c5406b6b31f77776cd31fae02aefdc1c9bf00de
Submitter: Jenkins
Branch: master

commit 2c5406b6b31f77776cd31fae02aefdc1c9bf00de
Author: Billy Olsen <email address hidden>
Date: Mon Mar 13 16:54:18 2017 -0700

    Upgrade OSDs one at a time when changing ownership

    Some upgrade scenarios (hammer->jewel) require that the ownership
    of the ceph osd directories are changed from root:root to ceph:ceph.
    This patch improves the upgrade experience by upgrading one OSD at
    a time as opposed to stopping all services, changing file ownership,
    and then restarting all services at once.

    This patch makes use of the `setuser match path` directive in the
    ceph.conf, which causes the ceph daemon to start as the owner of the
    OSD's root directory. This allows the ceph OSDs to continue running
    should an unforeseen incident occur as part of this upgrade.

    Change-Id: I00fdbe0fd113c56209429341f0a10797e5baee5a
    Closes-Bug: #1662591
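
The one-OSD-at-a-time approach described in this commit message can be sketched roughly as follows. This is not the charm's code: `chown_tree`, `roll_osd_ownership`, and the `stop_osd` / `start_osd` callables are hypothetical names, and changing the top-level directory last is a design choice of this sketch, not something the commit states.

```python
import os


def chown_tree(path, uid, gid):
    """Recursively change ownership, leaving the top-level directory for last.

    The top-level directory is what `setuser match path` inspects, so
    changing it last means an interrupted run leaves it with its old owner.
    """
    for dirpath, dirnames, filenames in os.walk(path, topdown=False):
        for name in dirnames + filenames:
            # lchown changes a symlink itself rather than its target.
            os.lchown(os.path.join(dirpath, name), uid, gid)
    os.lchown(path, uid, gid)


def roll_osd_ownership(osd_dirs, uid, gid, stop_osd, start_osd, chown=chown_tree):
    """Stop, re-own, and restart each OSD in turn so only one is down at a time."""
    for osd_dir in osd_dirs:
        # Directory names are assumed to follow the $cluster-$id pattern.
        osd_id = os.path.basename(osd_dir).rsplit("-", 1)[-1]
        stop_osd(osd_id)
        try:
            chown(osd_dir, uid, gid)
        finally:
            # Restart even if the chown failed; the daemon then runs as
            # whoever still owns the top-level directory.
            start_osd(osd_id)
```

Here `stop_osd` and `start_osd` would be thin wrappers around whatever the host's init system uses to control a single OSD daemon. They are passed in as parameters so the rolling logic can be tested without touching real services.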

Changed in charm-ceph-osd:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-osd (stable/17.02)

Fix proposed to branch: stable/17.02
Review: https://review.openstack.org/462320

James Page (james-page)
Changed in charm-ceph:
milestone: 17.05 → 17.08
James Page (james-page)
Changed in charm-ceph-osd:
milestone: 17.05 → 17.08
tags: added: stable-backport
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-osd (stable/17.02)

Reviewed: https://review.openstack.org/462320
Committed: https://git.openstack.org/cgit/openstack/charm-ceph-osd/commit/?id=f5b5217ae6e91a91849c55ac23cbf477e3325aba
Submitter: Jenkins
Branch: stable/17.02

commit f5b5217ae6e91a91849c55ac23cbf477e3325aba
Author: Billy Olsen <email address hidden>
Date: Wed May 3 16:37:29 2017 -0700

    Upgrade OSDs one at a time when changing ownership

    Some upgrade scenarios (hammer->jewel) require that the ownership
    of the ceph osd directories are changed from root:root to ceph:ceph.
    This patch improves the upgrade experience by upgrading one OSD at
    a time as opposed to stopping all services, changing file ownership,
    and then restarting all services at once.

    This patch makes use of the `setuser match path` directive in the
    ceph.conf, which causes the ceph daemon to start as the owner of the
    OSD's root directory. This allows the ceph OSDs to continue running
    should an unforeseen incident occur as part of this upgrade.

    Note: this cherry-pick excludes the charmhelpers changes from the
    original sync and instead includes only an update from the stable
    charm-helpers branch to include fixes for amulet tests.

    Closes-Bug: #1662591
    (cherry-picked from commit 2c5406b6b31f77776cd31fae02aefdc1c9bf00de)

    Change-Id: I92eba1b41e103d7db45ac599575437c8877607f7

Changed in charm-ceph-osd:
status: Fix Committed → Fix Released
James Page (james-page)
Changed in charm-ceph:
milestone: 17.08 → 17.11
James Page (james-page)
Changed in charms.ceph:
status: Triaged → Fix Released
James Page (james-page)
Changed in charm-ceph:
milestone: 17.11 → 18.02
Chris MacNaughton (chris.macnaughton) wrote :

This is being marked as Won't Fix because there is a documented migration path to ceph-mon + ceph-osd, so no fix will be made to charm-ceph.

Changed in charm-ceph:
status: Triaged → Won't Fix
Felipe Reyes (freyes) wrote :