[quincy] invalid osd_class_dir blocks rados client connections

Bug #1986747 reported by Dan Hill
This bug affects 5 people
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Ubuntu Cloud Archive | Fix Released | Undecided | Unassigned | |
| Yoga | Fix Released | Undecided | Unassigned | |
| ceph (Ubuntu) | Fix Released | Critical | Chris MacNaughton | |
| Jammy | Fix Released | Critical | Chris MacNaughton | |
| Kinetic | Fix Released | Critical | Chris MacNaughton | |

Bug Description

[Impact]

A Ceph cluster whose hosts were originally installed with an Ubuntu release earlier than 19.04 will break upon upgrade to Ceph Quincy (Jammy and Kinetic).

Ubuntu packaging is configuring `osd_class_dir` with a relative path `CMAKE_INSTALL_LIBDIR` instead of the required absolute path `CMAKE_INSTALL_FULL_LIBDIR` [0].

The default value for `osd_class_dir` changed in Quincy, starting with v17.1.0 [1].

The ceph-osd service relies on the `osd_class_dir` path to find and load class libraries that extend RADOS [2]. When this is set incorrectly, RADOS clients fail with repeated "Operation not supported" errors:
```
2022-08-16T17:42:15.044+0000 7fe375685e40 0 rgw main: ERROR: failed reading data (obj=default.rgw.log:bucket.sync-target-hints.), r=-95
2022-08-16T17:42:15.048+0000 7fe375685e40 0 rgw main: ERROR: failed to read targets index for bucket=:[]) r=-95
2022-08-16T17:42:15.048+0000 7fe375685e40 0 rgw main: ERROR: failed to initialize bucket sync policy handler: get_bucket_sync_hints() on bucket=-- returned r=-95
2022-08-16T17:42:15.048+0000 7fe375685e40 -1 rgw main: ERROR: could not initialize zone policy handler for zone=default
2022-08-16T17:42:15.048+0000 7fe375685e40 0 rgw main: ERROR: failed to start notify service ((95) Operation not supported
2022-08-16T17:42:15.048+0000 7fe375685e40 0 rgw main: ERROR: failed to init services (ret=(95) Operation not supported)
```

The ceph-osd service will also report `_load_class` errors:
```
2022-08-16T19:05:55.562+0000 7f4770ff9700 0 _load_class could not stat class lib/x86_64-linux-gnu/rados-classes/libcls_rbd.so: (2) No such file or directory
```
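
To spot this failure in OSD logs, the error line above can be matched and its path checked for a missing leading slash. This is only a sketch: the regular expression is inferred from the single log line shown here, and the helper name `relative_class_paths` is hypothetical, not part of Ceph.

```python
import re

# Pattern inferred from the one "_load_class" log line quoted in this bug;
# other Ceph versions may format the message differently (assumption).
LOAD_CLASS_RE = re.compile(r"_load_class could not stat class (?P<path>\S+): ")


def relative_class_paths(log_lines):
    """Return class library paths from _load_class errors that are not absolute."""
    paths = []
    for line in log_lines:
        match = LOAD_CLASS_RE.search(line)
        if match and not match.group("path").startswith("/"):
            paths.append(match.group("path"))
    return paths


sample = (
    "2022-08-16T19:05:55.562+0000 7f4770ff9700 0 _load_class could not stat class "
    "lib/x86_64-linux-gnu/rados-classes/libcls_rbd.so: (2) No such file or directory"
)
print(relative_class_paths([sample]))
```

A non-empty result suggests the OSD is resolving `osd_class_dir` as a relative path.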

Admins can resolve this issue by manually setting `osd_class_dir` to the correct value. Run the following command on a ceph-mon:
```
sudo ceph config set global osd_class_dir /usr/lib/x86_64-linux-gnu/rados-classes
```

Then restart all ceph-osd services to pick up the new `osd_class_dir` location.
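
If the workaround is scripted, it may be worth rejecting relative paths before calling `ceph`, since a relative value is the root of this bug. The sketch below only builds the command shown above; the helper name `workaround_command` is hypothetical, and it does not run anything.

```python
import posixpath
import shlex


def workaround_command(class_dir):
    """Build the 'ceph config set' workaround command, refusing relative paths."""
    if not posixpath.isabs(class_dir):
        raise ValueError(f"osd_class_dir must be an absolute path, got {class_dir!r}")
    return ["sudo", "ceph", "config", "set", "global", "osd_class_dir", class_dir]


# Prints the command for review; it is not executed here.
print(shlex.join(workaround_command("/usr/lib/x86_64-linux-gnu/rados-classes")))
```

The library directory is architecture-specific, so the path passed in should match the host (the value above is the x86_64 one from this report).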

[0] https://cmake.org/cmake/help/v3.24/module/GNUInstallDirs.html#result-variables
[1] https://github.com/ceph/ceph/commit/3bee4b02611459b9ae949cebf5967e4d83ef55de
[2] https://docs.ceph.com/en/latest/dev/osd-class-path/

[Test Plan]

1. Install Ceph at Bionic
2. Upgrade through to Jammy
  a. Confirm that client usage is broken
3. Upgrade to Jammy-proposed
  a. Confirm that client usage works again

In addition to checking client activity, confirm that the OSD logs contain no errors about failing to load classes.

[Where problems could occur]

Problems could occur as a result of library paths changing, so Ceph functionality should be verified. This will be done with functional tests of Ceph using the Ceph Juju charms.

Dan Hill (hillpd)
tags: added: seg sts
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in ceph (Ubuntu Jammy):
status: New → Confirmed
Changed in ceph (Ubuntu):
status: New → Confirmed
Dan Hill (hillpd) wrote :

While attempting to recreate this issue, we noticed a new Jammy/Quincy installation will load the correct library files!

The freshly deployed Quincy reports the incorrect `osd_class_dir` (lib/x86_64-linux-gnu/rados-classes), but lsof shows the ceph-osd process is loading rados-classes libraries from /usr/lib:
ceph-osd 735836 64045 mem REG 8,2 530688 35521076 /usr/lib/x86_64-linux-gnu/rados-classes/libcls_rbd.so.1.0.0

How is this happening?

It turns out that the correct library file is loaded by accident. New Jammy installations have /lib symlinked to /usr/lib. The ceph-osd processes have a working directory of "/" so the relative "lib/" path follows the symlink to the correct "/usr/lib" location.
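
The effect of the symlink can be reproduced in a throwaway directory tree. This is an illustrative sketch only: it mimics the two host layouts under a temporary root rather than inspecting a real system, and the helper names `populate_root` and `class_lib_found` are hypothetical.

```python
import os
import tempfile

CLASS_DIR = "lib/x86_64-linux-gnu/rados-classes"  # relative value from this bug
CLASS_LIB = "libcls_rbd.so"


def populate_root(root, merged_usr):
    """Create a fake root with the class library under usr/lib only."""
    real_dir = os.path.join(root, "usr", CLASS_DIR)
    os.makedirs(real_dir)
    open(os.path.join(real_dir, CLASS_LIB), "w").close()
    if merged_usr:
        # merged-usr layout: lib is a symlink to usr/lib
        os.symlink("usr/lib", os.path.join(root, "lib"))
    else:
        # upgraded host: lib is a separate, real directory
        os.mkdir(os.path.join(root, "lib"))


def class_lib_found(root):
    """Resolve the relative class path against root, as a process with that cwd would."""
    return os.path.exists(os.path.join(root, CLASS_DIR, CLASS_LIB))


for merged in (True, False):
    with tempfile.TemporaryDirectory() as root:
        populate_root(root, merged)
        print(f"merged_usr={merged}: library found = {class_lib_found(root)}")
```

With the symlink in place the relative path reaches the library; without it the lookup fails, which matches the `_load_class` error in the description.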

Starting with 19.04, merged-usr is the default for new installations [0]. The merged-usr feature [1] is what creates the symlink from /lib -> /usr/lib.

The key thing to note here is that these symlinks do not get created on hosts that have been upgraded:
"Existing systems, upon upgrade, will not be reconfigured for merged-usr."

Hosts without the /lib -> /usr/lib symlink are exposed to the issue described in this bug. This occurs if the host was originally installed with a release that pre-dates 19.04 (Disco Dingo).

[0] https://lists.ubuntu.com/archives/ubuntu-devel-announce/2018-November/001253.html
[1] https://www.freedesktop.org/wiki/Software/systemd/TheCaseForTheUsrMerge/

Changed in ceph (Ubuntu Jammy):
assignee: nobody → Chris MacNaughton (chris.macnaughton)
Changed in ceph (Ubuntu Kinetic):
assignee: nobody → Chris MacNaughton (chris.macnaughton)
description: updated
Changed in ceph (Ubuntu Jammy):
importance: Undecided → Critical
Changed in ceph (Ubuntu Kinetic):
importance: Undecided → Critical
Pedro Victor Lourenço Fragola (pedrovlf) wrote :

For information: since this was reported by some customers, I tested the scenario:

[Test Plan]
1. Install Ceph on Bionic (Ceph Octopus).
2. Upgrade from Bionic to Focal.
3. Upgrade Ceph from Octopus => Pacific => Quincy using the UCA (focal-yoga) repos.

After that, the error described in this LP occurs on the clients and in ceph-osd. Applying the suggested workaround resolved the issue.

This issue needs to be fixed in the UCA repositories as well.

Timo Aaltonen (tjaalton) wrote :

The Kinetic version is stuck because of an FTBFS (failure to build from source).

Changed in ceph (Ubuntu Jammy):
status: Confirmed → Fix Committed
tags: added: verification-needed verification-needed-jammy
Timo Aaltonen (tjaalton) wrote : Please test proposed package

Hello Dan, or anyone else affected,

Accepted ceph into jammy-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/ceph/17.2.0-0ubuntu0.22.04.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-jammy to verification-done-jammy. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-jammy. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Johan Hallbäck (johan-hallback) wrote :

This bug affects us, and the proposed workaround of setting `osd_class_dir` worked.

We installed our system at bionic-ussuri (UCA), upgraded to focal, then continued all the way to focal-yoga (UCA). Ceph packages from UCA are currently version 17.2.0-0ubuntu0.22.04.1~cloud0.

Applying the `osd_class_dir` workaround and restarting ceph-mgr.target on the ceph-mons also cleared the following problem, which appeared after upgrading to focal-yoga and Ceph Quincy:

$ sudo ceph status
  cluster:
    id: cb562c8a-787a-11ec-a57e-00163e0f587a
    health: HEALTH_ERR
            Module 'devicehealth' has failed: disk I/O error

Pedro Victor Lourenço Fragola (pedrovlf) wrote :

Is it possible to have the ceph*17.2.0-0ubuntu0.22.04.2 package in UCA for testing?

Changed in cloud-archive:
status: New → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package ceph - 17.2.0-0ubuntu4

---------------
ceph (17.2.0-0ubuntu4) kinetic; urgency=medium

  * d/p/fix-kinetic-libfmt.patch: Apply fixes for libfmt > 8.
  * d/p/fix-kinetic-misc.patch: Misc fixes for Ubuntu Kinetic.

 -- Luciano Lo Giudice <email address hidden> Thu, 22 Sep 2022 17:00:12 +0100

Changed in ceph (Ubuntu Kinetic):
status: Confirmed → Fix Released
Pedro Victor Lourenço Fragola (pedrovlf) wrote :

I have tested the UCA proposed packages:

 dpkg -l |grep ceph
ii ceph 17.2.0-0ubuntu0.22.04.2~cloud0 amd64 distributed storage and file system

It works!

root@juju-b47d9d-ceph-11:/etc/apt/sources.list.d# sudo ceph daemon osd.3 config get osd_class_dir
{
    "osd_class_dir": "/usr/lib/x86_64-linux-gnu/rados-classes"
}
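
The same check can be automated by parsing the JSON that `ceph daemon osd.N config get osd_class_dir` prints. This is a sketch: the helper name `class_dir_is_absolute` is hypothetical, the `fixed` string copies the output above, and the `broken` string is an assumed reconstruction of what an affected OSD would report, based on the relative value quoted earlier in this bug.

```python
import json
import posixpath


def class_dir_is_absolute(config_get_output):
    """Return True if the reported osd_class_dir is an absolute path."""
    value = json.loads(config_get_output)["osd_class_dir"]
    return posixpath.isabs(value)


# Output shape taken from the verification above.
fixed = '{"osd_class_dir": "/usr/lib/x86_64-linux-gnu/rados-classes"}'
# Assumed output on an affected OSD (not captured in this report).
broken = '{"osd_class_dir": "lib/x86_64-linux-gnu/rados-classes"}'

print(class_dir_is_absolute(fixed))
print(class_dir_is_absolute(broken))
```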

tags: added: verification-done verification-done-jammy
removed: verification-needed verification-needed-jammy
Brian Murray (brian-murray) wrote : Update Released

The verification of the Stable Release Update for ceph has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package ceph - 17.2.0-0ubuntu0.22.04.2

---------------
ceph (17.2.0-0ubuntu0.22.04.2) jammy; urgency=medium

  * d/p/lp1986747-fix-osd-class-dir.patch: Partially revert upstream
    change that breaks classpath loading (LP: #1986747).

 -- Chris MacNaughton <email address hidden> Thu, 01 Sep 2022 16:30:32 +0100

Changed in ceph (Ubuntu Jammy):
status: Fix Committed → Fix Released
Chris MacNaughton (chris.macnaughton) wrote :

This bug was fixed in the package ceph - 17.2.0-0ubuntu0.22.04.2~cloud0
---------------

 ceph (17.2.0-0ubuntu0.22.04.2~cloud0) focal-yoga; urgency=medium
 .
   * New update for the Ubuntu Cloud Archive.
 .
 ceph (17.2.0-0ubuntu0.22.04.2) jammy; urgency=medium
 .
   * d/p/lp1986747-fix-osd-class-dir.patch: Partially revert upstream
     change that breaks classpath loading (LP: #1986747).

Changed in cloud-archive:
status: Fix Committed → Fix Released