Backport of zero-length gc chain fixes to Luminous

Bug #1843085 reported by Kellen Renshaw
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Ubuntu Cloud Archive | Invalid | Undecided | Unassigned | |
| Queens | Fix Released | High | Dan Hill | |
| Rocky | Fix Released | High | Unassigned | |
| ceph (Ubuntu) | Invalid | Undecided | Unassigned | |
| Bionic | Fix Released | High | Dan Hill | |

Bug Description

[Impact]
Cancelling large S3/Swift object puts may result in garbage collection entries with zero-length chains. RADOS Gateway (rgw) garbage collection does not efficiently process and clean up these zero-length chains.

A large number of zero-length chains will result in rgw processes quickly spinning through the garbage collection lists while doing very little work. This can result in abnormally high CPU utilization and op workloads.

[Test Case]
Modify garbage collection parameters by editing ceph.conf on the target rgw:
```
rgw enable gc threads = false
rgw gc obj min wait = 60
rgw gc processor period = 60
```

Restart the ceph-radosgw service to apply the new configuration:
`sudo systemctl restart ceph-radosgw@rgw.$HOSTNAME`
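
As an optional check, the running daemon can be asked which values it actually loaded. The following is a minimal sketch, assuming the rgw admin socket is registered as `client.rgw.$HOSTNAME` (an assumption; adjust to your deployment). `show_gc_config` and the `CEPH` override variable are hypothetical names introduced here for illustration only.

```shell
# Hypothetical helper: print the gc options as seen by a running rgw daemon.
# $1   admin socket name of the daemon (assumed: client.rgw.$HOSTNAME)
# CEPH optional override for the ceph command (defaults to "sudo ceph")
show_gc_config() {
  for opt in rgw_enable_gc_threads rgw_gc_obj_min_wait rgw_gc_processor_period; do
    ${CEPH:-sudo ceph} daemon "$1" config get "$opt"
  done
}
```

Calling `show_gc_config "client.rgw.$HOSTNAME"` on the rgw host should report each option with its current value.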

Repeatedly interrupt 512MB object put requests for randomized object names:
```
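for i in {0..1000}; do
  # Create a 512MB file; mktemp supplies the randomized object name
  f=$(mktemp); fallocate -l 512M "$f"
  s3cmd put "$f" s3://test_bucket --disable-multipart &
  pid=$!
  # Interrupt the upload after 3 to 9 seconds
  sleep $((RANDOM % 7 + 3)); kill "$pid"
  rm "$f"
done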
```

Delete all objects in the bucket index:
```
for f in $(s3cmd ls s3://test_bucket | awk '{print $4}'); do
  s3cmd del "$f"
done
```

By default, rgw_gc_max_objs splits the garbage collection list into 32 shards (gc.0 through gc.31).
Capture omap detail and verify zero-length chains were left over:
```
export CEPH_ARGS="--id=rgw.$HOSTNAME"
for i in {0..31}; do
  sudo -E rados -p default.rgw.log --namespace gc listomapvals gc.$i
done
```
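
When the full omap dump is too noisy, a per-shard key count can give a quicker before/after comparison. This is a rough sketch rather than part of the original test case: `gc_shard_counts` and the `RADOS` override variable are hypothetical names, and the pool and namespace are the ones used above. The count is of raw omap keys, which may not map one-to-one onto gc entries, so treat it as a relative indicator only.

```shell
# Hypothetical helper: print "<shard object> <omap key count>" per gc shard.
# $1    number of shards (defaults to 32, matching the default noted above)
# RADOS optional override for the rados command (defaults to "sudo -E rados")
gc_shard_counts() {
  shards=${1:-32}
  i=0
  while [ "$i" -lt "$shards" ]; do
    # listomapkeys prints one key per line, so counting lines counts keys
    n=$(${RADOS:-sudo -E rados} -p default.rgw.log --namespace gc listomapkeys "gc.$i" | wc -l)
    echo "gc.$i $((n))"
    i=$((i + 1))
  done
}
```

With `CEPH_ARGS` exported as above, running `gc_shard_counts` before and after `gc process` should show the counts dropping if entries are being trimmed.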

Confirm the garbage collection list contains expired objects by listing expiration timestamps:
`sudo -E radosgw-admin gc list | grep time; date`

Raise the debug level and process the garbage collection list:
`sudo -E radosgw-admin --debug-rgw=20 --err-to-stderr gc process`

Use the logs to verify the garbage collection process iterates through all remaining omap entry tags. Then confirm all rados objects have been cleaned up:
`sudo -E rados -p default.rgw.buckets.data ls`
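
For scripted runs, the final check can be turned into an exit status. A minimal sketch, assuming the data pool holds nothing but the objects from this test (any other bucket's data would make it report non-empty); `pool_is_empty` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only if the listing on stdin has no object names.
# Blank lines are ignored; any non-whitespace character counts as an object.
pool_is_empty() {
  ! grep -q '[^[:space:]]'
}
```

For example: `sudo -E rados -p default.rgw.buckets.data ls | pool_is_empty && echo "data pool is clean"`.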

[Regression Potential]
Backport has been accepted into the Luminous release stable branch upstream.

[Other Information]
This issue has been reported upstream [0] and was fixed in Nautilus alongside a number of other garbage collection issues/enhancements in pr#26601 [1]:
* adds additional logging to make future debugging easier
* resolves a bug where the truncated flag was not always set correctly in gc_iterate_entries
* resolves a bug where the marker in RGWGC::process was not advanced
* resolves a bug where gc entries with a zero-length chain were not trimmed
* resolves a bug where the same gc entry tag was added to the deletion list multiple times

These fixes were slated for back-port into Luminous and Mimic, but the Luminous work was not completed because of a required dependency: AIO GC [2]. This dependency has been resolved upstream, and is pending SRU verification in Ubuntu packages [3].

[0] https://tracker.ceph.com/issues/38454
[1] https://github.com/ceph/ceph/pull/26601
[2] https://tracker.ceph.com/issues/23223
[3] https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1838858

Kellen Renshaw (krenshaw) wrote :

It appears that there is an existing backport:
https://tracker.ceph.com/issues/38714

that depends on:
https://tracker.ceph.com/issues/23223

Dan Hill (hillpd)
Changed in ceph (Ubuntu):
assignee: nobody → Dan Hill (hillpd)
Dan Hill (hillpd)
Changed in ceph (Ubuntu):
assignee: Dan Hill (hillpd) → nobody
Changed in ceph (Ubuntu Bionic):
assignee: nobody → Dan Hill (hillpd)
Dan Hill (hillpd)
description: updated
summary: - Need backport of 0-length gc chain fixes to Luminous
+ Backport of zero-length gc chain fixes to Luminous
Dan Hill (hillpd) wrote :

pr#30367 [0] is currently pending upstream review, but needs to have build issues resolved.

[0] https://github.com/ceph/ceph/pull/30367

Billy Olsen (billy-olsen) wrote :

Adding to UCA queens for luminous backport. Fix is in the mimic series (UCA rocky) already.

Dan Hill (hillpd)
Changed in ceph (Ubuntu Bionic):
status: New → In Progress
Dan Hill (hillpd) wrote :

Want to clearly state that while AIO GC is a dependency, these fixes do not address anything introduced by that feature.

The fixes address bugs that existed prior to AIO GC.

James Page (james-page)
Changed in cloud-archive:
status: New → Invalid
Changed in ceph (Ubuntu):
status: New → Invalid
Changed in ceph (Ubuntu Bionic):
importance: Undecided → High
tags: added: sts-sru-needed
Dan Hill (hillpd) wrote :

The upstream back-port is being tracked by issue#38714 [0], and pr#31664 [1] is pending upstream review.

[0] https://tracker.ceph.com/issues/38714
[1] https://github.com/ceph/ceph/pull/31664

James Page (james-page)
description: updated
Timo Aaltonen (tjaalton) wrote : Please test proposed package

Hello Kellen, or anyone else affected,

Accepted ceph into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/ceph/12.2.12-0ubuntu0.18.04.4 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in ceph (Ubuntu Bionic):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-bionic
James Page (james-page) wrote :

Hello Kellen, or anyone else affected,

Accepted ceph into queens-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:queens-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-queens-needed to verification-queens-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-queens-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-queens-needed
Corey Bryant (corey.bryant) wrote :

Hello Kellen, or anyone else affected,

Accepted ceph into queens-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:queens-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-queens-needed to verification-queens-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-queens-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

James Page (james-page) wrote :

General regression testing of bionic/proposed completed OK:

======
Totals
======
Ran: 92 tests in 668.7828 sec.
 - Passed: 84
 - Skipped: 8
 - Expected Fail: 0
 - Unexpected Success: 0
 - Failed: 0
Sum of execute time for each test: 779.9697 sec.

Dan Hill (hillpd)
description: updated
tags: added: verification-done-bionic
removed: verification-needed-bionic
Dan Hill (hillpd)
description: updated
tags: added: verification-queens-done
removed: verification-queens-needed
Dan Hill (hillpd)
tags: added: verification-done
removed: verification-needed
Łukasz Zemczak (sil2100) wrote :

I see there's a very specific test case here in the bug to verify whether the issue is resolved, but I don't see any mention of it being run as part of verification. Was it part of the general regression testing? Could you perform those steps and only then switch the bug to -verified? Thank you!

tags: added: verification-needed verification-needed-bionic
removed: verification-done verification-done-bionic
Dan Hill (hillpd) wrote :

Sorry, I should have clearly indicated that the test cases were exercised.

Verification has been completed on both bionic and queens.

tags: added: verification-done verification-done-bionic
removed: verification-needed verification-needed-bionic
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package ceph - 12.2.12-0ubuntu0.18.04.4

---------------
ceph (12.2.12-0ubuntu0.18.04.4) bionic; urgency=medium

  [ Billy Olsen ]
  * Do not validate fs caps on authorize (LP: #1847822):
    - d/p/dont-validate-fs-caps-on-authorize.patch: Do not validate
      the filesystem caps with a new client connection to the monitor
      when authorizing a client connection.

  [ Dan Hill ]
  * d/p/issue38454.patch: Cherry pick of fixes for misc RGW bugs
    and cleanup of garbage collection code (LP: #1843085).

  [ Dariusz Gadomski ]
  * d/p/issue37490.patch: Cherry pick fix to optimize LVM queries
    in ceph-volume, resolving performance issues in systems under
    heavy load or with large numbers of disks (LP: #1850754).

 -- James Page <email address hidden> Thu, 28 Nov 2019 10:27:34 +0000

Changed in ceph (Ubuntu Bionic):
status: Fix Committed → Fix Released
Łukasz Zemczak (sil2100) wrote : Update Released

The verification of the Stable Release Update for ceph has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

James Page (james-page) wrote :

The verification of the Stable Release Update for ceph has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

James Page (james-page) wrote :

This bug was fixed in the package ceph - 12.2.12-0ubuntu0.18.04.4~cloud0
---------------

 ceph (12.2.12-0ubuntu0.18.04.4~cloud0) xenial-queens; urgency=medium
 .
   * New update for the Ubuntu Cloud Archive.
 .
 ceph (12.2.12-0ubuntu0.18.04.4) bionic; urgency=medium
 .
   [ Billy Olsen ]
   * Do not validate fs caps on authorize (LP: #1847822):
     - d/p/dont-validate-fs-caps-on-authorize.patch: Do not validate
       the filesystem caps with a new client connection to the monitor
       when authorizing a client connection.
 .
   [ Dan Hill ]
   * d/p/issue38454.patch: Cherry pick of fixes for misc RGW bugs
     and cleanup of garbage collection code (LP: #1843085).
 .
   [ Dariusz Gadomski ]
   * d/p/issue37490.patch: Cherry pick fix to optimize LVM queries
     in ceph-volume, resolving performance issues in systems under
     heavy load or with large numbers of disks (LP: #1850754).
