Memory leak of struct _virPCIDeviceAddress on libvirt

Bug #1844455 reported by Guilherme G. Piccoli
This bug affects 1 person
Affects                 Status         Importance   Assigned to            Milestone
Ubuntu Cloud Archive    Fix Released   Undecided    Guilherme G. Piccoli
  Mitaka                Fix Released   Undecided    Guilherme G. Piccoli
  Queens                Fix Released   Undecided    Guilherme G. Piccoli
libvirt (Ubuntu)        Fix Released   High         Guilherme G. Piccoli
  Xenial                Fix Released   High         Guilherme G. Piccoli
  Bionic                Fix Released   High         Guilherme G. Piccoli
  Eoan                  Fix Released   High         Guilherme G. Piccoli
  Focal                 Fix Released   High         Guilherme G. Piccoli

Bug Description

[Impact]
* There is a long-standing memory leak in libvirt in the code that gathers PCI information from sysfs on Linux, specifically for SR-IOV devices. It was fixed by commit 38816336 ("node_device_conf: Don't leak @physical_function in virNodeDeviceGetPCISRIOVCaps") [ libvirt.org/git/?p=libvirt.git;a=commit;h=38816336 ].

* Comment #9 has a detailed explanation of what's going on, but in summary: the variable physical_function (a member of a PCI structure), of type _virPCIDeviceAddress, is allocated in virPCIGetDeviceAddressFromSysfsLink() and should be freed before it is reused in virNodeDeviceGetPCISRIOVCaps(); before the fix, it was not.

* The leak is usually small, but it may grow depending on the number of PCI devices and how/when libvirt enumerates them; if a user of those functions actively exercises the leak path, it may become a problem (an OOM situation).
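
The pattern described above can be illustrated with a minimal C sketch. It is not the libvirt code: the types pci_addr and pci_caps, the helper names, and the small allocation tracker are all hypothetical stand-ins, used only to show why overwriting an owned pointer on every refresh loses the previous block, and how freeing it first avoids that.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins; these are NOT the real libvirt definitions. */
typedef struct {
    unsigned int domain, bus, slot, function;
} pci_addr;

typedef struct {
    pci_addr *physical_function; /* owned by the capability object */
} pci_caps;

/* Tiny allocation tracker so the demo can count blocks nobody points to. */
#define MAX_TRACKED 64
static pci_addr *tracked[MAX_TRACKED];

/* Plays the role of the sysfs lookup: returns a freshly allocated address. */
static pci_addr *addr_from_sysfs(void)
{
    for (size_t i = 0; i < MAX_TRACKED; i++) {
        if (tracked[i] == NULL) {
            tracked[i] = calloc(1, sizeof(pci_addr));
            return tracked[i];
        }
    }
    return NULL; /* tracker full */
}

static void addr_free(pci_addr *addr)
{
    if (addr == NULL)
        return;
    for (size_t i = 0; i < MAX_TRACKED; i++) {
        if (tracked[i] == addr) {
            tracked[i] = NULL;
            free(addr);
            return;
        }
    }
}

/* Leaky refresh: the old pointer is overwritten without being freed. */
static void refresh_leaky(pci_caps *caps)
{
    caps->physical_function = addr_from_sysfs();
}

/* Fixed refresh: release the previous address before storing the new one. */
static void refresh_fixed(pci_caps *caps)
{
    addr_free(caps->physical_function);
    caps->physical_function = addr_from_sysfs();
}

/* Refresh `rounds` times, tear down normally, and return how many blocks
 * are still allocated (i.e. lost). The tracker then releases them so the
 * demo itself stays clean. */
int lost_blocks_after(int rounds, int use_fix)
{
    pci_caps caps = { NULL };
    int lost = 0;

    assert(rounds >= 0 && rounds <= MAX_TRACKED);
    for (int i = 0; i < rounds; i++) {
        if (use_fix)
            refresh_fixed(&caps);
        else
            refresh_leaky(&caps);
    }
    addr_free(caps.physical_function); /* teardown frees only the last block */

    for (size_t i = 0; i < MAX_TRACKED; i++) {
        if (tracked[i] != NULL) {
            lost++;
            free(tracked[i]);
            tracked[i] = NULL;
        }
    }
    return lost;
}
```

In this sketch, lost_blocks_after(5, 0) reports every block except the last one as lost, while lost_blocks_after(5, 1) reports none. Each repeated query of the same device plays the role of one "round".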

[Test Case]
* The basic test used to exercise the memory leak path was to run the virsh tool in a loop to generate the XML output of an SR-IOV PCI device, like:

while true; do virsh nodedev-dumpxml pci_0000_08_12_0 >/dev/null; done

* This was executed while libvirtd was running under Valgrind, in order to collect the signature of the leak. Without the patch we get the "definitely lost" type of leak with the PCI backtrace (in comment #9), whereas with the patch we no longer see the leak.

[Regression Potential]
* The regression potential is really low: the fix has been upstream for a while and is already in the Focal package, and it is self-contained and not intrusive. Considering hypothetical scenarios, if there is an issue with the fix it should come in the form of unused memory or a double-free (which glibc typically detects, aborting the process), and only in the PCI enumeration (or PCI XML generation) paths.

Revision history for this message
Guilherme G. Piccoli (gpiccoli) wrote :
Changed in libvirt (Ubuntu Xenial):
assignee: nobody → Guilherme G. Piccoli (gpiccoli)
Revision history for this message
Guilherme G. Piccoli (gpiccoli) wrote :

Leak #1:

==12891== 80 bytes in 1 blocks are definitely lost in loss record 949 of 1,360
==12891== at 0x4C2CC70: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==12891== by 0x50A5299: virAlloc (viralloc.c:144)
==12891== by 0x50C1C63: virLastErrorObject (virerror.c:240)
==12891== by 0x50C4F88: virResetLastError (virerror.c:412)
==12891== by 0x51B786C: virConnectOpen (libvirt.c:1139)

From pahole:

struct _virError {
        int code; /* 0 0x4 */
        int domain; /* 0x4 0x4 */
        char * message; /* 0x8 0x8 */
        virErrorLevel level; /* 0x10 0x4 */

        /* XXX 4 bytes hole, try to pack */

        virConnectPtr conn; /* 0x18 0x8 */
        virDomainPtr dom; /* 0x20 0x8 */
        char * str1; /* 0x28 0x8 */
        char * str2; /* 0x30 0x8 */
        char * str3; /* 0x38 0x8 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        int int1; /* 0x40 0x4 */
        int int2; /* 0x44 0x4 */
        virNetworkPtr net; /* 0x48 0x8 */

        /* size: 80, cachelines: 2, members: 12 */
        /* sum members: 76, holes: 1, sum holes: 4 */
        /* last cacheline: 16 bytes */
};
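
As a cross-check of the dump above, the 80-byte size and the 4-byte hole can be reproduced with a stand-in struct. This is only a sketch: the struct name is hypothetical, the libvirt pointer types are replaced by void * and virErrorLevel by int, and the numbers assume a typical LP64 Linux ABI (4-byte int, 8-byte pointers).

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in with the same member order as the pahole dump above.
 * Assumption: pointer types become void *, the enum becomes int. */
struct vir_error_sketch {
    int code;
    int domain;
    char *message;
    int level;      /* virErrorLevel in the original */
    /* the compiler pads here so that conn is pointer-aligned */
    void *conn;
    void *dom;
    char *str1;
    char *str2;
    char *str3;
    int int1;
    int int2;
    void *net;
};

static_assert(offsetof(struct vir_error_sketch, code) == 0,
              "first member starts at offset 0");

/* Total size, padding included. */
size_t sketch_size(void)
{
    return sizeof(struct vir_error_sketch);
}

/* Bytes of padding between the end of `level` and the start of `conn`. */
size_t sketch_hole(void)
{
    return offsetof(struct vir_error_sketch, conn)
         - (offsetof(struct vir_error_sketch, level) + sizeof(int));
}

/* Sum of the member sizes alone: 5 ints and 7 pointers. */
size_t sketch_members(void)
{
    return 5 * sizeof(int) + 7 * sizeof(void *);
}

/* Returns 1 when the layout matches the dump (expected on LP64), else 0. */
int matches_pahole_dump(void)
{
    return sketch_size() == 80 && sketch_hole() == 4 && sketch_members() == 76;
}
```

On an LP64 system the three helpers line up with pahole's "size: 80", "sum holes: 4" and "sum members: 76", which is why an 80-byte calloc() in the Valgrind record points at this struct.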

The _virError struct (in the form of the virErrorPtr typedef) is expected to be freed
in virLastErrFreeData(), which is a thread "destructor" set up at pthread
creation. It should be called when the thread exits (by pthread_exit() or something
analogous), but can be skipped if the process main() function returns.
So, the hypotheses here are:

1) There was a process exit that led to this thread data getting leaked
2) Valgrind should be ended (with SIGTERM) in order to capture the thread
destructor execution, so this is a false/temporary leak.
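
The thread-destructor behaviour behind both hypotheses can be sketched in plain C with POSIX thread-specific data. This is an illustration, not libvirt's code: the key name, the 80-byte payload, and the helper names are assumptions. It shows that a key destructor runs for a worker thread that exits, while the value stored by a thread that has not exited (here, the caller) stays allocated until someone frees it. On older toolchains the program may need to be linked with -pthread.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

static pthread_key_t err_key;   /* hypothetical per-thread error slot */
static int destructor_calls;

/* Key destructor: runs at thread exit for threads holding a non-NULL value. */
static void last_err_free(void *data)
{
    assert(data != NULL);
    destructor_calls++;
    free(data);
}

static void *worker(void *arg)
{
    (void)arg;
    pthread_setspecific(err_key, calloc(1, 80));
    return NULL; /* returning from the start routine is a thread exit */
}

/* Returns how many times the destructor ran by the time the worker has
 * been joined, or -1 on setup failure. */
int destructor_calls_after_worker(void)
{
    pthread_t tid;
    int calls;

    destructor_calls = 0;
    if (pthread_key_create(&err_key, last_err_free) != 0)
        return -1;

    /* The calling thread stores a value too, but it does not exit here. */
    pthread_setspecific(err_key, calloc(1, 80));

    if (pthread_create(&tid, NULL, worker, NULL) != 0) {
        free(pthread_getspecific(err_key));
        pthread_setspecific(err_key, NULL);
        pthread_key_delete(err_key);
        return -1;
    }
    pthread_join(tid, NULL);
    calls = destructor_calls; /* only the worker's value was destroyed */

    /* Manual cleanup for the calling thread: without it, this block would
     * still be allocated if the process simply returned from main(). */
    free(pthread_getspecific(err_key));
    pthread_setspecific(err_key, NULL);
    pthread_key_delete(err_key);
    return calls;
}
```

The function returns 1: the worker's block was released by the destructor, whereas the caller's block needed the explicit free(). A block in that second situation is what a leak checker would flag if the process ended without the thread exiting.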

Revision history for this message
Guilherme G. Piccoli (gpiccoli) wrote :

Leak #2:

==21823== 385 (280 direct, 105 indirect) bytes in 1 blocks are definitely lost in loss record 88 of 106
==21823== at 0x4C2CC70: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==21823== by 0x50A5299: virAlloc (viralloc.c:144)
==21823== by 0x14B185: daemonConfigNew (libvirtd-config.c:242)

This one is similar to the previous one; it's likely that struct daemonConfig is leaked.
This structure is allocated in daemonConfigNew() (see the backtrace above) and freed by
daemonConfigFree() at the end of the libvirtd main() function. The hypothesis here again
is that either the process was killed or we need to terminate Valgrind correctly so that
it records the free calls for this object.

Revision history for this message
Guilherme G. Piccoli (gpiccoli) wrote :

Leak #3
968 bytes in 1 blocks are definitely lost in loss record 1,313 of 1,405
at 0x4C2AB80: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
by 0x6D4BA21: xmlGetGlobalState (in /usr/lib/x86_64-linux-gnu/libxml2.so.2.9.3)
by 0x6D4B214: __xmlIndentTreeOutput (in /usr/lib/x86_64-linux-gnu/libxml2.so.2.9.3)
by 0x5149F7D: virDomainDefFormatInternal (domain_conf.c:21655)
by 0x514D0F2: virDomainDefFormat (domain_conf.c:22517)
by 0x515B93B: virDomainDefCopy (domain_conf.c:23041)
by 0x515B9FF: virDomainObjSetDefTransient (domain_conf.c:2818)
by 0x2B70CCD9: qemuProcessInit (qemu_process.c:4483)
by 0x2B70CFEE: qemuProcessStart (qemu_process.c:5150)
by 0x2B76D727: qemuDomainObjStart.constprop.47 (qemu_driver.c:7396)
by 0x2B76DE65: qemuDomainCreateWithFlags (qemu_driver.c:7450)
by 0x51CC7FB: virDomainCreate (libvirt-domain.c:6753)

I've checked the code in libxml2, and what is leaking is the struct _xmlGlobalState,
allocated in xmlNewGlobalState(), which is called from xmlGetGlobalState(); this is part
of the per-thread state handling in libxml2. Using pahole I could check the struct
size, and it matches:

$ pahole --hex -C _xmlGlobalState <libxml_dbg_file> | grep size
/* size: 968, cachelines: 16, members: 33 */

The struct is freed in xmlFreeGlobalState() when the thread exits. This is again a case
where a cleanup routine seems to be skipped; we need to understand whether
the process is being killed without having a chance to run these cleanups.

I'm preparing a test package to validate the cleanups execution.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Hi,
thanks for the report and your work on it.
Are you sure those issues are not happening on newer versions?

Because
a) if they are fixed in newer versions then I'd ask you to backport the respective fixes from there (can be partial if they were bigger reworks, but always better than starting from scratch).
b) if they are not fixed in newer versions we should address and fix those as well (For SRU policy and long term viability)

So after your further analysis of the case give the new versions a check in that regard please.

Changed in cloud-archive:
assignee: nobody → Guilherme G. Piccoli (gpiccoli)
status: New → Confirmed
Revision history for this message
Guilherme G. Piccoli (gpiccoli) wrote :

Hi Christian, thanks for your comment! I didn't receive an email with that comment, unfortunately.
I'm not sure whether they happen on the upstream version; I've tried it already and saw different leaks, potentially correlated with the virErr ones.

I couldn't see the libxml2 leaks. I'll continue the investigation, thanks!

Revision history for this message
Guilherme G. Piccoli (gpiccoli) wrote :

Just a minor update here, we found one leak to be fixed by the following commit:
libvirt.org/git/?p=libvirt.git;a=commit;h=38816336a5 .
We continue the investigation of the other leaks.

Cheers,

Guilherme

Revision history for this message
Guilherme G. Piccoli (gpiccoli) wrote :

We validated that commit 38816336a5 ("node_device_conf: Don't leak @physical_function in virNodeDeviceGetPCISRIOVCaps") [0] indeed fixes the leak under investigation. Although Valgrind reports more definitely-lost memory, those reports are ultimately glibc-related; given that the report was on Trusty and that the leaks are not considerable (at most 8K/24h), our focus will be on fixing the PCI-related leak in all libvirt releases.

SRU template and debdiffs will get added here soon.
Cheers,

Guilherme

[0] libvirt.org/git/?p=libvirt.git;a=commit;h=38816336a5

Changed in libvirt (Ubuntu Eoan):
status: New → Confirmed
Changed in libvirt (Ubuntu Bionic):
status: New → Confirmed
Changed in libvirt (Ubuntu Xenial):
status: New → Confirmed
Changed in libvirt (Ubuntu Eoan):
importance: Undecided → High
Changed in libvirt (Ubuntu Bionic):
importance: Undecided → High
Changed in libvirt (Ubuntu Xenial):
importance: Undecided → High
Changed in libvirt (Ubuntu Bionic):
assignee: nobody → Guilherme G. Piccoli (gpiccoli)
Changed in libvirt (Ubuntu Eoan):
assignee: nobody → Guilherme G. Piccoli (gpiccoli)
summary: - Memory leak on libvirt 1.3.1
+ Memory leak of struct _virPCIDeviceAddress on libvirt
Changed in libvirt (Ubuntu Focal):
status: Confirmed → Fix Released
Revision history for this message
Guilherme G. Piccoli (gpiccoli) wrote :

The function that allocates the structure _virPCIDeviceAddress is: virPCIGetDeviceAddressFromSysfsLink().

This function is called in the following path, specifically in virPCIGetPhysicalFunction(), which is itself called from nodeDeviceSysfsGetPCISRIOVCaps().

[0] [1] [2]
-----------------------------------
                 |
                 |
nodeDeviceSysfsGetPCIRelatedDevCaps()
  nodeDeviceSysfsGetPCISRIOVCaps()

Since the tree is a bit large, I've split it into 3 parts: [0], [1] and [2].
In that terminology, the bottom function in each stack calls nodeDeviceSysfsGetPCIRelatedDevCaps().

[0] src/node_device/node_device_driver.c

virNodeDeviceGetXMLDesc() [libvirt-nodedev.c]
  virNodeDeviceDriver node_device_driver callback <.nodeDeviceGetXMLDesc(), in udev/hal/remote>
    nodeDeviceGetXMLDesc()
      update_caps()

[1] src/node_device/node_device_udev.c
   virStateDriver udevStateDriver [callback register, stateInitialize()]
                        |
            -------------------------
            |                       |
            |              nodeStateInitialize()
            |                       |
   nodeStateInitialize()   udevEnumerateDevices()
            |                       |
udevEventHandleCallback() || udevProcessDeviceListEntry
  udevAddOneDevice()
    udevGetDeviceDetails()
      udevProcessPCI()

[2] src/node_device/node_device_hal.c << deprecated >>
virStateDriver halStateDriver callbacks register [stateInitialize(), stateInitialize()]
              |               libhal_ctx_set_device_added()   |-> multiple calls in this file
     -------------------              |                       |
     |                 |              |                       |
nodeStateInitialize() || nodeStateReload || device_added() || dev_refresh()
         |
      dev_create()        libhal_ctx_set_device_new_capability() <HAL callback>
         |                       |
gather_capabilities() || device_cap_added()
  gather_capability()
    caps_tbl[VIR_NODE_DEV_CAP_PCI_DEV]() <gather_fn ptr>
      gather_pci_cap()

We can skip the path [2] given that HAL is deprecated and only udev is available on Ubuntu. Trying to follow the path [1] led us to exercise the allocation of the structure responsible for the leak in question, but at the same time, udevEventHandleCallback() is able to deallocate the variable.
The trigger is the following:

while true; do udevadm trigger; done

This goes through the aforementioned udev path ([1]) and allocates one instance of the _virPCIDeviceAddress structure per device; it is deallocated, though, according to the following backtrace collected with gdb:

#0 virNodeDevCapsDefFree (caps=0x558d824d6ba0) at ../../../src/conf/node_device_conf.c:1709
#1 0x00007f7f1244db6c in virNodeDeviceDefFree (def=0x558d824c29c0) at ../../../src/conf/node_device_conf.c:146
#2 0x00007f7f1244fe99 in virNodeDeviceAssignDef (devs=0x7f7ee00e68b8, def=0x558d824cde50)
    at ../../../src/conf/node_device_conf.c:182
#3 0x00007f7eebe1efae in udevAddOneDevice (device=device@entry=0x558d824d85e0)
 ...


Revision history for this message
Guilherme G. Piccoli (gpiccoli) wrote :

Formatting was severely impaired in the last comment, so here's a more "visual" diagram set: https://pastebin.ubuntu.com/p/Qps5g2gmsv

Cheers,

Guilherme

Revision history for this message
Guilherme G. Piccoli (gpiccoli) wrote :
description: updated
Changed in libvirt (Ubuntu Eoan):
status: Confirmed → In Progress
Changed in libvirt (Ubuntu Bionic):
status: Confirmed → In Progress
Changed in libvirt (Ubuntu Xenial):
status: Confirmed → In Progress
tags: added: sts-sponsor-dgadomski
Revision history for this message
Guilherme G. Piccoli (gpiccoli) wrote :

The patch for LP #1864918 (Xenial-only) is also being uploaded here, in the Xenial debdiff.
Thanks Dariusz for sponsoring this one!
Cheers,

Guilherme

Revision history for this message
Brian Murray (brian-murray) wrote : Please test proposed package

Hello Guilherme, or anyone else affected,

Accepted libvirt into eoan-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/libvirt/5.4.0-0ubuntu5.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-eoan to verification-done-eoan. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-eoan. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in libvirt (Ubuntu Eoan):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-eoan
Changed in libvirt (Ubuntu Bionic):
status: In Progress → Fix Committed
tags: added: verification-needed-bionic
Revision history for this message
Brian Murray (brian-murray) wrote :

Hello Guilherme, or anyone else affected,

Accepted libvirt into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/libvirt/4.0.0-1ubuntu8.15 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Revision history for this message
Robie Basak (racb) wrote :

Hello Guilherme, or anyone else affected,

Accepted libvirt into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/libvirt/1.3.1-1ubuntu10.30 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-xenial to verification-done-xenial. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-xenial. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in libvirt (Ubuntu Xenial):
status: In Progress → Fix Committed
tags: added: verification-needed-xenial
Revision history for this message
Corey Bryant (corey.bryant) wrote :

Hello Guilherme, or anyone else affected,

Accepted libvirt into queens-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:queens-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-queens-needed to verification-queens-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-queens-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-queens-needed
Changed in cloud-archive:
status: Confirmed → In Progress
Changed in cloud-archive:
status: In Progress → Fix Committed
Revision history for this message
Guilherme G. Piccoli (gpiccoli) wrote :

I've managed to verify the following proposed versions of libvirt:

* Eoan: 5.4.0-0ubuntu5.1
* Bionic: 4.0.0-1ubuntu8.15
* Xenial/Queens: 4.0.0-1ubuntu8.15~cloud0
* Xenial: 1.3.1-1ubuntu10.30
* Trusty/Mitaka: 1.3.1-1ubuntu10.30~cloud0

In all cases, the Valgrind approach was used to verify that the leak was absent in the -proposed versions; tests with the current / -updates versions were also performed as a reference, and the leak was observed in those.
Hence, I'm marking this LP as verified for all releases.
Thanks,

Guilherme

tags: added: verification-done verification-done-bionic verification-done-eoan verification-done-xenial verification-mitaka-done verification-queens-done
removed: verification-needed verification-needed-bionic verification-needed-eoan verification-needed-xenial verification-queens-needed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package libvirt - 5.4.0-0ubuntu5.1

---------------
libvirt (5.4.0-0ubuntu5.1) eoan; urgency=medium

  * d/p/lp-1844455-node_device_conf-Don-t-leak-physical_function.patch:
    fix memory-leak from PCI-related structure. (LP: #1844455)

 -- <email address hidden> (Guilherme G. Piccoli) Thu, 20 Feb 2020 12:35:23 -0300

Changed in libvirt (Ubuntu Eoan):
status: Fix Committed → Fix Released
Revision history for this message
Łukasz Zemczak (sil2100) wrote : Update Released

The verification of the Stable Release Update for libvirt has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package libvirt - 4.0.0-1ubuntu8.15

---------------
libvirt (4.0.0-1ubuntu8.15) bionic; urgency=medium

  * d/p/lp-1844455-node_device_conf-Don-t-leak-physical_function.patch:
    fix memory-leak from PCI-related structure. (LP: #1844455)

 -- <email address hidden> (Guilherme G. Piccoli) Thu, 20 Feb 2020 13:07:33 -0300

Changed in libvirt (Ubuntu Bionic):
status: Fix Committed → Fix Released
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package libvirt - 1.3.1-1ubuntu10.30

---------------
libvirt (1.3.1-1ubuntu10.30) xenial; urgency=medium

  * d/p/lp-1844455-node_device_conf-Don-t-leak-physical_function.patch:
    fix memory-leak from PCI-related structure. (LP: #1844455)
  * d/p/lp-1864918-Fix-TLS-test-suites-with-gnutls-3.6.0.patch: fix failing TLS
    tests due to recent-introduced SHA1 restriction in gnutls. (LP: #1864918)

 -- <email address hidden> (Guilherme G. Piccoli) Wed, 26 Feb 2020 13:23:18 -0300

Changed in libvirt (Ubuntu Xenial):
status: Fix Committed → Fix Released
Revision history for this message
Corey Bryant (corey.bryant) wrote :

The verification of the Stable Release Update for libvirt has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

This bug was fixed in the package libvirt - 4.0.0-1ubuntu8.15~cloud0
---------------

 libvirt (4.0.0-1ubuntu8.15~cloud0) xenial-queens; urgency=medium
 .
   * New update for the Ubuntu Cloud Archive.
 .
 libvirt (4.0.0-1ubuntu8.15) bionic; urgency=medium
 .
   * d/p/lp-1844455-node_device_conf-Don-t-leak-physical_function.patch:
     fix memory-leak from PCI-related structure. (LP: #1844455)

Changed in cloud-archive:
status: Fix Committed → Fix Released