MAAS 3.2 deb package memory leak after upgrading

Bug #2012596 reported by German Mazurenko
This bug affects 4 people
Affects  Status         Importance  Assigned to
MAAS     Fix Committed  High        Seyeong Kim
3.2      Fix Committed  High        Seyeong Kim
3.3      Fix Committed  High        Seyeong Kim
3.4      Fix Committed  High        Unassigned
3.5      Fix Committed  High        Unassigned

Bug Description

I'm using deb package version 3.2.7, with the region controller, rack controller, and PostgreSQL 14 each on its own 20.04 virtual machine (before the upgrade it was 18.04 and PostgreSQL 9.5).
After upgrading from 2.8 to 3.2.6 (upgrading from 3.2.6 to 3.2.7 made no difference), I noticed that the region and rack controllers started uncontrollably increasing their memory usage, by ~1GB per day (or more slowly on my playground with fewer bare-metal and virtual servers), until the OOM killer comes or a system administrator restarts them.
I tried to debug it myself with tracemalloc, pympler, etc., but without success.

Repo I'm using: http://ppa.launchpad.net/maas/3.2/ubuntu focal main

Tags: sts

Related branches

affects: maas → maas (Ubuntu)
description: updated
Revision history for this message
Anton Troyanov (troyanov) wrote (last edit ):

Hi German,

That's interesting; I don't think I've ever noticed the same behaviour.

1. Does this memory leak happen if you make a clean 3.2.6 install?
2. If you make a clean 2.8 install and upgrade it to 3.2.6, will it leak?
3. Any noticeable errors in the MAAS logs?
https://maas.io/docs/how-to-review-and-report-bugs#heading--logfiles

4. What did tracemalloc show? It would be great if you could provide the top 10 from tracemalloc:
https://docs.python.org/3/library/tracemalloc.html#display-the-top-10
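For reference, the linked recipe reduces to a few lines; a minimal sketch (the list comprehension is just a stand-in for the real regiond/rackd workload):

```python
import tracemalloc

tracemalloc.start()

# Stand-in workload; in regiond/rackd this would be normal service
# operation between the start() call and the snapshot.
data = ["x" * 100 for _ in range(10_000)]

# Group current allocations by source line and print the biggest ones.
snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics("lineno")
for stat in top_stats[:10]:
    print(stat)
```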

Revision history for this message
German Mazurenko (german-m) wrote (last edit ):

Hi Anton,
> 1. Does this memory leak happen if you make a clean 3.2.6 install?
I added a freshly installed region and rack controller to my test setup, but they leak too.
> 2. If you make a clean 2.8 install and upgrade it to 3.2.6, will it leak?
Yes
> 3. Any noticeable errors in the MAAS logs?
Unfortunately, no
> 4. What did tracemalloc show? It would be great if you could provide the top 10 from tracemalloc

I didn't save the tracemalloc output because it wasn't helpful (and there is strange behavior: when I enable tracemalloc, the leaking... goes away).
Instead, here is pympler output, which is also not so helpful but demonstrates my problem (the leakage is slower because it's a test installation for debugging, with only two virsh KVM hosts and 11 VMs inside):

region, 03:46:15
                        types | # objects | total size
 ============================ | =========== | ============
                          str | 178920 | 19.41 MB
                         list | 77308 | 18.18 MB
                         dict | 56909 | 16.56 MB
                         type | 6794 | 6.51 MB
                         code | 33649 | 5.80 MB
                        tuple | 48106 | 3.10 MB
                          set | 2498 | 954.92 KB
                      weakref | 11256 | 791.44 KB
                          int | 25271 | 699.57 KB
          function (__init__) | 2695 | 357.93 KB
                         cell | 9127 | 356.52 KB
            getset_descriptor | 5594 | 349.62 KB
                    frozenset | 1022 | 324.08 KB
                  abc.ABCMeta | 314 | 321.24 KB
   builtin_function_or_method | 4046 | 284.48 KB

region, 09:49:15

                     types | # objects | total size
 ========================= | =========== | ============
                      list | 197328 | 60.93 MB
                       str | 298641 | 28.07 MB
                      dict | 60274 | 17.41 MB
                      type | 6794 | 6.51 MB
                      code | 33649 | 5.84 MB
                     tuple | 48836 | 3.15 MB
                       int | 50384 | 1.35 MB
                       set | 2786 | 1016.17 KB
                   weakref | 11259 | 791.65 KB
                      cell | 9681 | 378.16 KB
       function (__init__) | 2695 | 357.93 KB
         getset_descriptor | 5594 | 349.62 KB
   collections.OrderedDict | 724 | 335.00 KB
                 frozenset | 1022 | 324.08 KB
               abc.ABCMeta | 314 | 321.24 KB

rack, 03:45:24:

                        types | # objects | total size
 ============================ | =========== | ============
                          str | 107517 | 11.20 MB
                         list | 44329 | 10.19 MB
                         dict | 28532 | 8.87 MB
                         type | 4264 | 3.93 MB
                         code | 21193 | 3.68 MB
                        tuple | 24374 | ...

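A pympler-style per-type census like the tables above can also be approximated with the stdlib alone, by counting live objects per type with `gc` (a rough sketch; it only sees GC-tracked objects, and the workload below is a stand-in):

```python
import gc
from collections import Counter

def type_census() -> Counter:
    """Count live, GC-tracked objects per type name."""
    return Counter(type(obj).__name__ for obj in gc.get_objects())

before = type_census()
# Stand-in workload; in a real session, let the leaking service run for hours.
grown = [list(range(10)) for _ in range(5_000)]
after = type_census()

# Print the types that grew the most, pympler-summary style.
for name, delta in (after - before).most_common(5):
    print(f"{name:>10} | +{delta}")
```

Diffing two censuses taken hours apart shows which types accumulate, much like the list/str growth visible between the 03:46 and 09:49 tables.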

no longer affects: maas
affects: maas (Ubuntu) → maas
Revision history for this message
Anton Troyanov (troyanov) wrote :

German, is it possible to try the snap version of MAAS?

I am also wondering whether MAAS without any KVMs and VMs also leaks.
And you've mentioned the OOM killer, but do you know how much memory MAAS had consumed by that time?

Changed in maas:
status: New → Incomplete
Revision history for this message
German Mazurenko (german-m) wrote :

Another symptom:
If rackd catches an OOM kill or restarts, regiond frees its "extra" memory.

Changed in maas:
status: Incomplete → New
status: New → Incomplete
Revision history for this message
German Mazurenko (german-m) wrote :

> is it possible to try the snap version of MAAS?
No

> I am also wondering if MAAS without any KVMs and VMs also leaks?
Not sure, but it seems like no.

> And you've mentioned OOM-killer, but do you know how much memory MAAS consumed by that time?
~3.5-4GB by maas-rackd

Changed in maas:
status: Incomplete → New
Revision history for this message
Anton Troyanov (troyanov) wrote :

German, just to double check:

You have 20.04 running inside a VM, right?
Do you have anything else besides MAAS installed there that could interfere with MAAS dependencies?

Do you have Prometheus metrics enabled?
https://maas.io/docs/deb/2.7/ui/prometheus-metrics#heading--enabling-prometheus-endpoints

Changed in maas:
status: New → Incomplete
Revision history for this message
German Mazurenko (german-m) wrote :

>You have 20.04 running inside VM, right?
Yes

> Do you have anything else besides MAAS installed there that could interfere with MAAS dependencies?
No, just one component on each VM.

>Do you have Prometheus metrics enabled?
Prometheus metrics are not enabled:

maas admin maas get-config name=prometheus_enabled
Success.
Machine-readable output follows:
false

Revision history for this message
German Mazurenko (german-m) wrote (last edit ):

New info:
1. Attaching tracemalloc to rackd doesn't affect the leaking. Also, when rackd was using ~500MB, tracemalloc.get_traced_memory() showed only 80.
2. maas-proxy also leaks (from 110MB at start to 500MB after 3 days).
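That mismatch is telling: tracemalloc only accounts for memory allocated through Python's allocator, so memory leaked inside a C extension never appears in get_traced_memory(). A small stdlib-only sketch to compare the two views (the allocation here is a stand-in):

```python
import resource
import tracemalloc

tracemalloc.start()

# Python-level allocation: visible to tracemalloc.
buf = [bytes(1_000) for _ in range(1_000)]

traced_bytes, _peak = tracemalloc.get_traced_memory()
# ru_maxrss is the whole process's peak RSS, in KiB on Linux.
rss_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

print(f"tracemalloc: {traced_bytes // 1024} KiB traced")
print(f"process:     {rss_kib} KiB peak RSS")
```

A large RSS with a small traced figure, as observed here, suggests the leak lives below the Python allocator.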

Revision history for this message
German Mazurenko (german-m) wrote :

3.1 has the same problem

Revision history for this message
German Mazurenko (german-m) wrote :

3.0/deb is leaking too.

Revision history for this message
German Mazurenko (german-m) wrote :

After some extra research, I found evidence that the change causing the memory leak was introduced between MAAS versions 2.9 and 3.0 (I've ruled out possible OS, DB, and network misconfigurations).

Revision history for this message
Alexsander de Souza (alexsander-souza) wrote :

Hi,

can you send us an SOS report from these VMs?

$ apt-get install sosreport
$ sos report (or sosreport, depending on the Ubuntu version)

Run this on both the region and rack controllers, and send us the resulting files. We recommend that you do this with MAAS 3.2, as this is the oldest version supported by us.

Changed in maas:
status: Incomplete → New
status: New → Incomplete
Revision history for this message
German Mazurenko (german-m) wrote :

sos report output:

 Setting up archive ...
 Setting up plugins ...
[plugin:firewall_tables] skipped command 'nft list ruleset': required kmods missing: nf_tables. Use '--allow-system-changes' to enable collection.
[plugin:firewall_tables] skipped command 'ip6tables -vnxL': required kmods missing: ip6table_filter, nf_tables.
[plugin:networking] skipped command 'ip -s macsec show': required kmods missing: macsec. Use '--allow-system-changes' to enable collection.
[plugin:networking] skipped command 'ss -peaonmi': required kmods missing: unix_diag, af_packet_diag, netlink_diag, xsk_diag, udp_diag. Use '--allow-system-changes' to enable collection.
[plugin:wireless] skipped command 'iw list': required kmods missing: cfg80211.
[plugin:wireless] skipped command 'iw dev': required kmods missing: cfg80211.
[plugin:wireless] skipped command 'iwconfig': required kmods missing: cfg80211.
[plugin:wireless] skipped command 'iwlist scanning': required kmods missing: cfg80211.
 Running plugins. Please wait ...

  Finishing plugins [Running: freeipmi]
 Plugin freeipmi timed out

Creating compressed archive...

Revision history for this message
German Mazurenko (german-m) wrote :
Revision history for this message
German Mazurenko (german-m) wrote :
Revision history for this message
Alberto Donato (ack) wrote :

Does the MAAS machine have any system-wide installed Python modules that don't come from debs?

Also, could you please attach the output of `dpkg -l | grep python3`?

Changed in maas:
status: Incomplete → New
status: New → Incomplete
Revision history for this message
German Mazurenko (german-m) wrote (last edit ):

Sorry for the slow response, I was on vacation.

>Does the MAAS machine have any system-wide installed python modules that don't come from debs

One setup has modules installed from pip (mostly debug tools) and one doesn't (both have the leaking problem).

>Also could you please attach the output of `dpkg -l | grep python3`?

I've attached the output (in this comment and the next) from the more recent setup (without pip modules and debug tools), which reproduces the problem.

Revision history for this message
German Mazurenko (german-m) wrote :
Revision history for this message
German Mazurenko (german-m) wrote :

I tested the snap. The snap is not leaking.

Now I'm learning how to use snap in production.

Alberto Donato (ack)
Changed in maas:
status: Incomplete → New
status: New → Incomplete
Revision history for this message
German Mazurenko (german-m) wrote :

I happened to check one of my upgraded MAAS installations and found that it stopped leaking two weeks ago, after yet another OOM kill. Between the last two restarts there were two unattended upgrades:
1. the Linux kernel from 5.4.0.150.148 to 5.4.0.152.149, and libx11-6 from 2:1.6.9-2ubuntu1.2 to 2:1.6.9-2ubuntu1.5
2. bind9 from 1:9.16.1-0ubuntu2.12 to 1:9.16.1-0ubuntu2.15

It looks like it was one of them, but I can't figure out which one.

Revision history for this message
Alberto Donato (ack) wrote :

German, thanks for your report.

I'll close this bug then, feel free to reopen if it happens again and you have additional info.

Changed in maas:
status: Incomplete → Invalid
Revision history for this message
German Mazurenko (german-m) wrote (last edit ):

It looks like it's not over.
After I wrote my last comment, I caught an OOM on my rack machine (with unattended upgrades disabled) and decided to perform apt upgrade and reboot on both controllers. After 3 days, there was an OOM on the region controller once again (however, this time it was a burst of memory usage instead of a linear leak).

Changed in maas:
status: Invalid → New
Revision history for this message
Bill Wear (billwear) wrote :

hey German,

thanks for popping this back to TOS. let me see if i can get on board, here:

so after upgrading from version 2.8 to 3.2.7, the region and rack controllers experience uncontrollable memory increases, leading to OOM events or requiring manual restarts.

humor me: how'd you upgrade MAAS from 2.8 to 3.2.7, exactly?

are there any details about your region and rack controllers (virt env, NW settings, customisations) that you maybe haven't mentioned yet?

can we get another set of logs when this happens? have you thought about using tracemalloc or pympler, if you can?

are you saying that the Snap version doesn't leak memory? that's interesting.

i'm totally unsure what to do with this one atm, except i am sure it shouldn't just sit here. we should figure out if we have a boomerang heading back for us.

thanks again for bouncing back.

Bill Wear (billwear)
Changed in maas:
status: New → Incomplete
Revision history for this message
Anton Troyanov (troyanov) wrote :

Hi German,

Sorry for the slow response.

We've recently found a bug that could potentially lead to a memory leak.
https://bugs.launchpad.net/maas/+bug/2029417

Currently there is a fix for 3.2 (however, it is not released yet):
https://git.launchpad.net/maas/commit/?id=cede3e0da9c9a7c6644f3e57e713f29091e2d5c8

Since you've mentioned you are using the deb package, I am wondering if you could try this patch and see if it solves the issue on your end?

Changed in maas:
status: Incomplete → New
status: New → Incomplete
Revision history for this message
German Mazurenko (german-m) wrote :

Hi Anton,
I've built and installed MAAS 3.2.10a1, and now I need 2-3 days to see whether that change fixes the problem.

Changed in maas:
status: Incomplete → New
status: New → Incomplete
Revision history for this message
German Mazurenko (german-m) wrote :

It didn't work.

Revision history for this message
Alberto Donato (ack) wrote :

Could you please try upgrading to 3.3 (from https://launchpad.net/~maas/+archive/ubuntu/3.3) and see if the behavior persists?

Changed in maas:
status: Incomplete → New
status: New → Incomplete
Revision history for this message
Yuriy Tabolin (olddanmer) wrote :

I have the same issue on 3.3.4 deb (and before that on 3.1). I created a cron task that restarts maas-regiond weekly, which solves our problem for now.

I've attached a screenshot of how the memory leak shows up in Prometheus node-exporter memory metrics.

Alberto Donato (ack)
tags: added: bug-council
Revision history for this message
Jerzy Husakowski (jhusakowski) wrote :

A few questions that will help us identify the root cause.
1) Are you using custom images?
2) Could you upload a profile for rackd and regiond from the period where you observe the memory leaking? A few minutes should be sufficient.
3) Could you share anything about how that MAAS instance is used? Is it driven by automation scripts that make API calls, is it mainly left idle, etc?
4) Do you see the same leak with prometheus metrics enabled and disabled?
5) Could you share the full output of `sudo netstat -atnp` at the start and at the end of the experiment?
6) Can you share region logs? The sosreports don't seem to have them.

You can toggle profile capture by issuing the command `pkill -SIGUSR1 -f maas-regiond`; sending the same signal again stops the capture and saves the profile. The profile file will be stored in /var/lib/maas/profiling, and the logs will also contain the location and name of the file.

Revision history for this message
German Mazurenko (german-m) wrote (last edit ):

Right now we've moved our production to snap, and therefore I can't answer some of your questions for the next month.

> 1) Are you using custom images?
Only for the controllers' VMs, and there's not much difference there. But we are using custom curtin configs with a prometheus-node-exporter installation, and with custom partitioning and software RAID (md) on our KVM hosts.

> 3) Could you share anything about how that MAAS instance is used? Is it driven by automation scripts that make API calls, is it mainly left idle, etc?
We're using MAAS+Juju, and our production MAAS instance receives API requests every 30 minutes (/MAAS/api/2.0/machines/ and /MAAS/api/2.0/pods/), but our test MAAS installation was leaking without those API calls (only Juju).

> 4) Do you see the same leak with prometheus metrics enabled and disabled?
Prometheus metrics were disabled all along.

Revision history for this message
Jacopo Rota (r00ta) wrote :

I'm trying to reproduce this on deb 3.3. I'll keep you posted.

tags: removed: bug-council
Revision history for this message
Jacopo Rota (r00ta) wrote :

I forgot to post the result of my first attempt here: I can see squid using more and more memory, but I could not reproduce such constant growth.

Changed in maas:
status: Incomplete → New
status: New → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for MAAS because there has been no activity for 60 days.]

Changed in maas:
status: Incomplete → Expired
Revision history for this message
Igor Gnip (igorgnip) wrote :

MAAS 3.2.10 regiond now needs more than 32 GB of RAM per VM, and the OOM killer still engages from time to time; it used to work well with 8 GB on 2.6 and with 16 GB on 2.9.

I am not convinced this should expire.

Revision history for this message
Alan Baghumian (alanbach) wrote :

Hi Igor!

I believe this behavior requires a specific series of events to happen.

I have a 4-node cluster (2 Region + 2 Rack) and this happened to me last week. The primary Region controller's memory usage went out of control, causing an OOM crash.

I'm still not quite sure what led to it, but it happened during a custom image sync process.

This cluster has 4GB RAM on the Region nodes and 2GB RAM on the Rack nodes, and has been set up like this since MAAS 3.1, but this is the first time I've seen this memory leak issue.

Just to add context, I do have a 3-node HAProxy setup (pacemaker/corosync) in front of MAAS that also does the SSL termination, and all Rack and Region controllers have been configured in rackd.conf and regiond.conf to point to HAProxy's FQDN.

I'll try to poke around to see if I can reproduce this.

Best,
Alan

Revision history for this message
Seyeong Kim (seyeongkim) wrote :

Hello,

I've set up a test environment and have been making API calls using the command below:

curl --header "Authorization: OAuth oauth_version=1.0, oauth_signature_method=PLAINTEXT, oauth_consumer_key=$API_KEY1, oauth_token=$API_KEY2, oauth_signature=&$API_KEY3, oauth_nonce=$(uuidgen), oauth_timestamp=$(date +%s)" $MAAS_URL/MAAS/api/2.0/machines/

This has been running for about a day.

Initially, it began with around 130,000 kB in RSS (per ps), but now it has increased to around 400,000 kB. It appears that there may be a memory leak in that part of the system.

Here's what I've found during my testing:

I removed all services in eventloop.py and tested only service_monitor. It was OK.

I used the maas CLI to check machines; Bash encountered out-of-memory (OOM) errors three times in several attempts. Strangely, it was fine while running the Bash 'for' loop 10,000 times, but after completing or restarting it, it suddenly hit OOM, exhausting all memory in the VM.

With the MAAS API, I noticed a gradual increase in memory usage over time.

Please give me any advice on how to solve this issue. I'll also check the MAAS code to see if there is something I can do.

Thanks.

Seyeong Kim (seyeongkim)
tags: added: sts
Revision history for this message
Seyeong Kim (seyeongkim) wrote :

The weird thing is that after I isolated the API, I don't see that much of a memory leak.
I see some loops using memory when they access Django model objects, but it is not always happening.
I still think some Django model object has an issue, but I can't find a concrete clue yet.
Any advice?
Thanks a lot.

Revision history for this message
Seyeong Kim (seyeongkim) wrote :

I've checked the process with memray, and found something.

The python-apt library (apt_pkg) could have a memory leak.

Please refer to the link below; it explains why the snap version doesn't have the issue.

https://lists.debian.org/deity/2014/04/msg00092.html
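A generic way to confirm a native-level leak like this is an RSS-delta check around the suspected call; in the sketch below, `suspect` is a stand-in for the apt_pkg-based get_deb_versions_info (this is illustrative, not MAAS code):

```python
import gc
import resource

def peak_rss_kib() -> int:
    """Peak resident set size of this process, in KiB on Linux."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

def rss_growth(suspect, iterations: int = 1_000) -> int:
    """Call `suspect` repeatedly and return peak-RSS growth in KiB.

    Python-level garbage is collected between calls, so steady growth
    here points at native allocations (e.g. in a C extension like
    apt_pkg) that tracemalloc cannot see.
    """
    gc.collect()
    before = peak_rss_kib()
    for _ in range(iterations):
        suspect()
        gc.collect()
    return peak_rss_kib() - before

# A well-behaved call should show little or no growth.
print(rss_growth(lambda: [0] * 1_000, iterations=100))
```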

Revision history for this message
Seyeong Kim (seyeongkim) wrote :

If I replace the contents of provisioningserver/utils/deb.py

with

# return a static result instead of using the apt_pkg library
def get_deb_versions_info(apt_pkg=None) -> Optional[DebVersionsInfo]:
    ...
    return DebVersionsInfo(
        current=DebVersion(
            version='1:3.2.10-12065-g.0093bc7ec-0ubuntu1~20.04.1',
            origin='http://ppa.launchpad.net/maas/3.2/ubuntu/ focal/main'
        ),
        update=None
    )

then memory is not leaking.

Revision history for this message
Igor Gnip (igorgnip) wrote :

Hello, is this bug now EXPIRED or is it going to be fixed ?

Revision history for this message
Seyeong Kim (seyeongkim) wrote :

@igorgnip

No, I'm working on it. I've changed its status.

Thanks.

Changed in maas:
status: Expired → In Progress
assignee: nobody → Seyeong Kim (seyeongkim)
Changed in maas:
milestone: none → 2.8.9
status: In Progress → Fix Committed
Changed in maas:
importance: Undecided → High
milestone: 2.8.9 → 3.6.0
Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Targeting the bug down to MAAS 3.2, per an internal ticket from Seyeong (backports underway) and a conversation with Jerzy.
