MAAS slow performance + growing database

Bug #1830365 reported by Peter Sabaini
This bug affects 7 people
Affects | Status       | Importance | Assigned to | Milestone
MAAS    | Fix Released | High       | Unassigned  |

Bug Description

On a long running installation I'm seeing performance problems when commissioning nodes.

The installation has 130+ nodes and is running MAAS 2.3.5 (6511-gf466fdb-0ubuntu1) under xenial.

When commissioning 4 nodes, the web interface becomes unresponsive (a node overview page takes 15s+ to render), and a "maas <profile> node read xxx" takes up to 9s instead of the 2s it takes when idle.

One thing I'm noticing is a lot of "maas@maasdb ERROR: could not serialize access due to concurrent update" lines in the postgres log.
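
For context, this message is PostgreSQL's serialization failure (SQLSTATE 40001). Clients are normally expected to handle it by retrying the whole transaction, so a lot of these under load means a lot of repeated work. A minimal sketch of such a retry loop, where `SerializationFailure` and `run_with_retry` are hypothetical names for illustration and not MAAS or database driver APIs:

```python
import random
import time


class SerializationFailure(Exception):
    """Hypothetical stand-in for a driver error carrying SQLSTATE 40001."""


def run_with_retry(txn, attempts=5, base_delay=0.05, sleep=time.sleep):
    """Call txn() and retry with jittered exponential backoff.

    Only serialization failures are retried; any other exception
    propagates immediately. The last failure is re-raised once the
    attempt budget is used up.
    """
    for attempt in range(attempts):
        try:
            return txn()
        except SerializationFailure:
            if attempt == attempts - 1:
                raise
            # Back off a little longer each time, with jitter so that
            # competing transactions do not retry in lockstep.
            sleep(base_delay * (2 ** attempt) * (1 + random.random()))
```

Each retry re-runs the full transaction, so the more concurrent writers touch the same rows, the more time is spent redoing work instead of serving requests.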

Looking at the database, it seems that some tables in postgres have become quite big, especially the maasserver_event table and its indexes:

maasdb=# SELECT nspname || '.' || relname AS "relation",
maasdb-# pg_size_pretty(pg_relation_size(C.oid)) AS "size"
maasdb-# FROM pg_class C
maasdb-# LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
maasdb-# WHERE nspname NOT IN ('pg_catalog', 'information_schema')
maasdb-# ORDER BY pg_relation_size(C.oid) DESC
maasdb-# LIMIT 20;
                       relation                       |  size
------------------------------------------------------+---------
 public.maasserver_event                              | 2472 MB
 public.maasserver_event_node_id_3f03c875fc2d72eb_idx | 726 MB
 public.maasserver_event_94757cae                     | 487 MB
 public.maasserver_event_c693ebc8                     | 484 MB
 public.maasserver_event__created                     | 386 MB
 public.maasserver_event_pkey                         | 385 MB

maasdb=# SELECT *, pg_size_pretty(total_bytes) AS total
    , pg_size_pretty(index_bytes) AS INDEX
    , pg_size_pretty(toast_bytes) AS toast
    , pg_size_pretty(table_bytes) AS TABLE
  FROM (
  SELECT *, total_bytes-index_bytes-COALESCE(toast_bytes,0) AS table_bytes FROM (
      SELECT c.oid,nspname AS table_schema, relname AS TABLE_NAME
              , c.reltuples AS row_estimate
              , pg_total_relation_size(c.oid) AS total_bytes
              , pg_indexes_size(c.oid) AS index_bytes
              , pg_total_relation_size(reltoastrelid) AS toast_bytes
          FROM pg_class c
          LEFT JOIN pg_namespace n ON n.oid = c.relnamespace
          WHERE relkind = 'r' AND relname like 'maasserver_event%'
  ) a
) a;
  oid  | table_schema |      table_name      | row_estimate | total_bytes | index_bytes | toast_bytes | table_bytes |  total  |  index  |   toast    |  table
-------+--------------+----------------------+--------------+-------------+-------------+-------------+-------------+---------+---------+------------+---------
 19125 | public       | maasserver_event     |  1.64169e+07 |  5180358656 |  2587885568 |        8192 |  2592464896 | 4940 MB | 2468 MB | 8192 bytes | 2472 MB
 19136 | public       | maasserver_eventtype |           42 |      114688 |       65536 |        8192 |       40960 | 112 kB  | 64 kB   | 8192 bytes | 40 kB
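
As a quick sanity check on the figures above, the numbers from the two query outputs can be cross-checked with a few lines of arithmetic. This is only a sketch: the values are copied from the output above, and the variable names are made up for illustration.

```python
MB = 1024 * 1024

# Per-index sizes as reported (already rounded) by pg_size_pretty, in MB.
index_sizes_mb = {
    "maasserver_event_node_id_3f03c875fc2d72eb_idx": 726,
    "maasserver_event_94757cae": 487,
    "maasserver_event_c693ebc8": 484,
    "maasserver_event__created": 386,
    "maasserver_event_pkey": 385,
}

# Values for maasserver_event from the second query.
total_bytes = 5180358656
index_bytes = 2587885568
table_bytes = 2592464896
row_estimate = 1.64169e7

index_sum_mb = sum(index_sizes_mb.values())
index_share = index_bytes / total_bytes
bytes_per_row = table_bytes / row_estimate

print(f"sum of listed indexes: {index_sum_mb} MB")
print(f"index_bytes:           {index_bytes / MB:.0f} MB")
print(f"index share of total:  {index_share:.1%}")
print(f"heap bytes per row:    {bytes_per_row:.0f}")
```

The five listed indexes add up to 2468 MB, which matches index_bytes, so the indexes account for about half of the total relation size. The heap works out to roughly 158 bytes per event row.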

From inspecting it, this table appears to keep all events since the initial installation. It also defines quite a few btree indexes, which could slow down updates and inserts.

Do you think this could explain the sluggishness when commissioning? Should those tables be trimmed as part of regular maintenance?
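
To illustrate what such trimming might look like, here is a rough sketch that deletes old rows in small batches so that each transaction stays short. It uses Python's sqlite3 with a toy `event` table as a stand-in to stay self-contained; the table layout, the `created` column, the 90-day retention window, and the `prune_events` helper are all assumptions for illustration, not the MAAS schema or a supported procedure.

```python
import sqlite3
from datetime import datetime, timedelta


def prune_events(conn, cutoff, batch_size=1000):
    """Delete rows older than cutoff in batches; return rows deleted."""
    deleted = 0
    while True:
        cur = conn.execute(
            "DELETE FROM event WHERE id IN ("
            " SELECT id FROM event WHERE created < ? ORDER BY id LIMIT ?)",
            (cutoff.isoformat(), batch_size),
        )
        conn.commit()  # one short transaction per batch
        if cur.rowcount == 0:
            return deleted
        deleted += cur.rowcount


# Toy data: one event per day for 400 days, counting back from an
# arbitrary reference date.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE event (id INTEGER PRIMARY KEY, created TEXT, description TEXT)"
)
now = datetime(2020, 1, 1)
rows = [((now - timedelta(days=d)).isoformat(), f"event {d}") for d in range(400)]
conn.executemany("INSERT INTO event (created, description) VALUES (?, ?)", rows)
conn.commit()

removed = prune_events(conn, cutoff=now - timedelta(days=90), batch_size=50)
remaining = conn.execute("SELECT COUNT(*) FROM event").fetchone()[0]
print(removed, remaining)
```

On PostgreSQL, note that deleting rows does not by itself shrink the files on disk: a plain VACUUM only makes the space reusable, so the table and index sizes reported above would not drop immediately after a trim.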

Peter Sabaini (peter-sabaini) wrote :

I'm uploading logs here (sorry Canonical only):
https://private-fileshare.canonical.com/~sabaini/lp1830365/

Also the above queries with slightly less horrible formatting:
https://paste.ubuntu.com/p/nSB5kPv3n3/

Lee Trager (ltrager)
Changed in maas:
status: New → In Progress
importance: Undecided → High
assignee: nobody → Lee Trager (ltrager)
milestone: none → 2.3.6
Lee Trager (ltrager) wrote :

I've backported a number of performance improvements from master to 2.3 which should significantly reduce memory usage over the websocket. You can test [1], which has all the patches applied.

[1] https://code.launchpad.net/~ltrager/maas/+git/maas/+merge/368115

Changed in maas:
status: In Progress → Fix Committed
Changed in maas:
assignee: Lee Trager (ltrager) → nobody
Jeff Lane  (bladernr) wrote :

Hey, this was marked Fix Committed ages ago... did this actually get released and resolved?

Jacopo Rota (r00ta) wrote :

I think this bug report mixes two things: the user reported slow performance during commissioning, and he also did some investigation of the events table.

The fix from the MAAS team was to include some commissioning improvements, but apparently no action was taken on the events table.
At the current stage (we are about to release 3.4 and have already worked on 3.5 for one cycle), we are not rotating the events table, which keeps growing. I have observed slow performance due to the size of the events table.
For this reason, I think we can point this bug to https://bugs.launchpad.net/maas/+bug/2044895, which is the bug where we suspect that the events table is causing the trouble.

Changed in maas:
status: Fix Committed → Triaged
status: Triaged → Fix Released