Code review comment for ~mruffell/charm-grafana:lp1894011

Matthew Ruffell (mruffell) wrote :

I understand why using time.sleep() can lead to flaky tests, but in this specific example I think it is a necessary evil, and it actually simplifies my code instead of making it more complicated.

Let me explain.

When we enable dashboard backup by setting

juju config grafana dashboards_backup_schedule="* * * * *"

This creates a cronjob in /etc/cron.d/juju-dashboards-backup which runs once every 60 seconds, since the schedule is "* * * * *".

At this point, the backup directory /srv/backups doesn't exist, and we have no idea what the project / organisation name is, so we cannot watch the backup directory /srv/backups/<project name>. So far, though, I have only ever seen "Main_Org_" as the project name.

When the cronjob runs, it backs up the dashboard configuration to /srv/backups/<project name>/<dashboard name>.json.<timestamp>

A few examples:

/srv/backups/Main_Org_/juju-prometheus-libvirt-exporter-functionaltestmarker-metrics.json.20200915_233401
/srv/backups/Main_Org_/juju-prometheus-libvirt-exporter-functionaltestmarker-metrics.json.20200915_233501
/srv/backups/Main_Org_/juju-prometheus-libvirt-exporter-functionaltestmarker-metrics.json.20200915_233602
/srv/backups/Main_Org_/juju-prometheus-libvirt-exporter-libvirt-metrics.json.20200915_233401
/srv/backups/Main_Org_/juju-prometheus-libvirt-exporter-libvirt-metrics.json.20200915_233501
/srv/backups/Main_Org_/juju-prometheus-libvirt-exporter-libvirt-metrics.json.20200915_233602
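
The trailing timestamp follows the strftime format %Y%m%d_%H%M%S, so it can be parsed back with datetime.strptime. A quick illustration (not part of the charm, just showing the format):

```python
from datetime import datetime

# Example backup filename produced by the cronjob (taken from the listing above).
filename = "juju-prometheus-libvirt-exporter-libvirt-metrics.json.20200915_233602"

# The timestamp is everything after the final dot, in %Y%m%d_%H%M%S format.
stamp = filename.rsplit(".", 1)[1]
when = datetime.strptime(stamp, "%Y%m%d_%H%M%S")

print(when.isoformat())  # 2020-09-15T23:36:02
print(when.second)       # 2 -- the cronjob fired 2 seconds into the minute
```

That seconds field is exactly the part a test cannot predict in advance.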

Now, the filenames are dependent on the dashboard data sources, which means the charms grafana has a relation to cannot change without breaking any hardcoded filename.

There is also the small problem of the timestamp at the end, since it includes seconds.

I left the backup setting enabled for ten minutes or so:

https://paste.ubuntu.com/p/RVmDCqTPGx/

We can see the seconds vary between 01 and 02, depending on machine load.

Now, if I used model.block_until_file_has_contents(), I would have to predict the exact filename, which is impossible, since the seconds field might be 00, 01, 02, 03, 04 or more.

I would also have to wait for a file one minute in the future: if the cronjob has not yet fired in the current minute, no files exist, so I would need to wait for the next run.

This would turn my test code for verify_iterative_backups() into something like:

from datetime import datetime, timedelta

from zaza import model

DEFAULT_BACKUP_DIRECTORY = "/srv/backups"
DEFAULT_BACKUP_ORG = "Main_Org_"
DEFAULT_FUNCTEST_PROJECT = "juju-prometheus-libvirt-exporter"
DEFAULT_FUNCTEST_DASHBOARD = "functionaltestmarker-metrics"

def verify_iterative_backups(self):
    """Wait for new backups to appear to ensure they work."""
    current_time = datetime.utcnow()
    time_in_1_minute = current_time + timedelta(minutes=1)
    time_in_2_minutes = current_time + timedelta(minutes=2)

    file_name_1_minute = "{project}-{dashboard}.json.{date}_{time}".format(
        project=DEFAULT_FUNCTEST_PROJECT,
        dashboard=DEFAULT_FUNCTEST_DASHBOARD,
        date=time_in_1_minute.strftime('%Y%m%d'),
        time=time_in_1_minute.strftime('%H%M00')
    )
    file_name_2_minutes = "{project}-{dashboard}.json.{date}_{time}".format(
        project=DEFAULT_FUNCTEST_PROJECT,
        dashboard=DEFAULT_FUNCTEST_DASHBOARD,
        date=time_in_2_minutes.strftime('%Y%m%d'),
        time=time_in_2_minutes.strftime('%H%M00')
    )
    model.block_until_file_has_contents(
        application_name=self.application_name,
        remote_file="{backupdir}/{org}/{filename}".format(
            backupdir=DEFAULT_BACKUP_DIRECTORY,
            org=DEFAULT_BACKUP_ORG,
            filename=file_name_1_minute
        ),
        expected_contents="meta"
    )
    model.block_until_file_has_contents(
        application_name=self.application_name,
        remote_file="{backupdir}/{org}/{filename}".format(
            backupdir=DEFAULT_BACKUP_DIRECTORY,
            org=DEFAULT_BACKUP_ORG,
            filename=file_name_2_minutes
        ),
        expected_contents="meta"
    )

But then we have a flaky test: my machine could be running slowly, the file would show up with the seconds set to 01 or 02, and my test would block forever on a filename that will never appear.

Also, it would rely on the target test VM / lxd container using UTC time rather than a local timezone; if local time were used, the predicted filename would be wrong.

Also, the project names or any of the dashboard names could change.

Or, I could just wait for 65 seconds, knowing that the cronjob will have fired in that time, since it runs every 60 seconds, then do a directory listing and count the number of .json files.

If there are 1 or more, then backups work. If I wait another 65 seconds, the cronjob fires again, and I can check whether a new backup was added; if I have more backups than before, things are working.
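
That approach can be sketched roughly as below. The run_cmd callable here is a hypothetical stand-in for whatever runs a shell command on the unit and returns its stdout (in the real test this would go through zaza); the actual code is in the merge proposal.

```python
import time

BACKUP_DIR = "/srv/backups"  # default backup directory used by the charm

def count_backup_files(run_cmd):
    """Count .json backup files under BACKUP_DIR.

    run_cmd is any callable that runs a shell command on the unit and
    returns its stdout as a string; it is a stand-in for illustration,
    not a real zaza API.
    """
    out = run_cmd("find {} -name '*.json.*' | wc -l".format(BACKUP_DIR))
    return int(out.strip())

def verify_iterative_backups(run_cmd, interval=65):
    """Wait two cron intervals and check that the backup count grows."""
    time.sleep(interval)                 # cron fires at least once in 65s
    first = count_backup_files(run_cmd)
    assert first >= 1, "no backups were created"

    time.sleep(interval)                 # wait for the next cron run
    second = count_backup_files(run_cmd)
    assert second > first, "backup count did not grow"
```

No filename, timestamp, or timezone is predicted anywhere: the only assumption is that the cronjob fires at least once per 65-second window.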

I know that time.sleep() is hacky, but it doesn't rely on magic numbers / magic filenames / magic directory names, and doesn't depend on guessing the correct second to have a correct filename, or correctly guessing the timezone.

Things would be easier if I could pass a wildcard into model.block_until_file_has_contents(), since I could match on "/srv/backups/*/*.json.*_<hour><minute>*", but that would still be fragile to timezone differences between the test host and the VM / lxd container. Wildcards are impossible anyway, since we "cat" the target and no shell is involved to perform the expansion:
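
For comparison, matching such a glob is trivial once you have a plain directory listing client-side, using fnmatch from the standard library; this is only a local illustration of the pattern match, not something the current helper supports:

```python
import fnmatch

# Filenames from the listing above, plus a glob of the kind discussed in the text.
listing = [
    "juju-prometheus-libvirt-exporter-functionaltestmarker-metrics.json.20200915_233401",
    "juju-prometheus-libvirt-exporter-libvirt-metrics.json.20200915_233501",
]
pattern = "*.json.*_2334*"

matches = [name for name in listing if fnmatch.fnmatch(name, pattern)]
print(matches)  # only the 23:34 backup matches
```

Even with such a helper, predicting the <hour><minute> part would still carry the timezone problem described above.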
https://github.com/openstack-charmers/zaza/blob/adb2a59debba91cf6188c16e534353d53c3ce2a4/zaza/model.py#L1278

So, I think the implementation in the current merge request has its merits, and time.sleep() solves a lot of problems.
