Changing the prometheus web-listen-port leaves the charm in a permanent error state

Bug #1893320 reported by Trent Lloyd
This bug affects 5 people

Affects: Grafana Charm
Status: Fix Released
Importance: Undecided
Assigned to: Brett Milford

Bug Description

* Problem Description *

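Changing the related prometheus charm's web-listen-port leaves the grafana charm stuck in an error state in two different ways, described in (1) and (3) below. Item (2) explains why the first failure also blocks the hook that would fix it.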

(1) If an update-status hook is scheduled after the port has changed but before the grafana-source-relation-changed hook has run to update the URL, the update-status hook gets stuck in error. It tries to query the old URL, and the resulting connection failure exception bubbles up as a hook error:

File "charm/reactive/grafana.py", line 580, in configure_sources
  generate_prometheus_dashboards(gf_adminpasswd, ds)
File "charm/reactive/grafana.py", line 853, in generate_prometheus_dashboards
  response = requests.get("{}/api/v1/label/__name__/values".format(ds["url"]))
requests.exceptions.ConnectionError: HTTPConnectionPool(host='10.5.1.43', port=9090): Max retries exceeded with url: /api/v1/label/__name__/values (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fb1be6beda0>: Failed to establish a new connection: [Errno 111] Connection refused',))
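
One way to keep a temporarily unreachable Prometheus from failing the whole hook is to catch the connection error around the metrics query and let the caller decide what to do. The following is a minimal sketch, not the charm's code: the function name fetch_metric_names and its opener parameter are hypothetical, and it uses the standard library's urllib instead of requests so that it is self-contained. With requests, the equivalent would be catching requests.exceptions.RequestException.

```python
import json
import urllib.request


def fetch_metric_names(base_url, opener=urllib.request.urlopen, timeout=5):
    """Return the metric names Prometheus reports, or None if it is unreachable.

    Hypothetical helper: `opener` exists only so the HTTP call can be swapped out.
    """
    url = "{}/api/v1/label/__name__/values".format(base_url.rstrip("/"))
    try:
        with opener(url, timeout=timeout) as response:
            payload = json.load(response)
    except (OSError, ValueError) as exc:
        # URLError and ConnectionRefusedError are OSError subclasses;
        # ValueError covers a body that is not valid JSON.
        print("Could not query {}: {}".format(url, exc))
        return None
    return payload.get("data", [])
```

A caller that gets None back could skip dashboard generation and report a status instead of raising, so that the pending grafana-source-relation-changed hook still gets a chance to run.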

(2) Because juju hooks are ordered, this update-status failure prevents the grafana-source-relation-changed hook from running: juju keeps re-executing the update-status hook with the same context and the old view of the configuration it had before it failed. We can bypass this with "juju resolved --no-retry grafana/0", which should allow it to progress to the changed hook.

(3) Once the grafana-source-relation-changed hook runs (regardless of whether you got stuck in an update-status hook or not), we get a different error:

File "/var/lib/juju/agents/unit-grafana-0/charm/reactive/grafana.py", line 578, in configure_sources
  check_datasource(ds)
File "/var/lib/juju/agents/unit-grafana-0/charm/reactive/grafana.py", line 687, in check_datasource
  cur.execute(stmt, values)
sqlite3.IntegrityError: UNIQUE constraint failed: data_source.org_id, data_source.name

This happens because the relevant code is keying off of the URL to update entries in the database:
"if row[1] == ds["type"] and row[3] == ds["url"]:" from https://git.launchpad.net/charm-grafana/tree/src/reactive/grafana.py?h=stable/20.08#n677

Since that check fails, the charm attempts to add a new data source. However, the grafana sqlite database has a UNIQUE constraint on (org_id, name), so this also fails.
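
The failure can be reproduced outside the charm with an in-memory database. This is a sketch under assumptions: the table below is a simplified, hypothetical stand-in for Grafana's data_source table (only the columns needed here), and upsert_datasource is an illustrative helper, not the charm's check_datasource. It shows the blind INSERT hitting the UNIQUE constraint, and an update keyed on (org_id, name) succeeding where a URL match cannot.

```python
import sqlite3


def upsert_datasource(conn, org_id, name, ds_type, url):
    """Update the row identified by (org_id, name); insert only if none exists."""
    cur = conn.execute(
        "UPDATE data_source SET type = ?, url = ? WHERE org_id = ? AND name = ?",
        (ds_type, url, org_id, name),
    )
    if cur.rowcount == 0:
        conn.execute(
            "INSERT INTO data_source (org_id, name, type, url) VALUES (?, ?, ?, ?)",
            (org_id, name, ds_type, url),
        )
    conn.commit()


# Simplified, hypothetical stand-in for Grafana's data_source table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data_source ("
    "id INTEGER PRIMARY KEY, org_id INTEGER, name TEXT, type TEXT, url TEXT, "
    "UNIQUE (org_id, name))"
)
name = "prometheus - Juju generated source"
upsert_datasource(conn, 1, name, "prometheus", "http://10.5.1.43:9090")

# After the port change the URL no longer matches, so a URL-keyed check
# falls through to an INSERT, which violates UNIQUE (org_id, name).
try:
    conn.execute(
        "INSERT INTO data_source (org_id, name, type, url) VALUES (?, ?, ?, ?)",
        (1, name, "prometheus", "http://10.5.1.43:80"),
    )
except sqlite3.IntegrityError as exc:
    print("insert failed:", exc)

# Keying on (org_id, name) updates the existing row instead.
upsert_datasource(conn, 1, name, "prometheus", "http://10.5.1.43:80")
print(conn.execute("SELECT name, url FROM data_source").fetchall())
```

Keying on the name has the trade-off discussed under Suggested Solution: a data source renamed in the Grafana UI would no longer match.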

* Reproducer *

juju deploy cs:grafana --config port=3000 --config install_method=snap
juju deploy cs:prometheus2 prometheus
juju add-relation prometheus:grafana-source grafana:grafana-source
juju add-relation telegraf:prometheus-client prometheus:target

<wait for deployment to settle>

juju config prometheus web-listen-port=80

* Suggested Solution *

This code was previously modified NOT to check the name, to allow users to change the name in the Grafana configuration editor to a friendly name they prefer:
https://git.launchpad.net/charm-grafana/commit?h=stable/20.08&id=7540bfadb1cd717ad1c3b44872aa142e97e8308a

So to fix this we will need to either revert that ability or find some other way to key the change. Possible ideas:
 - storing some kind of tag/metadata
 - using the data source description (currently set to "name - Juju generated source")
 - using the charmhelpers that store the old configuration data to check the old URL

* Workaround *
(1) Resolve the broken update-status hook that is failing with "Failed to establish a new connection: [Errno 111] Connection refused":

juju resolved grafana/0 --no-retry

(2) Watch "juju debug-log grafana/0" and wait for grafana-source-relation-changed to fail with the error "sqlite3.IntegrityError: UNIQUE constraint failed: data_source.org_id, data_source.name" instead

If you get more hook failures with the "Connection refused" error, re-run the resolved command and wait again and hopefully you will get to the UNIQUE constraint error.

(3) Manually update the grafana source list to use the new URL

You can attempt to do this through the Grafana UI. Settings Menu -> Data Sources -> Click the relevant entry.

However, for some reason, if you set up a very simple reproduction environment, this page throws an error in the Grafana UI: "TypeError: Cannot read property 'timeInterval' of undefined". I assume this is because the reproduction environment only has prometheus2/grafana and no data source such as telegraf. In such a case, we can update the sqlite database file manually:

juju ssh grafana/0 sudo -i
apt-get install sqlite3
sqlite3 /var/snap/grafana/common/data/grafana.db
SELECT * FROM data_source;
UPDATE data_source SET url='http://IP_HOST:PORT' WHERE url='http://IP_HOST:OLD_PORT';
.quit

(4) Mark the error resolved WITHOUT specifying --no-retry, so that the hook is retried; it should now succeed.

juju resolved grafana/0
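
If installing the sqlite3 command-line tool on the unit is not an option, the UPDATE from step (3) can also be run with Python's built-in sqlite3 module. This is a minimal sketch, assuming python3 is available on the unit and that the script is run as root; repoint_datasource is a hypothetical helper name.

```python
import sqlite3


def repoint_datasource(db_path, old_url, new_url):
    """Rewrite data_source.url from old_url to new_url; return the rows changed."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            cur = conn.execute(
                "UPDATE data_source SET url = ? WHERE url = ?",
                (new_url, old_url),
            )
        return cur.rowcount
    finally:
        conn.close()
```

For example, repoint_datasource("/var/snap/grafana/common/data/grafana.db", "http://IP_HOST:OLD_PORT", "http://IP_HOST:PORT") mirrors the manual UPDATE above. A return value of 0 means no row had the old URL.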

Tags: sts

Changed in charm-grafana:
status: New → Confirmed
tags: added: sts
Brett Milford (brettmilford) wrote :

So it's interesting to note that the code doesn't traverse the UPDATE path because we're comparing URLs including the port number, which by this point has changed.

If it's possible to capture and compare the previous URL to be sure we're updating the same entry, this would be ideal.

Otherwise I think it's sufficient to compare the rest of the URL except the port.
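
For illustration, a port-insensitive comparison could look like the sketch below. It is not taken from the charm; the function name same_endpoint_ignoring_port is hypothetical, and it only uses urllib.parse from the standard library.

```python
from urllib.parse import urlsplit


def same_endpoint_ignoring_port(url_a, url_b):
    """Compare scheme, host and path of two URLs while ignoring the port."""
    a, b = urlsplit(url_a), urlsplit(url_b)
    return (a.scheme, a.hostname, a.path.rstrip("/")) == (
        b.scheme,
        b.hostname,
        b.path.rstrip("/"),
    )


# The old and new Prometheus URLs from this report differ only in the port.
print(same_endpoint_ignoring_port("http://10.5.1.43:9090", "http://10.5.1.43:80"))
```

One caveat with this approach: two data sources of the same type on the same host but on different ports would be treated as the same entry.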

Brett Milford (brettmilford) wrote :

Another option might be to separate out joined/change hooks to trigger different flags for insert vs update.

https://git.launchpad.net/interface-grafana-source/tree/requires.py#n9

Changed in charm-grafana:
assignee: nobody → Brett Milford (brettmilford)
Drew Freiberger (afreiberger) wrote :

The new dashboard relations still hard-code 'prometheus - juju configured datasource' as the datasource name, so data source names can't actually be changed successfully any longer. Given that, I'm okay with reverting the prior commit.

Celia Wang (ziyiwang)
Changed in charm-grafana:
milestone: none → 21.01
status: Confirmed → Fix Committed
status: Fix Committed → Fix Released
Chris Johnston (cjohnston) wrote :

After this fix I'm now running into a scenario where grafana goes into an indefinite blocked state.

I've deployed prometheus-21 and grafana-39 with my Kubernetes deployment. After things settle I change the web-listen-port to 80 and see:

grafana/0* blocked idle 4 10.5.1.77 3000/tcp Exception reaching prometheus API whilst updating dashboards

I took a look at the logs for grafana and I see that it looks like it's getting the updated port:

2021-02-25 00:59:06 INFO juju-log Invoking reactive handler: reactive/grafana.py:578:wipe_nrpe_checks
2021-02-25 00:59:06 INFO juju-log Invoking reactive handler: reactive/grafana.py:596:configure_sources
2021-02-25 00:59:06 INFO juju-log Found datasource: {'service_name': 'prometheus', 'type': 'prometheus', 'url': 'http://10.5.2.227:80', 'description': 'Juju generated source'}
2021-02-25 00:59:06 INFO juju-log Datasource already exist, updating: prometheus - Juju generated source
2021-02-25 00:59:06 INFO juju-log Checking Dashboard Template: CephCluster.json.j2
2021-02-25 00:59:06 DEBUG juju-log Skipping Dashboard Template: CephCluster.json.j2 missing 31 metrics.Missing: ceph_client_io_read_ops, ceph_osds, ceph_osds_down, ceph_osd_perf_apply_latency_seconds, ceph_cluster_used_bytes, ceph_cluster_capacity_bytes, ceph_osd_perf_commit_latency_seconds, ceph_misplaced_objects, ceph_monitor_quorum_count, ceph_stale_pgs, ceph_undersized_pgs, ceph_degraded_pgs, ceph_osd_up, ceph_stuck_stale_pgs, ceph_client_io_write_bytes, ceph_degraded_objects, ceph_pool_available_bytes, ceph_unclean_pgs, ceph_client_io_write_ops, ceph_health_status, ceph_recovery_io_bytes, ceph_osds_in, ceph_recovery_io_keys, ceph_recovery_io_objects, ceph_cluster_available_bytes, ceph_cluster_objects, ceph_stuck_unclean_pgs, ceph_stuck_degraded_pgs, ceph_osd_pgs, ceph_client_io_read_bytes, ceph_stuck_undersized_pgs
2021-02-25 00:59:06 INFO juju-log Checking Dashboard Template: Swift.json.j2
2021-02-25 00:59:06 DEBUG juju-log Skipping Dashboard Template: Swift.json.j2 missing 13 metrics.Missing: exec_swiftparts_object_handoff, exec_swiftparts_account_handoff, exec_swiftparts_object_primary, object_server_async_pendings, swift_disk_usage_bytes, exec_swiftparts_container_misplaced, swift_replication_stats, swift_replication_duration_seconds, exec_swiftparts_account_misplaced, exec_swiftparts_account_primary, exec_swiftparts_container_primary, exec_swiftparts_container_handoff, exec_swiftparts_object_misplaced
2021-02-25 00:59:06 INFO juju-log Checking Dashboard Template: OpenStackCloud.json.j2
2021-02-25 00:59:06 DEBUG juju-log Skipping Dashboard Template: OpenStackCloud.json.j2 missing 16 metrics.Missing: nova_resources_ram_mbs, hypervisor_disk_gbs_total, nova_resources_disk_gbs, hypervisor_vcpus_used, hypervisor_disk_gbs_used, hypervisor_memory_mbs_used, neutron_net_size, nova_resources_vcpus, nova_instances, hypervisor_memory_mbs_total, hypervisor_running_vms, openstack_allocation_ratio, openstack_exporter_cache_age_seconds, hypervisor_vcpus_total, openstack_exporter_cache_refresh_duration_seconds, hypervisor_schedulable_instances
2021-02-25 00:59:06 INFO juju-log Checking Dashboard Template: CephOSD....

Brett Milford (brettmilford) wrote :

@cjohnston I've added a commit to address this issue. Can you please help test it in your environment?

Joe Guo (guoqiao) wrote :

I also hit this issue, with prometheus2 rev 22:

juju status
...
prometheus active 1 prometheus2 jujucharms 22 ubuntu
...
grafana/0* blocked idle 0/lxd/2 10.98.160.214 3000/tcp Exception reaching prometheus API whilst updating dashboards

Joe Guo (guoqiao) wrote :

Ok, I can confirm the issue is fixed in latest master, with the patch from @brettmilford. Thank you!

Celia Wang (ziyiwang)
Changed in charm-grafana:
milestone: 21.01 → 21.07
sahul (buddy001) wrote :

Hi,

I get this error:

Error: Exception reaching prometheus API whilst updating dashboards

Please let me know how to apply the patch in an existing environment.
