Merge ~superalpaca/ols-charm-deps:SN1460-add-prometheus-multiproc-dir-config into ols-charm-deps:master

Proposed by Deep Fowdar
Status: Merged
Approved by: Deep Fowdar
Approved revision: 4b199e036fe1bb7813b6ed1535122953dc1e3157
Merge reported by: Otto Co-Pilot
Merged at revision: not available
Proposed branch: ~superalpaca/ols-charm-deps:SN1460-add-prometheus-multiproc-dir-config
Merge into: ols-charm-deps:master
Diff against target: 65 lines (+35/-0)
3 files modified
layer/ols-wsgi/config.yaml (+10/-0)
layer/ols-wsgi/lib/ols/gunicorn.py (+7/-0)
layer/ols-wsgi/templates/config.tmpl (+18/-0)
Reviewers:
  Guillermo Gonzalez: Approve
  Przemysław Suliga: Approve
Review via email: mp+439130@code.launchpad.net

Commit message

Add prometheus multi-process support.

The PROMETHEUS_MULTIPROC_DIR env var can be specified via the config
option 'wsgi_prometheus_multiproc_dir'.

Part of: SN-1460

Description of the change

Follows on from the approach suggested by Przemek in https://code.launchpad.net/~superalpaca/snapstore-common/+git/snapstore-common/+merge/439018.

Ticket: https://warthogs.atlassian.net/browse/SN-1460

I've tested this change with the snapmodels charm, and it seems to work fine.

snapmodels MP: https://code.launchpad.net/~superalpaca/snapmodels/+git/snapmodels/+merge/439131
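
For context (not part of this change): once PROMETHEUS_MULTIPROC_DIR points at a shared directory, a service typically exposes aggregated metrics via the prometheus_client multiprocess collector. A minimal sketch, assuming only the standard prometheus_client API:

from prometheus_client import CollectorRegistry, generate_latest, multiprocess

def metrics_view():
    # Collect the per-worker metric files written into the directory named by
    # PROMETHEUS_MULTIPROC_DIR and render them as one exposition payload.
    registry = CollectorRegistry()
    multiprocess.MultiProcessCollector(registry)
    return generate_latest(registry)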

Revision history for this message
Przemysław Suliga (suligap) wrote :

Nice, thanks!

Just skimmed it but it looks good.

Thoughts on a boolean wsgi_prometheus_multiproc setting instead (off by default)?:

- on_starting() run in the arbiter could then simply always [re]create the dir if the setting is on (and add its PID in the name).

- The users of the layer probably don't care about the actual dir.

- It'd basically be an option that "handles proper prometheus client setup for multiproc".

Exposing wsgi_prometheus_multiproc_dir would need to be explained in more detail, e.g. "It will get wiped on restarts".
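
A rough sketch of what that suggestion could look like (the path and names here are illustrative, not taken from this MP):

import os
import shutil

def on_starting(server):
    # Arbiter hook: always [re]create a fresh multiproc dir on (re)start,
    # keyed by the arbiter's PID, and point the prometheus client at it.
    path = '/tmp/prometheus_multiproc_%d' % os.getpid()
    shutil.rmtree(path, ignore_errors=True)
    os.makedirs(path, mode=0o700)
    os.environ['PROMETHEUS_MULTIPROC_DIR'] = path  # inherited by forked workers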

Revision history for this message
Przemysław Suliga (suligap) :
Revision history for this message
Guillermo Gonzalez (verterok) wrote :

How does this play with existing talisker-based services that are already fiddling with on_starting and child_exit?

review: Needs Information
Revision history for this message
Maximiliano Bertacchini (maxiberta) wrote (last edit):

It looks good! As an alternative, what about skipping any extra config option and just adding the multiproc logic if `worker_class == "uvicorn.workers.UvicornWorker"`, with a temp dir? That's what talisker does automatically when running with multiple workers and no explicit env var has been set (which is how our services run in staging/prod, afaics).
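
A hypothetical sketch of that alternative (the helper name and surrounding variables are illustrative, not code from this MP):

import tempfile

def maybe_setup_multiproc(wsgi_config, env_extra):
    # Only wire up a multiproc dir when the async worker class is in use
    # and no explicit env var has been configured.
    async_worker = wsgi_config.get('worker_class') == 'uvicorn.workers.UvicornWorker'
    if async_worker and 'PROMETHEUS_MULTIPROC_DIR' not in env_extra:
        env_extra['PROMETHEUS_MULTIPROC_DIR'] = tempfile.mkdtemp(
            prefix='prometheus_multiproc_')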

Revision history for this message
Przemysław Suliga (suligap) wrote :

> how does this play with existing talisker based services that already
> are fiddling with on_starting and child_exit?

The simple answer, I think, is that this behavior is opt-in and it's very
unlikely you'd want to use it with talisker. But that's a good question.
I'd expect talisker to do the right things: honor the env var and also
run the provided hooks. I don't know for sure, though.

> It looks good! As an alternative, what about skipping any extra config
> option and just adding the multiproc logic if `worker_class ==
> "uvicorn.workers.UvicornWorker"`, with a temp dir? That's what talisker
> does automatically when running with multiple workers and no explicit env
> var has been set (which is how our services run in staging/prod,
> afaics).

I think these are separate concerns. E.g. you might want to use a single
uvicorn worker and not have the charm meddle with how prometheus is set
up?

Hmm, if anything I think the condition could be based on whether you're
using talisker or not. But again, it's a bit magical then?

Revision history for this message
Maximiliano Bertacchini (maxiberta) wrote :

> > It looks good! As an alternative, what about skipping any extra config
> > option and just adding the multiproc logic if `worker_class ==
> > "uvicorn.workers.UvicornWorker"`, with a temp dir? That's what talisker
> > does automatically when running with multiple workers and no explicit env
> > var has been set (which is how our services run in staging/prod,
> > afaics).
>
> I think these are separate concerns. E.g. you might want to use a single
> uvicorn worker and not have the charm meddle with how prometheus is set
> up?

You're right! But that was just an example. My question was really: "Can we skip the config option and be smarter about it by handling this automatically?" Not really sure how :/

> Hmm, if anything I think the condition could be based on whether you're
> using talisker or not. But again, it's a bit magical then?

It's not too magical to me; it's preventing metrics corruption :). Otherwise, as I asked in the other MP, what should we do if running in async mode but the env var has not been set? Anyway, if there's no simple way to deduce it, let's just add the flag explicitly. I agree that it should be a boolean.

Revision history for this message
Deep Fowdar (superalpaca) wrote (last edit):

Thanks for the reviews everyone!

I think it makes sense to update the config option to be a boolean (false/off by default), so I've gone ahead and done that. If this option is set, then gunicorn will override the PROMETHEUS_MULTIPROC_DIR env vars, provided we have more than 1 worker.

I'm not too sure how this will interact with talisker, but it's probably best not to set this when talisker is being used. Maybe adding a check for talisker (like https://pastebin.canonical.com/p/yVRpQ5k9jt/) could work?
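
The pastebin isn't reproduced here, but the shape of such a guard might be something like the following (entirely hypothetical; how talisker is actually detected is the open question):

def should_set_multiproc_env(charm_config, service_uses_talisker):
    # Talisker-based services already manage prometheus multiproc themselves,
    # so only set the env vars for plain gunicorn services.
    return (charm_config['wsgi_use_prometheus_multiproc']
            and not service_uses_talisker)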

Revision history for this message
Guillermo Gonzalez (verterok) wrote (last edit):

Maybe we could add to the config description that this is only needed for non-wsgi/talisker services? (as those will do the right thing)

Revision history for this message
Deep Fowdar (superalpaca) wrote :

> Maybe we could add to the config description that this is only needed for
> non-wsgi/talisker services? (as those will do the right thing)

Makes sense, will add this.

Revision history for this message
Przemysław Suliga (suligap) wrote :

This looks great, but please check a few small comments.

Revision history for this message
Przemysław Suliga (suligap) wrote :

Found one more issue and proposed a solution.

Revision history for this message
Deep Fowdar (superalpaca) wrote :

Thanks for checking this again. I've replied to the diff comments and will apply the suggested changes/fixes.

Revision history for this message
Przemysław Suliga (suligap) wrote :

Thank you, +1.

review: Approve
Revision history for this message
Guillermo Gonzalez (verterok) wrote :

Hi,

I think it would be nice if we could re-use the existing mechanisms in ols-layers to set environment variables and handle gunicorn restarts.

Here is a quick example: https://pastebin.canonical.com/p/G5T8ZjVnbv/

Some context on the chat with suligap about this: https://chat.canonical.com/canonical/pl/3zk7zkzyqj8kdrmswbyq61n43e

Let me know what you think about the proposed changes.

Also, I'm not sure this is covering all the edge cases (after looking at talisker), but let's get this in first and worry about the edge cases later ;-)

Revision history for this message
Deep Fowdar (superalpaca) wrote :

I've had a look at the new changes and I think they're better than the current ones. I had no idea about the systemd private tmp dirs, so TIL. I tried out the patch you linked but got the following error on charm startup:
"""
  File "/srv/snapmodels/code/env/lib/python3.10/site-packages/gunicorn/arbiter.py", line 534, in reap_workers
    self.cfg.child_exit(self, worker)
  File "/srv/gunicorn/snapmodels.py", line 31, in child_exit
    multiprocess.mark_process_dead(worker.pid)
AttributeError: partially initialized module 'prometheus_client.multiprocess' has no attribute 'mark_process_dead' (most likely due to a circular import)
"""
I moved the import out of the function to the top level and that seems to have fixed it.

I'll push through the changes soon enough though.
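
For reference, the fix described above amounts to moving the import to module level in the generated gunicorn config file (a sketch, not the pastebin patch itself):

# Module-level import avoids hitting a partially initialised module from the
# child_exit hook.
from prometheus_client import multiprocess

def child_exit(server, worker):
    # Clean up the dead worker's metric files in the multiproc dir.
    multiprocess.mark_process_dead(worker.pid)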

Revision history for this message
Guillermo Gonzalez (verterok) :
review: Approve

Preview Diff

diff --git a/layer/ols-wsgi/config.yaml b/layer/ols-wsgi/config.yaml
index df316d9..daad490 100644
--- a/layer/ols-wsgi/config.yaml
+++ b/layer/ols-wsgi/config.yaml
@@ -49,3 +49,13 @@ options:
     type: int
     default: 0
     description: "The max. number of seconds to wait before actually restarting/reloading"
+  wsgi_use_prometheus_multiproc:
+    type: boolean
+    default: false
+    description: >
+      Handles proper prometheus client setup for multiproc with gunicorn.
+      NOTE: This requires the prometheus client from https://github.com/prometheus/client_python.
+      Toggling this option results in a gunicorn restart.
+      This will also set the PROMETHEUS_MULTIPROC_DIR (and lowercase version) env vars if set.
+      This should not be set to true if using talisker-based services, as those should
+      handle prometheus multiproc metrics properly.
\ No newline at end of file
diff --git a/layer/ols-wsgi/lib/ols/gunicorn.py b/layer/ols-wsgi/lib/ols/gunicorn.py
index b5520db..60a3d60 100755
--- a/layer/ols-wsgi/lib/ols/gunicorn.py
+++ b/layer/ols-wsgi/lib/ols/gunicorn.py
@@ -107,8 +107,15 @@ def configure(working_dir, port, user, group, env_extra, access_logfile='-'):
         'wsgi_max_requests',
         'wsgi_logrotate_count',
         'wsgi_reuse_port',
+        'wsgi_use_prometheus_multiproc',
     ):
         wsgi_config[conf_name.replace('wsgi_', '')] = charm_config[conf_name]
+    if charm_config['wsgi_use_prometheus_multiproc']:
+        # as systemd provides private temp directories there is no need to make
+        # this unique
+        multiproc_tmp = '/tmp/prometheus_multiproc'
+        env_extra['prometheus_multiproc_dir'] = multiproc_tmp
+        env_extra['PROMETHEUS_MULTIPROC_DIR'] = multiproc_tmp
 
     wsgi_config['working_dir'] = working_dir
     service_name = sanitized_service_name()
diff --git a/layer/ols-wsgi/templates/config.tmpl b/layer/ols-wsgi/templates/config.tmpl
index 2ebc7f4..32450b6 100644
--- a/layer/ols-wsgi/templates/config.tmpl
+++ b/layer/ols-wsgi/templates/config.tmpl
@@ -20,3 +20,21 @@ reuse_port = True
 {% if access_logfile -%}
 accesslog = "{{ access_logfile }}"
 {% endif -%}
+
+{% if use_prometheus_multiproc -%}
+
+from prometheus_client import multiprocess
+
+def setup_prometheus_multiproc():
+    import os
+    os.makedirs(os.environ['prometheus_multiproc_dir'], mode=0o700, exist_ok=True)
+
+
+def on_starting(server):
+    setup_prometheus_multiproc()
+
+
+def child_exit(server, worker):
+    multiprocess.mark_process_dead(worker.pid)
+
+{% endif -%}
