e793ddc...
by
=?utf-8?q?J=C3=A9r=C3=A9mie_Galarneau?= <email address hidden>
Fix: consumerd: leak of tracing buffers on relayd connectivity issue
Observed issue
==============
A leak of the tracing buffers can be noticed when the relay daemon is
terminated following the creation of a live session, but prior to the
initiation of any applications.
The issue can be reproduced with the following steps:
# Create a live session
$ lttng create --live
# Kill the relay daemon before the allocation of the buffers
$ killall lttng-relayd
$ lttng enable-event --userspace --all
$ lttng start
# Launch an instrumented application
$ ./my_app
# Destroy all sessions
$ lttng destroy --all
# List the open file descriptors of the lttng-consumerd process
# and notice how the tracing buffer are still visible.
$ ls -lah /proc/$pid_of_lttng_consumerd/fd
The consumer daemon allocates recording channels and their tracing
buffers in a two-step process.
First, the session daemon emits an `ASK_CHANNEL_CREATION` command, which
results in the allocation of the internal consumer channel structures
and of the actual tracing buffers. The channel's unique key is returned
to the session daemon.
After this command, the channel temporarily holds a list of streams
which are waiting to be sent to the session and relay daemons as a
result of the `GET_CHANNEL` command.
At this moment, the channel's reference count is one over the number of
streams as they all hold a back-reference to their parent channel and
there is a global reference held by the session daemon.
The session daemon uses the key it received to emit the `GET_CHANNEL`
command. When executing this command, the consumer daemon attempts to
send the streams to the relay daemon.
On failure to do so, the session daemon is informed of the connection
error. The consumer daemon then omits a step of the command: the streams
are never handed-off from the channel's internal list to the
consumption/monitoring thread. This hand-off is what is internally
referred-to as making the streams "globally visible".
The session daemon, upon receiving the failure error code of the
GET_CHANNEL command, tears down its internal ust_app channel structures.
As part of that process, it emits the `DESTROY_CHANNEL` command to
reclaim the channel on the consumer daemon's end. This command is
deferred to the channel poll thread as the `CHANNEL_DEL`
internal command.
As part of this internal command, the channel poll thread cleans the
channel's stream list to clean-up any streams that are not "globally
visible": all of them, in our case.
Then, the session daemon's global reference is released which should
normally result in the reclamation of the channel itself.
While reproducing the problem, we noted that channel wasn't reclaimed
and that its reference count matched the number of CPUs on the system at
the time the `CHANNEL_DEL` command completed.
This hinted at the streams holding a reference to the channel even after
the completion of the reclamation command.
Looking at clean_channel_stream_list(), which cleans up the channel's
temporary stream list, we note that the streams' monitor property is
overridden to `false` just before the call to consumer_stream_destroy().
This is strange and a comment (added as part of 212d67a2 in 2014) hints
at a locking problem that was being worked-around. In all likelihood,
this no longer applies as the locking strategies used have evolved quite
a bit since then.
Still, setting the monitor property to `false` is problematic as, in
that mode (say, channels that are used to record a snapshot), streams do
not hold a reference to their parent channel. This causes the clean-up
code to forego the clean-up of the channel, resulting in its leak.
Since the channel ultimately owns the 'stream_fds' which represent the
shared memory files, those files (and associated memory) are also leaked
(they are closed during the execution of lttng_ustconsumer_del_channel()).
Solution
========
We simply remove the stream monitor mode override to leave it in its
appropriate state. The clean-up then proceeds normally, ensuring the
tracing buffers are properly reclaimed.
1. Performs the required changes to make the build system able to build
an HTML API documentation using Doxygen.
The way it's done is a replica of what does the Babeltrace 2 project,
which you may be familiar with.
`doc/api` is for all API documentation projects while
`doc/api/liblttng-ctl` is the specific liblttng-ctl API documentation
project.
To build and view the HTML API documentation:
a) Configure the project with the `--enable-api-doc` option.
b) Build and install the project.
c) Open
`$prefix/share/doc/lttng-tools/api/liblttng-ctl/html/index.html`,
where `$prefix` is the installation prefix (for example,
`/usr/local`).
2. Fully documents some modules while not documenting others at all.
Because some liblttng-ctl headers contain functions/types which
conceptually belong to more than one module (unlike in Babeltrace 2),
I decided to put all the Doxygen group (module) definitions and any
"extra" module documentation in `dox/groups.dox`. The latter is a
huge file of which most of the contents was copied from the
LTTng-tools manual pages (and from the online LTTng Documentation)
and adapted to the C API context.
Images are direct copies from the LTTng Documentation.
The complete module tree and its state, as of this patch, is as
follows, where ✅ means it's fully documented and ❌ means it's not
documented at all:
✅ Home page
✅ General API (error codes, session daemon connection,
common definitions)
Includes parts of `lttng.h`, `lttng-error.h`, and `constant.h`.
✅ Recording session API
Includes parts of `lttng.h`, `channel.h`, `handle.h`, `domain.h`, and `session.h`.
✅ Recording session descriptor API
Includes all `session-descriptor.h`.
✅ Recording session destruction handle API
Includes all `destruction-handle.h`.
✅ Domain and channel API
Includes parts of `channel.h`, `domain.h`, and `event.h`.
✅ Recording event rule API
Includes parts of `event.h`.
❌ Process attribute inclusion set API
Would include parts of `tracker.h`.
✅ Recording session clearing API
Includes all `clear.h` and `clear-handle.h`.
❌ Recording session snapshot API
Would include all `snapshot.h`.
❌ Recording session rotation API
Would include all `rotation.h` and `location.h`.
❌ Recording session saving and loading API
Would include all `save.h` and `load.h`.
✅ Instrumentation point listing API
Includes parts of `event.h`.
❌ Trigger API
Would include all `trigger/trigger.h`.
❌ Trigger condition API
Would include all `condition/buffer-usage.h`, `condition/condition.h`, `condition/evaluation.h`, `condition/session-consumed-size.h`, and `condition/session-rotation.h`.
❌ "Event rule matches" trigger condition API
Would include all `condition/event-rule-matches.h`.
❌ Event rule API
Would include all headers in `event-rule` as well as all `kernel-probe.h` and `userspace-probe.h`.
❌ Log level rule API
Would include all `log-level-rule.h`.
❌ Event expression API
Would include all `event-expr.h`.
❌ Event field value API
Would include all `event-field-value.h`.
❌ Trigger action API
Would include all `action/action.h`, `action/firing-policy.h`, `action/list.h`, `action/path.h`, `action/rate-policy.h`, `action/rotate-session.h`, `action/snapshot-session.h`, `action/start-session.h`, and `action/stop-session.h`.
❌ Notify trigger action API
Would include all `action/notify.h`, `notification/channel.h`, and `notification/notification.h`, as well as parts of `endpoint.h`.
❌ Error query API
Would include all `error-query.h`.
I'm voluntarily not documenting the health API (`health.h`), as I
believe it's not super important for most users. We could document it
on demand.
In `groups.dox`, the groups of the undocumented modules are already
defined, so that the complete tree above is visible in the rendered
"API reference" section. The undocumented modules simply show the
text "To be done". Because there are references to undocumented
modules in `groups.dox` and in the documented headers, this means
that the links at least resolve.
Note that there are non-comment changes in `include/lttng`: I needed
to name some anonymous, nested types so that I could reference their
members, as you can only link to the member of a named type with
Doxygen. For example, the type of the `u` union member of
`struct lttng_event_context` is now `union lttng_event_context_u`;
then you can reference its `probe` member as such:
`lttng_event_context::lttng_event_context_u::probe` (_not_
`lttng_event_context::u::probe`).
Fix: rotation-destroy-flush: fix session daemon abort if no kernel module present
Testing rotation-destroy-flush when no lttng kernel modules present, it
would be failed with error message:
Error: Unable to load required module lttng-ring-buffer-client-discard
not ok 1 - Start session daemon
Failed test 'Start session daemon'
not ok 2 - Create session rotation_destroy_flush in -o /tmp/tmp.test_rot ...
...
This because test script that sets the LTTNG_ABORT_ON_ERROR environment
variable. It's this environment variable that causes the sessiond to
handle the kernel module loading failure as an abort rather than a
warning.
Using "check_skip_kernel_test" to detect whether the kernel module fails
to load is expected or not. If the failure is expected, the script won't
set that environment variable any more.
Fixes: 3a174400
("tests:add check_skip_kernel_test to check root user and lttng kernel modules")
fbd566c...
by
=?utf-8?q?J=C3=A9r=C3=A9mie_Galarneau?= <email address hidden>
Fix: consumerd: type confusion in lttng_consumer_send_error
lttng_consumer_send_error sends an lttcomm_return_code to the session
daemon. However, the size of lttcomm_sessiond_command was used.
This was probably missed since the function accepts an integer instead
of a proper enum type.
The size accepted by the function is changed to use lttcomm_return_code
and the size of a fixed-size type is used to send the error code to the
session daemon.
77c8b54...
by
=?utf-8?q?J=C3=A9r=C3=A9mie_Galarneau?= <email address hidden>
Clean-up: lttng: utils: missing special member functions
clang-tidy warns:
warning: class 'session_list' defines a move constructor but does not define a destructor, a copy constructor, a copy assignment operator or a move assignment operator [cppcoreguidelines-special-member-functions]
This warning related to the "Rule of Five":
If a class requires a custom destructor, copy constructor, or copy
assignment operator due to manual resource management, it likely needs
to explicitly define all five (including move constructor and move
assignment operator) to correctly manage the resources across all types
of object copying and moving scenarios. This rule helps prevent resource
leaks, double frees, and other common issues related to resource
management.