lttng-tools:master

Last commit made on 2024-04-24
Get this branch:
git clone -b master https://git.launchpad.net/lttng-tools

Branch information

Name:
master
Repository:
lp:lttng-tools

Recent commits

9a28bc0... by Kienan Stewart <email address hidden>

Tests: Add test to check shared-memory FD leaks after relayd dies

Refs: https://bugs.lttng.org/issues/1411

Change-Id: I9804011320c28a9867af1fdc6a8d82ad0671fe3d
Signed-off-by: Kienan Stewart <email address hidden>
Signed-off-by: Jérémie Galarneau <email address hidden>

e793ddc... by Jérémie Galarneau <email address hidden>

Fix: consumerd: leak of tracing buffers on relayd connectivity issue

Observed issue
==============

A leak of the tracing buffers can be noticed when the relay daemon is
terminated following the creation of a live session, but prior to the
initiation of any applications.

The issue can be reproduced with the following steps:
  # Create a live session
  $ lttng create --live

  # Kill the relay daemon before the allocation of the buffers
  $ killall lttng-relayd
  $ lttng enable-event --userspace --all
  $ lttng start

  # Launch an instrumented application
  $ ./my_app

  # Destroy all sessions
  $ lttng destroy --all

  # List the open file descriptors of the lttng-consumerd process
  # and notice how the tracing buffers are still visible.
  $ ls -lah /proc/$pid_of_lttng_consumerd/fd

[...]
lrwx------ 1 root root 64 Mar 19 19:50 987 -> '/dev/shm/shm-ust-consumer-358446 (deleted)'
lrwx------ 1 root root 64 Mar 19 19:50 988 -> '/dev/shm/shm-ust-consumer-358446 (deleted)'
lrwx------ 1 root root 64 Mar 19 19:50 989 -> '/dev/shm/shm-ust-consumer-358446 (deleted)'
[...]

Cause
=====

The consumer daemon allocates recording channels and their tracing
buffers in a two-step process.

First, the session daemon emits an `ASK_CHANNEL_CREATION` command, which
results in the allocation of the internal consumer channel structures
and of the actual tracing buffers. The channel's unique key is returned
to the session daemon.

After this command, the channel temporarily holds a list of streams
which are waiting to be sent to the session and relay daemons as a
result of the `GET_CHANNEL` command.

At this moment, the channel's reference count is one more than the number
of streams: each stream holds a back-reference to its parent channel and
the session daemon holds a global reference.

The session daemon uses the key it received to emit the `GET_CHANNEL`
command. When executing this command, the consumer daemon attempts to
send the streams to the relay daemon.

On failure to do so, the session daemon is informed of the connection
error. The consumer daemon then omits a step of the command: the streams
are never handed off from the channel's internal list to the
consumption/monitoring thread. This hand-off is what is internally
referred to as making the streams "globally visible".

The session daemon, upon receiving the failure error code of the
GET_CHANNEL command, tears down its internal ust_app channel structures.
As part of that process, it emits the `DESTROY_CHANNEL` command to
reclaim the channel on the consumer daemon's end. This command is
deferred to the channel poll thread as the `CHANNEL_DEL`
internal command.

As part of this internal command, the channel poll thread walks the
channel's stream list to clean up any streams that are not "globally
visible": all of them, in our case.

Then, the session daemon's global reference is released, which should
normally result in the reclamation of the channel itself.

While reproducing the problem, we noted that the channel wasn't reclaimed
and that its reference count matched the number of CPUs on the system at
the time the `CHANNEL_DEL` command completed.

This hinted at the streams holding a reference to the channel even after
the completion of the reclamation command.

Looking at clean_channel_stream_list(), which cleans up the channel's
temporary stream list, we note that the streams' monitor property is
overridden to `false` just before the call to consumer_stream_destroy().

This is strange and a comment (added as part of 212d67a2 in 2014) hints
at a locking problem that was being worked-around. In all likelihood,
this no longer applies as the locking strategies used have evolved quite
a bit since then.

Still, setting the monitor property to `false` is problematic as, in
that mode (say, channels that are used to record a snapshot), streams do
not hold a reference to their parent channel. This causes the clean-up
code to forgo the clean-up of the channel, resulting in its leak.

Since the channel ultimately owns the 'stream_fds' which represent the
shared memory files, those files (and associated memory) are also leaked
(they are closed during the execution of lttng_ustconsumer_del_channel()).

Solution
========

We simply remove the stream monitor mode override to leave it in its
appropriate state. The clean-up then proceeds normally, ensuring the
tracing buffers are properly reclaimed.
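
The reference-counting behaviour described in this message can be
illustrated with a toy model. This is only a sketch: the `channel` and
`stream` types and the `tear_down()` helper below are hypothetical and
do not mirror the actual lttng-consumerd structures. It assumes that a
monitored stream holds one channel reference and releases it on
destruction, which is the behaviour the message describes.

```cpp
#include <cassert>
#include <vector>

// Toy model only: hypothetical types, not the real consumerd structures.
struct channel {
    int refcount = 1; // Global reference held by the session daemon.
    bool reclaimed = false;

    void put()
    {
        assert(refcount > 0);
        if (--refcount == 0) {
            // In the real daemon, the shared-memory FDs are closed here.
            reclaimed = true;
        }
    }
};

struct stream {
    channel *chan;
    bool monitor;
};

// A monitored stream takes a back-reference to its parent channel.
stream make_monitored_stream(channel& chan)
{
    chan.refcount++;
    return stream{ &chan, true };
}

// Only monitored streams release a channel reference when destroyed.
void destroy_stream(stream& s)
{
    if (s.monitor) {
        s.chan->put();
    }
}

// Returns whether the channel is reclaimed once clean-up completes.
bool tear_down(unsigned int stream_count, bool override_monitor)
{
    channel chan;
    std::vector<stream> streams;

    for (unsigned int i = 0; i < stream_count; i++) {
        streams.push_back(make_monitored_stream(chan));
    }

    for (auto& s : streams) {
        if (override_monitor) {
            s.monitor = false; // The override that the fix removes.
        }

        destroy_stream(s);
    }

    chan.put(); // The session daemon releases its global reference.
    return chan.reclaimed;
}
```

With the override in place, `tear_down(4, true)` leaves the channel
unreclaimed with one outstanding reference per stream, which matches
the per-CPU reference count observed above. Without it,
`tear_down(4, false)` reclaims the channel.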

Known drawbacks
===============

None.

Fixes #1411

Signed-off-by: Jérémie Galarneau <email address hidden>
Change-Id: I4a2fb8cddd2f9da9a2c9df19ba36229627ad2569

bc13dc0... by Kienan Stewart <email address hidden>

docs: Add supported versions and fix-backport policy

Change-Id: Idb22c6487e2397b807c5d1b78acbc2adb03be363
Signed-off-by: Kienan Stewart <email address hidden>
Signed-off-by: Jérémie Galarneau <email address hidden>

048f01e... by .eepp

docs: Partially document the liblttng-ctl C API

This patch:

1. Performs the required changes to make the build system able to build
   an HTML API documentation using Doxygen.

   This replicates what the Babeltrace 2 project does, which you may be
   familiar with.

   `doc/api` is for all API documentation projects while
   `doc/api/liblttng-ctl` is the specific liblttng-ctl API documentation
   project.

   To build and view the HTML API documentation:

   a) Configure the project with the `--enable-api-doc` option.

   b) Build and install the project.

   c) Open
      `$prefix/share/doc/lttng-tools/api/liblttng-ctl/html/index.html`,
      where `$prefix` is the installation prefix (for example,
      `/usr/local`).

2. Fully documents some modules while not documenting others at all.

   Because some liblttng-ctl headers contain functions/types which
   conceptually belong to more than one module (unlike in Babeltrace 2),
   I decided to put all the Doxygen group (module) definitions and any
   "extra" module documentation in `dox/groups.dox`. The latter is a
   huge file, most of whose contents were copied from the
   LTTng-tools manual pages (and from the online LTTng Documentation)
   and adapted to the C API context.

   Images are direct copies from the LTTng Documentation.

   The complete module tree and its state, as of this patch, is as
   follows, where ✅ means it's fully documented and ❌ means it's not
   documented at all:

       ✅ Home page

       ✅ General API (error codes, session daemon connection,
          common definitions)

          Includes parts of `lttng.h`, `lttng-error.h`, and
          `constant.h`.

       ✅ Recording session API

          Includes parts of `lttng.h`, `channel.h`, `handle.h`,
          `domain.h`, and `session.h`.

          ✅ Recording session descriptor API

             Includes all `session-descriptor.h`.

          ✅ Recording session destruction handle API

             Includes all `destruction-handle.h`.

          ✅ Domain and channel API

             Includes parts of `channel.h`, `domain.h`, and `event.h`.

             ✅ Recording event rule API

                Includes parts of `event.h`.

          ❌ Process attribute inclusion set API

             Would include parts of `tracker.h`.

          ✅ Recording session clearing API

             Includes all `clear.h` and `clear-handle.h`.

          ❌ Recording session snapshot API

             Would include all `snapshot.h`.

          ❌ Recording session rotation API

             Would include all `rotation.h` and `location.h`.

          ❌ Recording session saving and loading API

             Would include all `save.h` and `load.h`.

       ✅ Instrumentation point listing API

          Includes parts of `event.h`.

       ❌ Trigger API

          Would include all `trigger/trigger.h`.

          ❌ Trigger condition API

             Would include all `condition/buffer-usage.h`,
             `condition/condition.h`, `condition/evaluation.h`,
             `condition/session-consumed-size.h`, and
             `condition/session-rotation.h`.

             ❌ "Event rule matches" trigger condition API

                Would include all `condition/event-rule-matches.h`.

                ❌ Event rule API

                   Would include all headers in `event-rule` as well
                   as all `kernel-probe.h` and `userspace-probe.h`.

                   ❌ Log level rule API

                      Would include all `log-level-rule.h`.

                ❌ Event expression API

                   Would include all `event-expr.h`.

                ❌ Event field value API

                   Would include all `event-field-value.h`.

          ❌ Trigger action API

             Would include all `action/action.h`,
             `action/firing-policy.h`, `action/list.h`, `action/path.h`,
             `action/rate-policy.h`, `action/rotate-session.h`,
             `action/snapshot-session.h`, `action/start-session.h`, and
             `action/stop-session.h`.

             ❌ Notify trigger action API

                Would include all `action/notify.h`,
                `notification/channel.h`, and
                `notification/notification.h`, as well as parts of
                `endpoint.h`.

       ❌ Error query API

          Would include all `error-query.h`.

   I'm deliberately not documenting the health API (`health.h`), as I
   believe it's not very important for most users. We could document it
   on demand.

   In `groups.dox`, the groups of the undocumented modules are already
   defined, so that the complete tree above is visible in the rendered
   "API reference" section. The undocumented modules simply show the
   text "To be done". Because there are references to undocumented
   modules in `groups.dox` and in the documented headers, this means
   that the links at least resolve.

   Note that there are non-comment changes in `include/lttng`: I needed
   to name some anonymous, nested types so that I could reference their
   members, as you can only link to the member of a named type with
   Doxygen. For example, the type of the `u` union member of
   `struct lttng_event_context` is now `union lttng_event_context_u`;
   then you can reference its `probe` member as such:
   `lttng_event_context::lttng_event_context_u::probe` (_not_
   `lttng_event_context::u::probe`).
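
The naming change described above can be sketched as follows. The
`example_context` and `example_probe` types are hypothetical,
simplified stand-ins and are not the real `struct lttng_event_context`
definition. The sketch only shows how naming a nested union gives its
members a qualified name that a documentation link can target.

```cpp
#include <cassert>

// Hypothetical, simplified stand-ins for illustration only.
struct example_probe {
    int id;
};

struct example_context {
    int kind; // 0 means the `probe` member of `u` is in use.

    // Because the union type is named, its members can be referenced as
    // `example_context::example_context_u::probe`.
    union example_context_u {
        example_probe probe;
        long raw;
    } u;
};

// Reads the probe identifier of a probe-kind context.
int probe_id(const example_context& ctx)
{
    assert(ctx.kind == 0);
    return ctx.u.probe.id;
}
```

Existing code that accesses the member through `u` keeps compiling,
since only the union's type gains a name.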

Signed-off-by: Philippe Proulx <email address hidden>
Signed-off-by: Jérémie Galarneau <email address hidden>
Change-Id: I2783419159f4892a992fe5bc760b6e2cd6d13a60

78f5b22... by Xiangyu Chen <email address hidden>

Fix: rotation-destroy-flush: fix session daemon abort if no kernel module present

When rotation-destroy-flush is run with no LTTng kernel modules present,
it fails with the following error message:

  Error: Unable to load required module lttng-ring-buffer-client-discard
  not ok 1 - Start session daemon
  Failed test 'Start session daemon'
  not ok 2 - Create session rotation_destroy_flush in -o /tmp/tmp.test_rot ...
  ...

This happens because the test script sets the LTTNG_ABORT_ON_ERROR
environment variable. This variable causes the sessiond to handle the
kernel module loading failure as an abort rather than a warning.

Use "check_skip_kernel_test" to detect whether the kernel module loading
failure is expected. If the failure is expected, the script no longer
sets that environment variable.

Fixes: 3a174400
("tests:add check_skip_kernel_test to check root user and lttng kernel modules")

Change-Id: I371e9ba717613e2940186f710cf3cccd35baed6c
Signed-off-by: Xiangyu Chen <email address hidden>
Signed-off-by: Jérémie Galarneau <email address hidden>

9fed015... by Jérémie Galarneau <email address hidden>

Fix: consumerd: wrong timer mentioned in error logging

As its name indicates, consumer_timer_monitor_stop() stops the _monitor_
timer, not the live timer. This is most likely a copy-paste error.

The error logging is fixed to mention the appropriate timer.

Change-Id: I418580d8928752a0702d522e3ca74fe54cbe6f8f
Signed-off-by: Jérémie Galarneau <email address hidden>

fbd566c... by Jérémie Galarneau <email address hidden>

Fix: consumerd: type confusion in lttng_consumer_send_error

lttng_consumer_send_error sends an lttcomm_return_code to the session
daemon. However, the size of lttcomm_sessiond_command was used.

This was probably missed since the function accepts an integer instead
of a proper enum type.

The type accepted by the function is changed to lttcomm_return_code
and the size of a fixed-size type is used to send the error code to the
session daemon.
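
The idea of sending an error code with the size of a fixed-size type
can be sketched as follows. This is an illustration under assumptions:
the `return_code` enumeration, its values, the choice of
`std::int32_t`, and both helper functions are hypothetical and are not
the lttng-tools definitions.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for the real return code enumeration.
enum class return_code : std::int32_t {
    ok = 10,
    relayd_error = 11,
};

// Serializes the code through a fixed-size integer so that the payload
// size does not depend on the size of an unrelated type.
std::vector<std::uint8_t> serialize_return_code(return_code code)
{
    const auto wire_value = static_cast<std::int32_t>(code);
    std::vector<std::uint8_t> payload(sizeof(wire_value));

    std::memcpy(payload.data(), &wire_value, sizeof(wire_value));
    return payload;
}

// Reads the code back, checking that the payload has the agreed size.
return_code deserialize_return_code(const std::vector<std::uint8_t>& payload)
{
    std::int32_t wire_value;

    assert(payload.size() == sizeof(wire_value));
    std::memcpy(&wire_value, payload.data(), sizeof(wire_value));
    return static_cast<return_code>(wire_value);
}
```

Taking a typed enumeration rather than a plain integer lets the
compiler reject a value of the wrong enumeration at the call site.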

Signed-off-by: Jérémie Galarneau <email address hidden>
Change-Id: I318e6a8d145373779d11557a70e43abca9783e5c

c91ccad... by Jérémie Galarneau <email address hidden>

scope-exit: Clarify scope_exit noexcept requirement
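
A noexcept requirement on a scope-exit helper can be sketched as
follows. This is a minimal, hypothetical implementation written for
illustration and may differ from the lttng-tools `scope_exit`; the
`make_scope_exit()` and `run_with_cleanup()` names are assumptions. The
callback runs from a destructor, so the sketch rejects callbacks that
may throw at compile time.

```cpp
#include <cassert>
#include <utility>

// Hypothetical sketch; not the lttng-tools implementation.
template <typename Callback>
class scope_exit {
public:
    // The callback is invoked from the destructor, so it must not throw.
    static_assert(noexcept(std::declval<Callback&>()()),
                  "scope_exit requires a noexcept callback");

    explicit scope_exit(Callback callback) : _callback(std::move(callback))
    {
    }

    scope_exit(scope_exit&& other) noexcept :
        _callback(std::move(other._callback)), _armed(other._armed)
    {
        other._armed = false;
    }

    scope_exit(const scope_exit&) = delete;
    scope_exit& operator=(const scope_exit&) = delete;
    scope_exit& operator=(scope_exit&&) = delete;

    // Cancels the callback, for example once an operation has succeeded.
    void disarm() noexcept
    {
        _armed = false;
    }

    ~scope_exit()
    {
        if (_armed) {
            _callback();
        }
    }

private:
    Callback _callback;
    bool _armed = true;
};

template <typename Callback>
scope_exit<Callback> make_scope_exit(Callback callback)
{
    return scope_exit<Callback>(std::move(callback));
}

// Returns the number of clean-ups that ran when the guard left scope.
int run_with_cleanup()
{
    int cleanups = 0;

    {
        const auto guard = make_scope_exit([&cleanups]() noexcept { cleanups++; });
        assert(cleanups == 0); // The callback has not run yet.
    }

    return cleanups;
}
```

Passing a lambda that is not marked `noexcept` to `make_scope_exit()`
fails the `static_assert` in this sketch.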

Signed-off-by: Jérémie Galarneau <email address hidden>
Change-Id: Iec34c435327e63e046319fa12f78a74ec50f4163

77c8b54... by Jérémie Galarneau <email address hidden>

Clean-up: lttng: utils: missing special member functions

clang-tidy warns:
  warning: class 'session_list' defines a move constructor but does not define a destructor, a copy constructor, a copy assignment operator or a move assignment operator [cppcoreguidelines-special-member-functions]

This warning is related to the "Rule of Five":

If a class requires a custom destructor, copy constructor, or copy
assignment operator due to manual resource management, it likely needs
to explicitly define all five (including move constructor and move
assignment operator) to correctly manage the resources across all types
of object copying and moving scenarios. This rule helps prevent resource
leaks, double frees, and other common issues related to resource
management.
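
As an illustration of the rule, here is a sketch of a class that
manages memory manually and defines all five special member functions.
The `int_buffer` class is hypothetical and unrelated to the
`session_list` class named in the warning.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical resource-owning class defining all five special members.
class int_buffer {
public:
    explicit int_buffer(std::size_t size) : _data(new int[size]()), _size(size)
    {
    }

    // 1. Destructor: releases the owned array.
    ~int_buffer()
    {
        delete[] _data;
    }

    // 2. Copy constructor: performs a deep copy.
    int_buffer(const int_buffer& other) :
        _data(new int[other._size]), _size(other._size)
    {
        for (std::size_t i = 0; i < _size; i++) {
            _data[i] = other._data[i];
        }
    }

    // 3. Move constructor: steals the array and leaves `other` empty.
    int_buffer(int_buffer&& other) noexcept : _data(other._data), _size(other._size)
    {
        other._data = nullptr;
        other._size = 0;
    }

    // 4. Copy assignment: copy, then swap (safe on self-assignment).
    int_buffer& operator=(const int_buffer& other)
    {
        int_buffer copy(other);

        swap(copy);
        return *this;
    }

    // 5. Move assignment: swap, so `other` frees the previous array.
    int_buffer& operator=(int_buffer&& other) noexcept
    {
        swap(other);
        return *this;
    }

    std::size_t size() const noexcept
    {
        return _size;
    }

    int& at(std::size_t index)
    {
        assert(index < _size);
        return _data[index];
    }

private:
    void swap(int_buffer& other) noexcept
    {
        std::swap(_data, other._data);
        std::swap(_size, other._size);
    }

    int *_data;
    std::size_t _size;
};
```

A class that holds its resources in members such as `std::vector` or
`std::unique_ptr` can usually default or omit all five instead.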

Signed-off-by: Jérémie Galarneau <email address hidden>
Change-Id: I970cd1ab905eb877241f7e559b47349b9371f261

13d03b1... by Jérémie Galarneau <email address hidden>

Clean-up: clang-tidy: missing headers prevent analysis

clang-tidy complains that some headers omit the inclusion of their
dependencies, which prevents the analysis from completing.

Signed-off-by: Jérémie Galarneau <email address hidden>
Change-Id: Ic6d51e82c5f5536c0d421c38a97afddbe64a16ef