lttng-modules:stable-2.0

Last commit made on 2013-07-11
Get this branch:
git clone -b stable-2.0 https://git.launchpad.net/lttng-modules

Branch information

Name:
stable-2.0
Repository:
lp:lttng-modules

Recent commits

01ea6a7... by Mathieu Desnoyers

Version 2.0.8

Signed-off-by: Mathieu Desnoyers <email address hidden>

5a0629a... by Mathieu Desnoyers

Fix: ring buffer: get_subbuf() checks should be performed on "consumed" parameter

The incorrect checks trigger many false-positive -EAGAIN errors in flight
recorder snapshots.

Reported-by: Julien Desfossez <email address hidden>
Signed-off-by: Mathieu Desnoyers <email address hidden>

d99e79f... by Mathieu Desnoyers

Fix: SWITCH_FLUSH new sub-buffer checks

The SWITCH_FLUSH, when performed on a completely empty sub-buffer, was
missing some checks (imported from space reservation).

Signed-off-by: Mathieu Desnoyers <email address hidden>

0b38773... by Mathieu Desnoyers

Fix: ring buffer: handle concurrent update in nested buffer wrap around check

With stress-test loads that trigger sub-buffer switches very frequently
(small 4kB sub-buffers, frequent flush), we currently observe this kind
of warning once every few minutes:

[65335.896208] ring buffer relay-overwrite-mmap, cpu 5: records were lost. Caused by:
[65335.896208] [ 0 buffer full, 1 nest buffer wrap-around, 0 event too big ]

It appears that the check for nested buffer wrap-around does not take
into account that concurrent execution contexts (either nested for
per-cpu buffers, or from another CPU or nested for global buffers) can
update the commit_count value concurrently.

What we really want to do with this check is to ensure that if we enter
a sub-buffer that had an unbalanced reserve/commit count, assuming there
is no hope that this gets rebalanced promptly, we detect this and drop
the current event. However, in the case where the commit counter has
been concurrently updated by another reserve or a switch, we want to
retry the entire reserve operation.

One way to detect this is to sample the reserve offset twice, around the
commit counter read, along with the appropriate memory barriers.
Therefore, we can detect if the mismatch between reserve and commit
counter is actually caused by a concurrent update, which necessarily has
updated the reserve counter.
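
As an illustration only (this is not the code from the commit), the
double-sampling idea can be sketched in user-space C11. The struct, the
enum and the function name are hypothetical, and sequentially consistent
atomic loads stand in for the kernel memory barriers:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical outcome codes; these names do not come from lttng-modules. */
enum wrap_check { WRAP_OK, WRAP_RETRY, WRAP_DROP };

/* Hypothetical, heavily simplified view of the per-buffer counters. */
struct rb_model {
	atomic_ulong reserve_offset;	/* moved forward by every reserve or switch */
	atomic_ulong commit_count;	/* commit counter of the sub-buffer being entered */
};

/*
 * old_offset is the first sample of the reserve offset, taken by the caller
 * at the start of the reserve attempt. expected_commit is the commit count a
 * balanced sub-buffer would show. The default atomic_load() is sequentially
 * consistent, which stands in for explicit barriers in this sketch.
 */
static enum wrap_check nested_wrap_check(struct rb_model *buf,
					 unsigned long old_offset,
					 unsigned long expected_commit)
{
	unsigned long commit, new_offset;

	assert(buf != NULL);
	commit = atomic_load(&buf->commit_count);
	new_offset = atomic_load(&buf->reserve_offset);	/* second sample */

	if (commit == expected_commit)
		return WRAP_OK;		/* reserve/commit balanced */
	if (new_offset != old_offset)
		return WRAP_RETRY;	/* concurrent update: redo the reserve */
	return WRAP_DROP;		/* genuinely unbalanced: drop the event */
}
```

A caller would loop on WRAP_RETRY, resampling the reserve offset each
time, and would only account a lost record on WRAP_DROP.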

Signed-off-by: Mathieu Desnoyers <email address hidden>

ee14661... by Mathieu Desnoyers

Fix: handle writes of length 0

lib_ring_buffer_write(), lib_ring_buffer_memset() and
lib_ring_buffer_copy_from_user_inatomic() could be passed a length of 0.
This typically has no side-effect as far as writing into the buffers is
concerned, except for one detail: in overwrite mode, there is a check to
make sure the sub-buffer can be written into. This check is performed
even if length is 0. In the case where this would fall exactly at the
end of a sub-buffer, the check would fail, because the offset would fall
exactly at the beginning of the next sub-buffer.

It triggers this warning:

[65356.890016] ------------[ cut here ]------------
[65356.890016] WARNING: at /home/compudj/git/lttng-modules/wrapper/ringbuffer/../../lib/ringbuffer/../../wrapper/ringbuffer/../../lib/ringbuffer/backend.h:110 lttng_event_write+0x118/0x140 [lttng_ring_buffer_client_mmap_overwrite]()
[65356.890016] Hardware name: X7DAL
[65356.890016] Modules linked in: lttng_probe_writeback(O) lttng_probe_workqueue(O) lttng_probe_vmscan(O) lttng_probe_udp(O) lttng_probe_timer(O) lttng_probe_sunrpc(O) lttng_probe_statedump(O) lttng_probe_sock(O) lttng_probe_skb(O) lttng_probe_signal(O) lttng_probe_scsi(O) lttng_probe_sched(O) lttng_probe_rcu(O) lttng_probe_random(O) lttng_probe_printk(O) lttng_probe_power(O) lttng_probe_net(O) lttng_probe_napi(O) lttng_probe_module(O) lttng_probe_kvm(O) lttng_probe_kmem(O) lttng_probe_jbd2(O) lttng_probe_jbd(O) lttng_probe_irq(O) lttng_probe_ext4(O) lttng_probe_ext3(O) lttng_probe_compaction(O) lttng_probe_btrfs(O) lttng_probe_block(O) lttng_types(O) lttng_ring_buffer_metadata_mmap_client(O) lttng_ring_buffer_client_mmap_overwrite(O) lttng_ring_buffer_client_mmap_discard(O) lttng_ring_buffer_metadata_client(O) lttng_ring_buffer_client_overwrite(O) lttng_ring_buffer_client_discard(O) lttng_tracer(O) lttng_kretprobes(O) lttng_ftrace(O) lttng_kprobes(O) lttng_statedump(O) lttng_lib_ring_buffer(O) cpufreq_ondemand loop e1000e kvm_intel kvm ptp pps_core [last unloaded: lttng_lib_ring_buffer]
[65357.287529] Pid: 0, comm: swapper/7 Tainted: G O 3.9.4-trace-test #143
[65357.309694] Call Trace:
[65357.317022] <IRQ> [<ffffffff8103a3ef>] warn_slowpath_common+0x7f/0xc0
[65357.336893] [<ffffffff8103a44a>] warn_slowpath_null+0x1a/0x20
[65357.354368] [<ffffffffa0ff17b8>] lttng_event_write+0x118/0x140 [lttng_ring_buffer_client_mmap_overwrite]
[65357.383025] [<ffffffffa100134f>] __event_probe__block_rq_with_error+0x1bf/0x220 [lttng_probe_block]
[65357.410376] [<ffffffff812ea134>] blk_update_request+0x324/0x720
[65357.428364] [<ffffffff812ea561>] blk_update_bidi_request+0x31/0x90
[65357.447136] [<ffffffff812eb68c>] blk_end_bidi_request+0x2c/0x80
[65357.465127] [<ffffffff812eb6f0>] blk_end_request+0x10/0x20
[65357.481822] [<ffffffff81406b7c>] scsi_io_completion+0x9c/0x670
[65357.499555] [<ffffffff813fe320>] scsi_finish_command+0xb0/0xe0
[65357.517283] [<ffffffff81406965>] scsi_softirq_done+0xa5/0x140
[65357.534758] [<ffffffff812f1d30>] blk_done_softirq+0x80/0xa0
[65357.551710] [<ffffffff81043b00>] __do_softirq+0xe0/0x440
[65357.567881] [<ffffffff81043ffe>] irq_exit+0x9e/0xb0
[65357.582754] [<ffffffff81026465>] smp_call_function_single_interrupt+0x35/0x40
[65357.604388] [<ffffffff8167be2f>] call_function_single_interrupt+0x6f/0x80
[65357.624976] <EOI> [<ffffffff8100ac06>] ? default_idle+0x46/0x300
[65357.643541] [<ffffffff8100ac04>] ? default_idle+0x44/0x300
[65357.660235] [<ffffffff8100b899>] cpu_idle+0x89/0xe0
[65357.675109] [<ffffffff81664911>] start_secondary+0x220/0x227

The warning always comes from an event that can write a 0-length field as
the last field of its payload, and it always happens directly on a
sub-buffer boundary.

While we are there, check for length 0 in lib_ring_buffer_read_cstr()
too.
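
A minimal user-space sketch of the guard, assuming a toy backend with two
8-byte sub-buffers. The struct, the sizes and model_write() are
hypothetical; a warning counter stands in for the kernel warning:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SUBBUF_SIZE	8	/* hypothetical sub-buffer size */
#define NR_SUBBUF	2	/* hypothetical sub-buffer count */

/* Hypothetical toy backend; not the lttng-modules data structure. */
struct backend_model {
	char data[SUBBUF_SIZE * NR_SUBBUF];
	size_t owned_subbuf;	/* sub-buffer the writer has exclusive access to */
	unsigned int warnings;	/* counts failed exclusive-access checks */
};

static void model_write(struct backend_model *b, size_t offset,
			const void *src, size_t len)
{
	assert(offset + len <= sizeof(b->data));

	/*
	 * The guard: a 0-length write touches no byte, so it must not run
	 * the exclusive-access check. At the end of a sub-buffer, offset
	 * already points at the next sub-buffer and the check would fail.
	 */
	if (len == 0)
		return;

	if (offset / SUBBUF_SIZE != b->owned_subbuf) {
		b->warnings++;	/* no exclusive access to this sub-buffer */
		return;
	}
	memcpy(&b->data[offset], src, len);
}
```

Without the early return, the call model_write(&b, 8, "", 0) on a writer
owning sub-buffer 0 would increment the warning counter even though it
writes nothing.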

Signed-off-by: Mathieu Desnoyers <email address hidden>

18a74f1... by Mathieu Desnoyers

Fix: ring buffer: RING_BUFFER_FLUSH ioctl buffer corruption

lib_ring_buffer_switch_slow() clearly states:

 * Note, however, that as a v_cmpxchg is used for some atomic
 * operations, this function must be called from the CPU which owns the
 * buffer for a ACTIVE flush.

But unfortunately, the RING_BUFFER_FLUSH ioctl does not follow these
important directives. Therefore, whenever the consumer daemon or session
daemon explicitly triggers a "flush" on a buffer, it can race with data
being written to the buffer, leading to corruption of the reserve/commit
counters, and therefore corruption of data in the buffer. It triggers
these warnings for overwrite mode buffers:

[65356.890016] WARNING: at
/home/compudj/git/lttng-modules/wrapper/ringbuffer/../../lib/ringbuffer/../../wrapper/ringbuffer/../../lib/ringbuffer/backend.h:110 lttng_event_write+0x118/0x140 [lttng_ring_buffer_client_mmap_overwrite]()

This indicates that we are trying to write into a sub-buffer for which
we don't have exclusive access. It also causes the following warnings to
show up:

[65335.896208] ring buffer relay-overwrite-mmap, cpu 5: records were lost. Caused by:
[65335.896208] [ 0 buffer full, 80910 nest buffer wrap-around, 0 event too big ]

This is caused by a corrupted commit counter.

Fix this by sending an IPI to the CPU owning the flushed buffer for
per-cpu synchronization. For global synchronization, no IPI is needed,
since we allow writes from remote CPUs.

Signed-off-by: Mathieu Desnoyers <email address hidden>

a94a9ec... by Mathieu Desnoyers

Version 2.0.7

Signed-off-by: Mathieu Desnoyers <email address hidden>

d25a4ce... by Samuel Martin <email address hidden>

Fix build and load against linux-2.6.33.x

* lttng-event.h declared but did not implement
  lttng_add_perf_counter_to_ctx on kernel >=2.6.33, the implementation
  was in lttng-context-perf-counters.c, which was only included for
  kernel >=2.6.34. This prevented the module from being loaded.

* on kernel 2.6.33.x, lttng-context-perf-counters.c complains about
  implicit declaration for {get,put}_online_cpus and
  {,un}register_cpu_notifier; so fix header inclusion.

Signed-off-by: Samuel Martin <email address hidden>
Signed-off-by: Mathieu Desnoyers <email address hidden>

6574f5b... by Simon Marchi

Fix check in lttng_strlen_user_inatomic

__copy_from_user_inatomic returns the number of bytes that could not be
copied, not an error code. This fixes the test accordingly.

[ Edit by Mathieu Desnoyers: change "ret" type to unsigned long too. ]
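
To illustrate the return-value convention, here is a user-space sketch.
mock_copy_from_user() is a hypothetical stand-in for
__copy_from_user_inatomic(), and strlen_user_model() is an assumed
simplification of the fixed loop, not the actual function:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for __copy_from_user_inatomic(): bytes at or past
 * readable_end "fault". Returns the number of bytes NOT copied.
 */
static unsigned long mock_copy_from_user(void *dst, const char *src,
					 unsigned long n,
					 const char *readable_end)
{
	unsigned long avail = src < readable_end ?
			(unsigned long)(readable_end - src) : 0;
	unsigned long copied = n < avail ? n : avail;

	memcpy(dst, src, copied);
	return n - copied;
}

/* Counts readable bytes up to and including the terminating '\0'. */
static long strlen_user_model(const char *src, const char *readable_end)
{
	long count = 0;

	assert(src != NULL);
	for (;;) {
		char c;
		unsigned long ret;	/* bytes left uncopied, never negative */

		ret = mock_copy_from_user(&c, src, sizeof(c), readable_end);
		if (ret != 0)	/* a "ret < 0" test could never detect a fault */
			break;
		count++;
		if (c == '\0')
			break;
		src++;
	}
	return count;
}
```

With ret declared as an unsigned long, a test such as "ret < 0" is always
false, which is why the comparison has to be against zero.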

Signed-off-by: Simon Marchi <email address hidden>
Signed-off-by: Mathieu Desnoyers <email address hidden>

2658640... by Mathieu Desnoyers

Fix: statedump hang/too early completion due to logic error

The previous "Fix: statedump hang due to incorrect wait/wakeup use" was
not actually fixing the real problem.

The issue is that we should pass the expected condition to wait_event()
rather than its negation.

This bug has been sitting there for a while. I suspect that a recent
change in the Linux scheduler behavior for newly spawned worker threads
might have contributed to trigger the hang more reliably.

The effects of this bug are:
- possible hang of the lttng-sessiond (within the kernel) at tracing
  start,
- the statedump end event is traced before all worker threads have
  actually completed, which can confuse LTTng viewer state systems.
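
Both effects can be reproduced with a single-threaded model of
wait_event(), which returns once its condition is true and sleeps
otherwise. Everything below is a hypothetical sketch: "sleeping" is
modelled by letting one worker finish, and max_sleeps bounds what would
otherwise be a hang:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model state: worker threads that have not finished yet. */
struct statedump_model {
	int pending_workers;
};

typedef bool (*cond_fn)(const struct statedump_model *);

/* The expected condition: every worker has completed. */
static bool all_done(const struct statedump_model *m)
{
	return m->pending_workers == 0;
}

/* The negated condition, as in the logic error described above. */
static bool not_done(const struct statedump_model *m)
{
	return m->pending_workers != 0;
}

/*
 * Model of wait_event(): returns the number of sleeps once cond is true,
 * or -1 if cond is still false after max_sleeps sleeps (a hang).
 */
static int model_wait_event(struct statedump_model *m, cond_fn cond,
			    int max_sleeps)
{
	int sleeps = 0;

	assert(m != NULL);
	while (!cond(m)) {
		if (sleeps++ == max_sleeps)
			return -1;
		if (m->pending_workers > 0)
			m->pending_workers--;	/* one worker completes */
	}
	return sleeps;
}
```

With three pending workers, all_done returns after three sleeps, while
not_done returns at once with all three still pending (completion
reported too early). With no pending worker, not_done never becomes true
and the model reports a hang.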

Reported-by: Phil Wilshire <email address hidden>
Signed-off-by: Mathieu Desnoyers <email address hidden>