aba1862...
by
Balazs Nemeth <email address hidden>
ofproto-dpif-upcall: Don't set statistics to 0 when they jump back.
The only way that stats->{n_packets,n_bytes} would decrease is due to an
overflow, or if there are bugs in how statistics are handled. In the
past, there were multiple issues that caused a jump backward. A
workaround was in place to set the statistics to 0 in that case. When
this happened while the revalidator was under heavy load, the workaround
had an unintended side effect where should_revalidate returned false
causing the flow to be removed because the metric it calculated was
based on a bogus value. Since many of those bugs have now been
identified and resolved, there is no need to set the statistics to 0. In
addition, the (unlikely) overflow still needs to be handled
appropriately. If an unexpected jump does happen, just log it as a
warning.
ovsdb: monitor: Destroy initial change set when new columns added.
Initial change set is preserved for as long as the monitor itself.
However, if a new client has a condition on a column that is not
one of the monitored columns, this column will be added to the
monitor via ovsdb_monitor_condition_bind(). This new column, however,
doesn't exist in the initial change set. That will cause ovsdb-server
to malfunction or crash trying to access non-existent column during
condition evaluation:
Fix that by destroying the initial change set every time new columns
are added to the monitor. This will trigger re-generation of the
change set and it will contain all the necessary columns afterwards.
Fixes: 07c27226ee96 ("ovsdb: Monitor: Keep and maintain the initial change set.")
Reported-by: Han Zhou <email address hidden>
Acked-by: Han Zhou <email address hidden>
Reviewed-by: Simon Horman <email address hidden>
Signed-off-by: Ilya Maximets <email address hidden>
ovsdb: Monitor: Keep and maintain the initial change set.
Change sets in OVSDB monitor are storing all the changes that happened
between a particular transaction ID and now. Initial change set
basically contains all the data.
On each monitor request a new initial change set is created by creating
an empty change set and adding all the database rows. Then it is
converted into JSON reply and immediately untracked and destroyed.
This is causing significant performance issues if many clients are
requesting new monitors at the same time. For example, that is
happening after database schema conversion, because conversion triggers
cancellation of all monitors. After cancellation, every client sends
a new monitor request. The server then creates a new initial change
set, sends a reply, destroys initial change set and repeats that for
each client. On a system with 200 MB database and 500 clients,
cluster of 3 servers spends 20 minutes replying to all the clients
(200 MB x 500 = 100 GB):
timeval|WARN|Unreasonably long 1201525ms poll interval
Of course, all the clients are already disconnected due to inactivity
at this point. When they are re-connecting back, server accepts new
connections one at a time, so inactivity probes will not be triggered
anymore, but it still takes another 20 minutes to handle all the
incoming connections.
Let's keep the initial change set around for as long as the monitor
itself exists. This will allow us to not construct a new change set
on each new monitor request and even utilize the JSON cache in some
cases. All that at a relatively small maintenance cost, since we'll
need to commit changes to one extra change set on every transaction.
Measured memory usage increase due to keeping around a shallow copy
of a database is about 10%. Measured CPU usage difference during
normal operation is negligible.
With this change it takes only 30 seconds to send out all the monitor
replies in the example above. So, it's a 40x performance improvement.
On a more reasonable setup with 250 nodes, the process takes up to
8-10 seconds instead of 4-5 minutes.
Conditional monitoring will benefit from this change as well, however
results might be less impressive due to lack of JSON cache.
Reviewed-by: Simon Horman <email address hidden>
Acked-by: Dumitru Ceara <email address hidden>
Signed-off-by: Ilya Maximets <email address hidden>
fatal-signal: Don't share signal fds/handles with forked process.
The signal_fds pipe and wevent are a mechanism to wake up the process
after it received a signal and stored the number for the future
processing. They are not intended for inter-process communication.
However, in the current code, descriptors are not closed on fork().
The main scenario where we use fork() is a monitor process. Monitor
doesn't actually use poll loops and doesn't wait on the descriptor.
But when a child process is killed, it (child) sends a byte to itself,
then it wakes up due to POLLIN on the pipe and terminates itself after
processing all the callbacks. The byte stays unread. And the pipe is
still open in the monitor process. When child dies, the monitor wakes
up and forks again. New child inherits the same pipe that still
contains unread data. This data is never read, so the child will
constantly wake itself up for no reason.
Interestingly enough raise(SIGSEGV) doesn't immediately kill the
process. The execution continues til the end of a signal handler,
so we're still able to write a byte to a pipe even in this case.
Presumably because we don't have SA_NODEFER.
Fix the issue by re-creating the pipe/event on fork. This way
every new child will have its own notification channel and will
not wake up any other processes.
There was already an attempt to fix the issue, but it didn't get a
follow up (see the reported-at tag). This is an alternative solution.
Fixes: ff8decf1a318 ("daemon: Add support for process monitoring and restart.")
Reported-at: https://patchwork.ozlabs<email address hidden>/
Acked-by: Eelco Chaudron <email address hidden>
Signed-off-by: Ilya Maximets <email address hidden>
Some venerable AMD processors do not support querying extended features
(EAX=7) with cpuid.
In this case, it is not a programmatic error and the runtime check should
simply return the isa is unsupported.
Reported-by: Davide Repetto <email address hidden>
Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=2211747
Fixes: b366fa2f4947 ("dpif-netdev: Call cpuid for x86 isa availability.")
Signed-off-by: David Marchand <email address hidden>
Signed-off-by: Ilya Maximets <email address hidden>
The tc module combines the use of the `tc_transact` helper
function for communication with the in-kernel tc infrastructure
with assertions on the reply data by `ofpbuf_at_assert` on the
received data prior to further processing.
With the presence of bugs on the kernel side, we need to treat
the kernel as an unreliable service provider and replace assertions
on the reply from it with checks to avoid a fatal crash of OVS.
For the record, the symptom of the crash is this in the log:
EMER|include/openvswitch/ofpbuf.h:194:
assertion offset + size <= b->size failed in ofpbuf_at_assert()
And an excerpt of the backtrace looks like this:
ofpbuf_at_assert (offset=16, size=20) at include/openvswitch/ofpbuf.h:194
tc_replace_flower at lib/tc.c:3223
netdev_tc_flow_put at lib/netdev-offload-tc.c:2096
netdev_flow_put at lib/netdev-offload.c:257
parse_flow_put at lib/dpif-netlink.c:2297
try_send_to_netdev at lib/dpif-netlink.c:2384
GCC now reports uninitialized warnings from function return values.
../lib/netdev-dpdk.c: In function 'netdev_dpdk_mempool_configure':
../lib/netdev-dpdk.c:964:22: warning: 'dmp' may be used uninitialized [-Wmaybe-uninitialized]
964 | dev->dpdk_mp = dmp;
| ~~~~~~~~~~~~~^~~~~
../lib/netdev-dpdk.c:854:21: note: 'dmp' was declared here
854 | struct dpdk_mp *dmp, *next;
| ^~~
NB: this looks like a false positive, gcc 13 probably fails to see the link
between reuse and dmp in dpdk_mp_get().
Reviewed-by: David Marchand <email address hidden>
Signed-off-by: Robin Jarry <email address hidden>
Signed-off-by: Ilya Maximets <email address hidden>