MDEV-30421 arbitrary transaction dependency for optimistic parallel slave
This commit initiates a general framework to control data-dependent transactions
on the slave, so that children of a particular dependency class (e.g. defined
by the user via --parallel-ignore-db etc.; see SAMU-54) wait for
their parents' commits before they are allowed to start.
The data dependency condition is handled very similarly to
SPECULATE_WAIT, in fact generalizing the latter as follows.
When two trx:s Ok-i -> Ok depend on each other (say, both updated
the same "O" db), where O is the notion of transaction from SAMU-54
and k-i, k are trx indexes in the binlog such as gtid.seq_no, Ok has to
wait for Ok-i at the same execution point as a normal SPECULATE_WAIT
does. Afterwards Ok registers to wait for the ordered commit (now of Tk-1)
and goes about its business.
The normal workflow of the parent Ok-i remains intact.
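A minimal, single-threaded sketch of the two waits described above, assuming a hypothetical model in which each trx carries a dependency class id (0 meaning none), its parent Ok-i is the nearest earlier trx of the same class, and every trx still commits in binlog order after Tk-1. The names `State`, `parent_of` and `simulate` are illustrative only and are not server code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical per-trx state in this model.
enum class State { WaitingStart, Running, Committed };

// Index of the parent Ok-i: the nearest earlier trx in the same
// dependency class. Class 0 means "no dependency class"; returns -1 if none.
long parent_of(const std::vector<int>& cls, std::size_t k) {
  if (cls[k] == 0) return -1;
  for (std::size_t j = k; j-- > 0;)
    if (cls[j] == cls[k]) return static_cast<long>(j);
  return -1;
}

// Steps all trx:s until every one has committed and returns an event log:
// "S<k>" when trx k starts executing, "C<k>" when it commits.
std::vector<std::string> simulate(const std::vector<int>& cls) {
  std::vector<State> st(cls.size(), State::WaitingStart);
  std::vector<std::string> events;
  std::size_t committed = 0;
  while (committed < cls.size()) {
    for (std::size_t k = 0; k < cls.size(); ++k) {
      if (st[k] == State::WaitingStart) {
        // Dependency wait: start only once the parent Ok-i has committed.
        long p = parent_of(cls, k);
        if (p < 0 || st[static_cast<std::size_t>(p)] == State::Committed) {
          st[k] = State::Running;
          events.push_back("S" + std::to_string(k));
        }
      } else if (st[k] == State::Running) {
        // Ordered commit: wait for the immediate predecessor Tk-1.
        if (k == 0 || st[k - 1] == State::Committed) {
          st[k] = State::Committed;
          events.push_back("C" + std::to_string(k));
          ++committed;
        }
      }
    }
  }
  return events;
}
```

With classes {0, 1, 0, 1}, trx 0..2 start together, while trx 3 starts only after its parent trx 1 has committed: the log is S0 S1 S2 C0 C1 C2 S3 C3.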
TODO:
This PoC commit supports only one dependency class, which is
simulated by "reusing" SET @@session.skip_parallel_replication=1
(as this commit is on top of 10.4 rather than the SAMU-54 branch).
- the limitation needs to be lifted
- a trx's membership in a specific data dependency class needs
flagging in the Gtid with the respective class identifier
- the SAMU-54 user interface to define dependencies needs more
thought.
E.g.:
1. databases A,B listed in --parallel-ignore-db=A,B should represent
two disjoint dependency classes (although in practice they may not
be, e.g. through FKs). If that's acceptable, A and B should map
to different dependency class identifiers in the Gtid.
2. Dependencies between trx:s are computable in ROW format, which would
render void the purpose of the --parallel-* rules/hints, although the
task of 'computing' them may need some design effort.
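One way the Gtid flagging and the A,B example could fit together, sketched under assumptions: each database listed in a --parallel-ignore-db style string gets its own class id, and a hypothetical `GtidDepInfo` record stands in for the Gtid payload. `GtidDepInfo`, `parse_dependency_classes` and `make_gtid` are illustrative names that do not exist in the server.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <sstream>
#include <string>

// Hypothetical: per-trx Gtid payload extended with a dependency class id.
struct GtidDepInfo {
  uint64_t seq_no;
  uint32_t dep_class;  // 0 = no dependency class
};

// Assign a distinct class id (1, 2, ...) to each database in a list like "A,B",
// treating the listed databases as disjoint dependency classes.
std::map<std::string, uint32_t> parse_dependency_classes(const std::string& list) {
  std::map<std::string, uint32_t> classes;
  std::stringstream ss(list);
  std::string db;
  uint32_t next_id = 1;
  while (std::getline(ss, db, ',')) {
    if (!db.empty() && classes.find(db) == classes.end())
      classes[db] = next_id++;
  }
  return classes;
}

// Flag a trx touching `db` with that database's class id (0 if not listed).
GtidDepInfo make_gtid(uint64_t seq_no, const std::string& db,
                      const std::map<std::string, uint32_t>& classes) {
  auto it = classes.find(db);
  return {seq_no, it == classes.end() ? 0u : it->second};
}
```

Here "A,B" maps A to class 1 and B to class 2, so trx:s on A never wait for trx:s on B; whether that is acceptable given FKs is exactly the open question in point 1.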
Extended other tests to wait for the slave SQL thread
to notify worker threads to abort before allowing paused
transactions to continue. Otherwise there would be a
potential race condition in which a thread could continue
to the commit stage before noticing the abort.
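A toy model of the race described above, assuming the paused worker checks for an abort exactly once, when it is resumed (a simplification; the real check points live in the server's worker code). `Event` and `worker_commits` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <vector>

// Hypothetical events a test can trigger, in the order it triggers them.
enum class Event { NotifyAbort, ResumePaused };

// Returns true if the paused worker reaches the commit stage. The worker
// checks for an abort once, when resumed; after that it proceeds to commit.
bool worker_commits(const std::vector<Event>& order) {
  bool abort_notified = false;
  for (Event e : order) {
    if (e == Event::NotifyAbort) {
      abort_notified = true;
    } else {  // Event::ResumePaused
      return !abort_notified;
    }
  }
  return false;  // never resumed, so it never commits
}
```

Resuming before the abort is signalled lets the worker slip through to commit; waiting for the notification first, as the extended tests now do, closes that window.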
* Code cleanup
* Reverted the rpl_parallel_optimistic_error_stop test to its old
state, as the non-determinism was fixed by a later commit
* Moved the previously general test case from
rpl_stm_par_stop_slave_quick to the _common test file so that
the behavior is validated in row format as well
* Added a T,N,T,N,T test case to ensure transactions are
executed up to the last N, and not after
* Misc test case improvements
Note that the rpl_parallel.test wait_for_done() debug_sync comment
has not yet been addressed.
Instead of introducing GCOs into wait_for_prior_commit2()
to skip the wait without unregistering it, this patch
uses rgi->worker_error to skip this call, so any thread
that errors will wait in the final thd->wait_for_prior_commit()
call of finish_event_group().
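A hedged sketch of this control flow, using hypothetical stand-ins `GroupInfo` and `ThreadCtx` for rpl_group_info and THD. The real wait_for_prior_commit() blocks until another thread commits; here it only records where the wait happened and then unregisters.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins; not MariaDB's real types.
struct GroupInfo {
  bool worker_error = false;
};

struct ThreadCtx {
  bool waiting_registered = true;  // registered to wait for the prior commit
  std::vector<std::string> trace;

  // Performs the (modeled) wait if still registered, then unregisters.
  void wait_for_prior_commit(const char* where) {
    if (waiting_registered) {
      trace.push_back(std::string("wait@") + where);
      waiting_registered = false;
    }
  }
};

// Commit-time wait: errored workers skip it but stay registered.
void commit_stage(ThreadCtx& thd, const GroupInfo& rgi) {
  if (!rgi.worker_error) thd.wait_for_prior_commit("commit");
}

// Final wait is a no-op if the commit-time wait already happened,
// and always precedes garbage collection.
void finish_event_group(ThreadCtx& thd) {
  thd.wait_for_prior_commit("finish_event_group");
  thd.trace.push_back("gc");
}
```

In both paths the wait happens exactly once and before "gc"; only its location differs.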
If a future transaction in do_gco_wait() is
killed, it will skip all further waits and
finish up, including the garbage collection
of previous GCOs, which may still be in
their commit phase.
This incremental patch fixes this logic
so that transactions from future GCOs do not
remove their wait conditions, and in the
last wait_for_prior_commit() call in
finish_event_group() they wait for the prior
commit before moving on to garbage
collection.
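A simplified model of the GCO garbage-collection hazard, assuming a hypothetical `Gco` record that counts transactions still in their commit phase. Waiting is simulated by marking earlier GCOs as committed, which is what would happen while the killed thread blocks in a real server; `finish_killed_trx` is an illustrative name.

```cpp
#include <cassert>
#include <deque>
#include <vector>

// Hypothetical model of a GCO; not MariaDB's real struct.
struct Gco {
  int id;
  int uncommitted;  // transactions of this GCO still committing
};

// A killed trx in GCO `own_id` finishes up and garbage-collects older GCOs.
// Returns the ids of GCOs freed while they still had committing transactions.
std::vector<int> finish_killed_trx(std::deque<Gco>& gcos, int own_id,
                                   bool wait_first) {
  if (wait_first) {
    // Stand-in for blocking in wait_for_prior_commit(): the earlier
    // transactions would commit while this thread waits.
    for (Gco& g : gcos)
      if (g.id < own_id) g.uncommitted = 0;
  }
  std::vector<int> freed_while_busy;
  while (!gcos.empty() && gcos.front().id < own_id) {
    if (gcos.front().uncommitted > 0)
      freed_while_busy.push_back(gcos.front().id);
    gcos.pop_front();
  }
  return freed_while_busy;
}
```

Without the wait, GCO 2 in a queue {1, 2 (one trx committing), 3} is freed while still busy; with the wait, nothing unsafe is freed.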
Calling rgi->unmark_start_commit() in
signal_error_to_sql_driver_thread() would cause the
next GCO to start, creating a race condition in which
the previous GCOs could be garbage-collected before their
respective transactions had finished.
This commit proposes a fix that skips the
unmark_start_commit() call if the thread is killed/aborting.
Note that the new variable rgi->aborted will later be replaced
by a local variable and a parameter to signal_error_to..().
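A minimal sketch of the proposed guard, using a hypothetical, simplified stand-in for rpl_group_info that carries the `aborted` flag named above. The `did_mark_start_commit` field and both function bodies are illustrative assumptions, not the server's code.

```cpp
#include <cassert>

// Hypothetical, simplified stand-in for rpl_group_info.
struct GroupInfo {
  bool did_mark_start_commit = true;  // trx has told its GCO it started committing
  bool aborted = false;               // thread is killed/aborting
};

// Modeled undo of the "start commit" mark.
void unmark_start_commit(GroupInfo& rgi) { rgi.did_mark_start_commit = false; }

// Error path: unmark only when not aborting, so the next GCO is not
// released while earlier transactions are still committing.
void signal_error_to_sql_driver_thread(GroupInfo& rgi) {
  if (!rgi.aborted) unmark_start_commit(rgi);
}
```

Replacing `rgi.aborted` with a local variable passed as a parameter, as planned, would keep the same check while removing the extra field.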