NOTE: The MySQL Cluster repositories on Launchpad have been frozen. Future updates will be published on github instead.
- 5125. By Ole John Aske <email address hidden> on 2015-01-21
Fix regression introduced by fix for bug#19524096.
That fix caused ndb_global_
schema_ lock_error to fail as mysqld
ended up in a deadlock between TransporterFacade and ClusterMgr mutex
:is_cluster_ completely_ unavailable( ). There
likely are other deadlock scenarios also.
The fix is to remove the Guard locking clusterMgrThrea
The rational and (lack of) correctness for this is discussed in
a mail pasted below.
There are also a few small improvements to ::is_cluster_
completely_ unavailable( )
1. Dont copy entire 'trp_node' struct but read it by refference instead.
2. Replace usage of 'm_impl-
>m_transporter_ facade' with 'getTransporter()'
as this is the pattern used elsewhere in this code block.
--------------- Pasted mail with some background ----------------
Subject: (Dead-) locking of ClusterMgr:
:theNodes[ ] structures
Writing this as a note to myself and others after having analyzed
the failure of ndb_global_
schema_ lock_error. test. That test
timed out as mysqld ended up in a deadlock between
:clusterMgrThre adMutex and TransporterFaca de::theMutexPtr
ClusterMgr maintains node state & info in theNodes. From external
components, this info is access through ClusterMgr:
theNodes are only updated from within ClustMgr, all external access
is read only.
Updates to theNodes are partly done from withing several ClustMgr::exec<foo>
methods, and partly from ClusterMgr:
:reportConnecte d() / ::reportDisconn ected() .
All updates seems to be done with ClusterMgr:
:clusterMgrThre adMutex locked.
Several ClusterMgr methods are available for inspecting node status
from other components, these all use information from theNodes.
Some of the more commonly used of these methods are:
de::get_ node_alive( n) (Reads 'theClusterMgr- >getNodeInfo( n)')
de::get_ an_alive_ node() (Implemented on top of ::get_node_ alive(n) )
:get_node_ stopping( n) (Reads 'theClusterMgr- >getNodeInfo( n)')
:get_node_ alive(n) (Reads 'theClusterMgr- >getNodeInfo( n)')
- NdbImpl::get<foo> ...A lot more node state getters....
The locking schema used to provide atomicity of theNodes for the above
methods are .... mixed, and not really defined at all as far as I can tell.
e::dictSignal( ), NdbDictInterfac e::forceGCPWait () & NdbDictInterfac e::listObjects( ):
Before calling get_node_alive() / ::get_an_
alive_node( ), a PollGuard is set.
PollGuard calls trp_client:
:start_ poll() which is different pre/post 7.3:
- Pre 7.3, trp_client:
de::start_ poll -
-> lock TransporterFacade mutex (a global mutex)
- 7.3 -> trp_client:
:start_ poll, lock trp_client m_mutex.
de::start_ poll ...no locking, and mutex gone in this version
Observations: There are no locking of ClusterMgr:
:clusterMgrThre adMutex here,
neither pre/post 7.3 .
connection: :wait_until_ ready() ,
connection_ impl::get_ next_alive_ node()
connection: :get_no_ ready() :
These all sets the TransporterFaca
Sets a PollGuard as above, which either lock the TransporterFacade or
the trp_client mutex
Documents in comments that TransporterFacade mutex should be locked prior to call
So this has become a total mess. It might seem like that it prior
to 7.3 was the intention that TransporterFacade mutex should be held
when accessing theNodes, or any methods that access it itself.
After 7.3 a mix of TransporterFacade and Trp_client mutex is effectively used
Updating looks like it sets the ClusterMgr mutex to protect these, which will
of course not work as the reader doesnt set this mutex. However, it could be
that all updates happens at places where it is called from the TransporterFacade.
Here we *used to* hold the TransporterFacade mutex prior to 7.3, which would make
some sense. This all needs more investigation .... and some documentation in the code...
In the current state there certainly are no effective concurrency protection of
the node state info in 7.3+ , It could be that it work in 7.1 & 7.2
On top of this the fix for bug#19524096 introduced
:is_cluster_ completely_ unavailable( ) which is also based on
the nodes status available in theNodes. Probably in good faith,
backed by that updates of theNodes was protected with clusterMgrThrea
that method grabs that lock before accessing theNodes info.
Actually this method is *the only* node state getters which
does any explicit locking. ... and it was doomed to fail as this
was completely untested territory. See other mail about how
it could deadlock with the TranporterFacade mutex.
Sidenote: any other node state getters attempting to follow the
same locking pattern had likely deadlocked the same way.
completely_ unavailable( ) is called from within code
which also calls ::get_node_alive() and ::get_an_
without using any locking protection for these. Based on this I will
propose a patch for the bug#19524096 regression, which simply
removes the mutex locking from ::is_cluster_
completely_ unavailable( ).
This will not be entirely kosher based on how the shared node state
structs should have been mutex protected. However, based on my discussion
above, there are already so many violations in this area that a single
more should not matter. A bigger effort should be taken to clean up this entirely.
- 5123. By Mauritz Sundell <email address hidden> on 2014-12-11
Bug#19582925 ENHANCED DETECTION AND DUMP OF CORRUPT OR UNSUPPORTED MESSAGES
- 5122. By Pekka Nousiainen <email address hidden> on 2014-12-09
testBasic -n Bug16834333 : fix dict cache crash
- 5121. By Ole John Aske <email address hidden> on 2014-12-08
Updated fix for bug#20113145:
:doSend( ) Fails to update send buffers after a partly failed send
If some data were sent (sum_sent > 0) before my_socket_writev
returned with an error, we returned immediately without
calling iovec_data_sent() which maintains the send buffers.
Thus another ::doSend() will fetch some of the already sent
data. Iff that send does not fail, the prev sent data is
actually sent twice - Resulting in garbage received by the
- 5120. By Ole John Aske <email address hidden> on 2014-12-08
Follow up to fix for bug#bug#20166585:
EXCESSIVE MEMORY USAGE: NDBSCANOPERATIO
N::CLOSE( ) DOES NOT RELEASE BUFFER MEM.
Frazer had some concerns that the delete of m_scan_buffer is
done by ::close_impl while we hold a poll_guard on the client.
This lock was not required for doing the delete, so the
delete operation is moved from ::close_impl()
into ::close() to a place immediately after the
return from close_impl(). (and the release of poll_guard.)
- 5119. By Ole John Aske <email address hidden> on 2014-12-08
Update to recent fix where we failed to close scan cursors
at the end of their lifetime.
As we are not interested in these NdbScanOperation objects
anymore at all, we could as well tell ::close() to
even release the NdbScanOperations at this point.
- 5118. By Ole John Aske <email address hidden> on 2014-12-08
Update 'v3' fix for bug#20166585:
EXCESSIVE MEMORY USAGE: NDBSCANOPERATIO
N::CLOSE( ) DOES NOT RELEASE BUFFER MEM.
The 'scan buffer', used to store rows fetched from a scan batch,
can easily become rather large (Multi MB). It is thus important
that we dont keep this structure longer than necessary.
The ::close() of a NdbScanOperation terminates navigation of
the scan result set. Thus, this is a natural place to release
this scan buffer.
This patch moves the scan buffer release point from
n::release to ::close().
NOTE: ::release() terminates the lifecycle of the
ScanOperation itself. ::close() terminates the lifecycle
of the result set - Thus, ::close() is the natural place to
release such buffer memory.
- 5117. By Ole John Aske <email address hidden> on 2014-12-06
Fix for various AutoTest 'testIndex ...' crashing due to
excessive memory usage.
The verifyIndex() utility used by several of the testIndex testcases
opened a NdbIndexScanOpe
ration for every rows fetched from
ration was not 'closed' at the end of its
lifecycle, and was thus not cleaned up until the transaction, and the
enclosing NdbScanOperation was closed/cleaned.
Thus a waste amount of scan buffer memory us occupied, and tests
are failing due to this.
NOTE: This patch alone is not sufficient to fix the failing AutoTest.
The bug 20166585 has been filed in addition, regarding that a ::close()
should release the scan buffer memory.
- 5116. By Pratik Patodi <email address hidden> on 2014-12-05
Rewriting the mysql-test/
mysql-test- run.pl hack for "BUG#17069285 SEGMENTATION FAULT IN NDB_PRINT_FILE"
- Branch format:
- Branch format 7
- Repository format:
- Bazaar repository format 2a (needs bzr 1.16 or later)
- Stacked on: