Segfault in gcache::RingBuffer::get_new_buffer()

Bug #1152565 reported by Tomasz Klekot
This bug affects 2 people
Affects                                  Status        Importance  Assigned to     Milestone
Galera (status tracked in 3.x)
  2.x                                    Fix Released  Medium      Alex Yurchenko
  3.x                                    Fix Released  Medium      Alex Yurchenko
Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC (status tracked in 5.6)
  5.5                                    Fix Released  Undecided   Unassigned
  5.6                                    Fix Released  Undecided   Unassigned

Bug Description

I noticed that, out of nowhere, our cluster shrank from 3 servers to 2. It seems that one of our nodes crashed approximately 2 hours after it synced with the rest of the cluster (the servers run in the GMT+2 time zone).
The server was not overloaded and has plenty of memory:
# free -m
             total       used       free     shared    buffers     cached
Mem:         15922      12086       3836          0        129       6459
-/+ buffers/cache:       5497      10425
Swap:         1952        132       1820

The server did not face any kind of performance issues, just crashed.

130307 22:09:35 [Note] WSREP: sst_donor_thread signaled with 0
130307 22:09:35 [Note] WSREP: Flushing tables for SST...
130307 22:09:35 [Note] WSREP: Provider paused at f4fc2dac-8761-11e2-0800-6accc9ba6bc6:4558
130307 22:09:35 [Note] WSREP: Tables flushed.
130307 22:13:04 [Note] WSREP: Provider resumed.
130307 22:13:04 [Note] WSREP: 1 (): State transfer to 0 () complete.
130307 22:13:04 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 4587)
130307 22:13:04 [Note] WSREP: Member 1 () synced with group.
130307 22:13:04 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 4587)
130307 22:13:04 [Note] WSREP: Synchronized with group, ready for connections
130307 22:13:04 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130307 22:13:08 [Note] WSREP: 0 (): State transfer from 1 () complete.
130307 22:13:08 [Note] WSREP: Member 0 () synced with group.
21:59:05 UTC - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Please help us make Percona Server better by reporting any
bugs at http://bugs.percona.com/

key_buffer_size=1048576
read_buffer_size=131072
max_used_connections=292
max_threads=2000
thread_count=14
connection_count=14
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 262426320 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0 thread_stack 0x40000
/usr/sbin/mysqld(my_print_stacktrace+0x35)[0x7c83f5]
/usr/sbin/mysqld(handle_fatal_signal+0x4a4)[0x6a1e04]
/lib64/libpthread.so.0(+0xf500)[0x7f4b6f55a500]
/usr/lib64/libgalera_smm.so(_ZN6gcache10RingBuffer14get_new_bufferEl+0xe5)[0x7f4b6d759545]
/usr/lib64/libgalera_smm.so(_ZN6gcache10RingBuffer6mallocEl+0x39)[0x7f4b6d7597e9]
/usr/lib64/libgalera_smm.so(_ZN6gcache6GCache6mallocEl+0x97)[0x7f4b6d75aed7]
/usr/lib64/libgalera_smm.so(gcs_defrag_handle_frag+0x92)[0x7f4b6d806cb2]
/usr/lib64/libgalera_smm.so(gcs_core_recv+0x489)[0x7f4b6d80c5c9]
/usr/lib64/libgalera_smm.so(+0x156080)[0x7f4b6d813080]
/lib64/libpthread.so.0(+0x7851)[0x7f4b6f552851]
/lib64/libc.so.6(clone+0x6d)[0x7f4b6e9d911d]
You may download the Percona Server operations manual by visiting
http://www.percona.com/software/percona-server/. You may find information
in the manual which will help you identify the cause of the crash.
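
The frames in the backtrace above use mangled C++ names, which makes the crashing call chain harder to read. The following is a minimal sketch, not a general demangler: it assumes frames in the `module(symbol+offset)[address]` form shown in this report and only decodes simple nested names with a few builtin parameter types (Itanium C++ ABI style). The helper names `parse_frame` and `demangle_simple` are hypothetical; a real tool such as `c++filt` handles the full grammar.

```python
import re

# Matches frames such as:
#   /usr/lib64/libgalera_smm.so(_ZN6gcache10RingBuffer6mallocEl+0x39)[0x7f4b6d7597e9]
FRAME_RE = re.compile(
    r"^(?P<module>\S+?)\((?P<symbol>[^+)]*)\+(?P<offset>0x[0-9a-f]+)\)"
    r"\[(?P<address>0x[0-9a-f]+)\]$"
)

# Small subset of Itanium ABI builtin type codes (enough for this trace).
BUILTIN_TYPES = {"v": "void", "i": "int", "j": "unsigned int",
                 "l": "long", "m": "unsigned long"}


def parse_frame(line):
    """Split one backtrace line into module, symbol, offset and address."""
    match = FRAME_RE.match(line.strip())
    return match.groupdict() if match else None


def demangle_simple(symbol):
    """Decode _ZN<len><name>...E<params>; return the input if unsupported."""
    if not symbol.startswith("_ZN"):
        return symbol  # plain C symbol or unsupported encoding
    pos, parts = 3, []
    while pos < len(symbol) and symbol[pos].isdigit():
        end = pos
        while end < len(symbol) and symbol[end].isdigit():
            end += 1
        length = int(symbol[pos:end])
        parts.append(symbol[end:end + length])
        pos = end + length
    if pos >= len(symbol) or symbol[pos] != "E":
        return symbol  # templates, substitutions, etc. are not handled
    params = symbol[pos + 1:]
    if any(code not in BUILTIN_TYPES for code in params):
        return symbol
    args = [] if params == "v" else [BUILTIN_TYPES[code] for code in params]
    return "::".join(parts) + "(" + ", ".join(args) + ")"


frame = parse_frame(
    "/usr/lib64/libgalera_smm.so"
    "(_ZN6gcache10RingBuffer14get_new_bufferEl+0xe5)[0x7f4b6d759545]"
)
print(demangle_simple(frame["symbol"]), "+", frame["offset"])
```

Read this way, the trace shows `gcs_core_recv` calling `gcs_defrag_handle_frag`, which allocates through `gcache::GCache::malloc(long)` and `gcache::RingBuffer::malloc(long)` before faulting in `gcache::RingBuffer::get_new_buffer(long)`.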

We are running:
Server version: 5.5.28-log Percona XtraDB Cluster (GPL), wsrep_23.7.r3821
CentOS release 6.3 (Final)
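
Note that the WSREP log lines are stamped in server local time, while the crash line is stamped in UTC. The sketch below reconciles the two to check the "approximately 2 hours" estimate. It rests on two assumptions that the report does not state explicitly: that the local offset was a fixed GMT+2, and that the crash happened on the same UTC date as the sync (the crash line carries no date).

```python
from datetime import datetime, timedelta, timezone

# Assumption: the server's local time zone is a fixed GMT+2 offset.
LOCAL_TZ = timezone(timedelta(hours=2))


def parse_local(stamp):
    """Parse a 'YYMMDD HH:MM:SS' log timestamp as server local time."""
    return datetime.strptime(stamp, "%y%m%d %H:%M:%S").replace(tzinfo=LOCAL_TZ)


# "Synchronized with group, ready for connections" (local time).
synced = parse_local("130307 22:13:04")

# "21:59:05 UTC - mysqld got signal 11"; the date is an assumption.
crashed = datetime(2013, 3, 7, 21, 59, 5, tzinfo=timezone.utc)

time_to_crash = crashed - synced
print("synced (UTC):", synced.astimezone(timezone.utc).time())
print("time from sync to crash:", time_to_crash)
```

Under these assumptions the node crashed a little under two hours after reaching the SYNCED state, which is consistent with the description above.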

Alex Yurchenko (ayurchen) wrote :

Most likely the gcache file got corrupted on disk. Is it reproducible?

Tomasz Klekot (tomksoft) wrote :

So far everything works as it should, and I have not managed to observe this problem again. I just rejoined that node (resynced its data) to get it operational again.
Regarding file corruption: I am pretty sure it could not be hardware corruption. The OS does not report any filesystem errors, and the server runs two brand-new Intel SSDs in RAID 1.

Alex Yurchenko (ayurchen) wrote :

There was a case where a customized rsync SST script corrupted the gcache file, so this can't be ruled out. Another possibility is a bug in the gcache code, but that is equally improbable: so far this is the only such report, and the code has not been changed for years.

Alex Yurchenko (ayurchen) wrote :

summary: - MySQLd crashed during normal operations
+ Segfault in gcache::RingBuffer::get_new_buffer()

Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports, so this bug report has been migrated to: https://jira.percona.com/browse/PXC-1306
