Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC

Segfault in gcache::RingBuffer::get_new_buffer()

Bug #1152565 reported by Tomasz Klekot on 2013-03-08

This bug affects 2 people

Affects	Status	Importance	Assigned to	Milestone
Galera	Status tracked in 3.x
2.x	Fix Released	Medium	Alex Yurchenko	Galera 25.2.9
3.x	Fix Released	Medium	Alex Yurchenko	Galera 25.3.5
Percona XtraDB Cluster	Status tracked in 5.6
5.5	Fix Released	Undecided	Unassigned	Percona XtraDB Cluster 5.5.37-25.10
5.6	Fix Released	Undecided	Unassigned	Percona XtraDB Cluster galera-3.5

Bug Description

I noticed that, out of nowhere, our cluster shrank from 3 servers to 2. It seems one of our nodes crashed approximately 2 hours after it synced with the rest of the cluster (the servers run in the GMT+2 time zone).
The server was not overloaded and has plenty of memory:
# free -m
             total       used       free     shared    buffers     cached
Mem:         15922      12086       3836          0        129       6459
-/+ buffers/cache:       5497      10425
Swap:         1952        132       1820

The server did not face any kind of performance issues, just crashed.

130307 22:09:35 [Note] WSREP: sst_donor_thread signaled with 0
130307 22:09:35 [Note] WSREP: Flushing tables for SST...
130307 22:09:35 [Note] WSREP: Provider paused at f4fc2dac-8761-11e2-0800-6accc9ba6bc6:4558
130307 22:09:35 [Note] WSREP: Tables flushed.
130307 22:13:04 [Note] WSREP: Provider resumed.
130307 22:13:04 [Note] WSREP: 1 (): State transfer to 0 () complete.
130307 22:13:04 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 4587)
130307 22:13:04 [Note] WSREP: Member 1 () synced with group.
130307 22:13:04 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 4587)
130307 22:13:04 [Note] WSREP: Synchronized with group, ready for connections
130307 22:13:04 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130307 22:13:08 [Note] WSREP: 0 (): State transfer from 1 () complete.
130307 22:13:08 [Note] WSREP: Member 0 () synced with group.
21:59:05 UTC - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Please help us make Percona Server better by reporting any
bugs at http://bugs.percona.com/

key_buffer_size=1048576
read_buffer_size=131072
max_used_connections=292
max_threads=2000
thread_count=14
connection_count=14
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 262426320 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0 thread_stack 0x40000
/usr/sbin/mysqld(my_print_stacktrace+0x35)[0x7c83f5]
/usr/sbin/mysqld(handle_fatal_signal+0x4a4)[0x6a1e04]
/lib64/libpthread.so.0(+0xf500)[0x7f4b6f55a500]
/usr/lib64/libgalera_smm.so(_ZN6gcache10RingBuffer14get_new_bufferEl+0xe5)[0x7f4b6d759545]
/usr/lib64/libgalera_smm.so(_ZN6gcache10RingBuffer6mallocEl+0x39)[0x7f4b6d7597e9]
/usr/lib64/libgalera_smm.so(_ZN6gcache6GCache6mallocEl+0x97)[0x7f4b6d75aed7]
/usr/lib64/libgalera_smm.so(gcs_defrag_handle_frag+0x92)[0x7f4b6d806cb2]
/usr/lib64/libgalera_smm.so(gcs_core_recv+0x489)[0x7f4b6d80c5c9]
/usr/lib64/libgalera_smm.so(+0x156080)[0x7f4b6d813080]
/lib64/libpthread.so.0(+0x7851)[0x7f4b6f552851]
/lib64/libc.so.6(clone+0x6d)[0x7f4b6e9d911d]
You may download the Percona Server operations manual by visiting
http://www.percona.com/software/percona-server/. You may find information
in the manual which will help you identify the cause of the crash.
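
The libgalera_smm.so frames in the backtrace above are mangled C++ names. As a reading aid, here is a minimal Python sketch that decodes only the simple nested-name form seen in this trace (`_ZN<length><name>...E` followed by builtin parameter codes, per the Itanium C++ ABI). The function name `demangle_nested` and the type table are illustrative and not part of Galera or any library; for anything beyond this narrow subset, a real demangler such as `c++filt` is the better tool.

```python
import re

# Itanium C++ ABI codes for a few builtin parameter types (a small subset).
BUILTIN_TYPES = {
    "v": "void", "b": "bool", "c": "char", "i": "int",
    "j": "unsigned int", "l": "long", "m": "unsigned long",
}


def demangle_nested(symbol):
    """Decode a symbol of the form _ZN<len><name>...E<builtin codes>.

    Returns None for anything outside that narrow subset
    (plain C symbols, templates, substitutions, and so on).
    """
    if not symbol.startswith("_ZN"):
        return None
    pos = 3
    parts = []
    # Each name component is a decimal length followed by that many characters.
    while pos < len(symbol) and symbol[pos].isdigit():
        digits = re.match(r"\d+", symbol[pos:]).group()
        pos += len(digits)
        length = int(digits)
        parts.append(symbol[pos:pos + length])
        pos += length
    # The nested name must be closed by 'E'.
    if not parts or pos >= len(symbol) or symbol[pos] != "E":
        return None
    codes = symbol[pos + 1:]
    if not codes or any(code not in BUILTIN_TYPES for code in codes):
        return None
    params = [BUILTIN_TYPES[code] for code in codes]
    if params == ["void"]:
        params = []  # a lone 'v' means an empty parameter list
    return "::".join(parts) + "(" + ", ".join(params) + ")"


# The three mangled frames from the backtrace, innermost first.
for mangled in (
    "_ZN6gcache10RingBuffer14get_new_bufferEl",
    "_ZN6gcache10RingBuffer6mallocEl",
    "_ZN6gcache6GCache6mallocEl",
):
    print(demangle_nested(mangled))
```

Read this way, the crash is in gcache::RingBuffer::get_new_buffer(long), reached through gcache::RingBuffer::malloc(long) and gcache::GCache::malloc(long), which were called from gcs_defrag_handle_frag.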

We are running:
Server version: 5.5.28-log Percona XtraDB Cluster (GPL), wsrep_23.7.r3821
CentOS release 6.3 (Final)
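
The log mixes two clocks: the WSREP lines carry local timestamps, while the crash line is stamped in UTC. The short Python sketch below lines them up. It rests on two assumptions that the report does not state outright: that the WSREP timestamps are in the servers' GMT+2 local time, and that the crash happened on the same calendar day in UTC (2013-03-07), since the crash line carries no date.

```python
from datetime import datetime, timedelta, timezone

# Assumption: WSREP log lines use the servers' local time, GMT+2.
LOCAL_TZ = timezone(timedelta(hours=2))

# Last WSREP entry before the crash: "130307 22:13:08" (YYMMDD HH:MM:SS).
last_sync_local = datetime.strptime(
    "130307 22:13:08", "%y%m%d %H:%M:%S"
).replace(tzinfo=LOCAL_TZ)

# Crash line: "21:59:05 UTC". The date is assumed, not taken from the log.
crash_utc = datetime(2013, 3, 7, 21, 59, 5, tzinfo=timezone.utc)

crash_local = crash_utc.astimezone(LOCAL_TZ)
elapsed = crash_utc - last_sync_local

print(crash_local.strftime("%Y-%m-%d %H:%M:%S"))  # 2013-03-07 23:59:05
print(elapsed)                                    # 1:45:57
```

Under those assumptions the crash came about 1 hour 46 minutes after the last sync message, which matches the "approximately 2 hours" in the description.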

Related branches

lp:galera/2.x

lp:galera

Ready for review for merging into lp:~dbpercona/galera/Bug1348714

David Bennett: Pending requested 2014-07-25

Alex Yurchenko (ayurchen) wrote on 2013-03-26:

Most likely the gcache file got corrupted on disk. Is it reproducible?

Tomasz Klekot (tomksoft) wrote on 2013-03-26:

So far everything works as it should, and I have not observed this problem again. I just rejoined that node (resynced its data) to get it operational again.
Regarding file corruption: I am fairly sure it could not be hardware corruption. The OS does not report any filesystem errors, and the server runs two brand-new Intel SSDs in RAID 1.

Alex Yurchenko (ayurchen) wrote on 2013-03-26:

There was a case where a customized rsync SST script corrupted the gcache file, so this can't be ruled out. Another possibility is a bug in the gcache code, but that is equally improbable. So far this is the only such report, and the code has not been changed for years.

Kenny Gryp (gryp) wrote on 2013-07-12:

Alex, I've seen more reports about this now:

- https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1152565 (this issue)
- https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1200551 (created 2 hours ago)
- Another customer who has hit the same crash
- https://groups.google.com/forum/#!topic/percona-discussion/274WO-CjmAg
- https://groups.google.com/forum/#!msg/percona-discussion/7uMpkWP5HOs/WSWchtMc-FkJ