Node consistency issues with foreign key grammar

Bug #1260713 reported by Raghavendra D Prabhu
42
This bug affects 9 people
Affects Status Importance Assigned to Milestone
MySQL patches by Codership
In Progress
Undecided
Seppo Jaakola
5.5
New
Undecided
Seppo Jaakola
Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC
Status tracked in 5.6
5.5
Fix Released
Medium
Unassigned
5.6
Fix Released
Medium
Unassigned

Bug Description

Run as:

        perl runall-new.pl --basedir=${MYSQL_BASEDIR} --vardir=$CVARDIR --galera=ms --reporter=Shutdown,Backtrace,ErrorLog,ErrorLogAlarm,Deadlock --grammar=conf/galera/galera_stress-fk.yy --gendata=conf/galera/galera_stress-56.zz --threads=16 --queries=50000 --mysqld=--wsrep-provider=$MYSQL_BASEDIR/lib/libgalera_smm.so --mysqld=--innodb_flush_method=O_DIRECT --seed=time --mysqld=--core-file

The grammars - galera_stree-fk.yy and galera_stress-56.zz are in bzr+ssh://bazaar.launchpad.net/~raghavendra-prabhu/randgen/pxc/ tree.

Node0:

2013-12-11 16:08:06 25095 [ERROR] Slave SQL: Could not execute Delete_rows event on table test.table100_innodb_int; Cannot delete or update a parent row: a foreign key constraint fails (`test`.`u`, CONSTRAINT `u_ibfk_1` FOREIGN KEY (`col_char_12_key`) REFERENCES `table100_innodb_int` (`col_char_12_key`) ON DELETE NO ACTION), Error_code: 1451; handler error HA_ERR_ROW_IS_REFERENCED; the event's master log FIRST, end_log_pos 234, Error_code: 1451
2013-12-11 16:08:06 25095 [Warning] WSREP: RBR event 3 Delete_rows apply warning: 152, 101
2013-12-11 16:08:06 25095 [Warning] WSREP: Failed to apply app buffer: seqno: 101, status: 1
  at galera/src/trx_handle.cpp:apply():340
Retrying 2th time
2013-12-11 16:08:06 25095 [ERROR] Slave SQL: Could not execute Delete_rows event on table test.table100_innodb_int; Cannot delete or update a parent row: a foreign key constraint fails (`test`.`u`, CONSTRAINT `u_ibfk_1` FOREIGN KEY (`col_char_12_key`) REFERENCES `table100_innodb_int` (`col_char_12_key`) ON DELETE NO ACTION), Error_code: 1451; handler error HA_ERR_ROW_IS_REFERENCED; the event's master log FIRST, end_log_pos 234, Error_code: 1451
2013-12-11 16:08:06 25095 [Warning] WSREP: RBR event 3 Delete_rows apply warning: 152, 101
2013-12-11 16:08:06 25095 [Warning] WSREP: Failed to apply app buffer: seqno: 101, status: 1
  at galera/src/trx_handle.cpp:apply():340
Retrying 3th time
2013-12-11 16:08:06 25095 [ERROR] Slave SQL: Could not execute Delete_rows event on table test.table100_innodb_int; Cannot delete or update a parent row: a foreign key constraint fails (`test`.`u`, CONSTRAINT `u_ibfk_1` FOREIGN KEY (`col_char_12_key`) REFERENCES `table100_innodb_int` (`col_char_12_key`) ON DELETE NO ACTION), Error_code: 1451; handler error HA_ERR_ROW_IS_REFERENCED; the event's master log FIRST, end_log_pos 234, Error_code: 1451
2013-12-11 16:08:06 25095 [Warning] WSREP: RBR event 3 Delete_rows apply warning: 152, 101
2013-12-11 16:08:06 25095 [Warning] WSREP: Failed to apply app buffer: seqno: 101, status: 1
  at galera/src/trx_handle.cpp:apply():340
Retrying 4th time
2013-12-11 16:08:06 25095 [ERROR] Slave SQL: Could not execute Delete_rows event on table test.table100_innodb_int; Cannot delete or update a parent row: a foreign key constraint fails (`test`.`u`, CONSTRAINT `u_ibfk_1` FOREIGN KEY (`col_char_12_key`) REFERENCES `table100_innodb_int` (`col_char_12_key`) ON DELETE NO ACTION), Error_code: 1451; handler error HA_ERR_ROW_IS_REFERENCED; the event's master log FIRST, end_log_pos 234, Error_code: 1451
2013-12-11 16:08:06 25095 [Warning] WSREP: RBR event 3 Delete_rows apply warning: 152, 101
2013-12-11 16:08:06 25095 [Warning] WSREP: failed to replay trx: source: 1f04d2aa-627e-11e3-9356-fbdcfc53c70d version: 3 local: 1 state: REPLAYING flags: 129 conn_id: 13 trx_id: 2373 seqnos (l: 110, g: 101, s: 97, d: 100, ts: 16968201220275789)
2013-12-11 16:08:06 25095 [Warning] WSREP: Failed to apply trx 101 4 times
2013-12-11 16:08:06 25095 [ERROR] WSREP: trx_replay failed for: 6, query: void
2013-12-11 16:08:06 25095 [ERROR] Aborting

Node1:

2013-12-11 16:08:07 25141 [ERROR] Slave SQL: Could not execute Delete_rows event on table test.table100_innodb_int; Cannot delete or update a parent row: a foreign key constraint fails (`test`.`u`, CONSTRAINT `u_ibfk_1` FOREIGN KEY (`col_char_12_key`) REFERENCES `table100_innodb_int` (`col_char_12_key`) ON DELETE NO ACTION), Error_code: 1451; handler error HA_ERR_ROW_IS_REFERENCED; the event's master log FIRST, end_log_pos 234, Error_code: 1451
2013-12-11 16:08:07 25141 [Warning] WSREP: RBR event 3 Delete_rows apply warning: 152, 101
2013-12-11 16:08:07 25141 [Warning] WSREP: Failed to apply app buffer: seqno: 101, status: 1
  at galera/src/trx_handle.cpp:apply():340
Retrying 2th time
2013-12-11 16:08:07 25141 [ERROR] Slave SQL: Could not execute Delete_rows event on table test.table100_innodb_int; Cannot delete or update a parent row: a foreign key constraint fails (`test`.`u`, CONSTRAINT `u_ibfk_1` FOREIGN KEY (`col_char_12_key`) REFERENCES `table100_innodb_int` (`col_char_12_key`) ON DELETE NO ACTION), Error_code: 1451; handler error HA_ERR_ROW_IS_REFERENCED; the event's master log FIRST, end_log_pos 234, Error_code: 1451
2013-12-11 16:08:07 25141 [Warning] WSREP: RBR event 3 Delete_rows apply warning: 152, 101
2013-12-11 16:08:07 25141 [Warning] WSREP: Failed to apply app buffer: seqno: 101, status: 1
  at galera/src/trx_handle.cpp:apply():340
Retrying 3th time
2013-12-11 16:08:07 25141 [ERROR] Slave SQL: Could not execute Delete_rows event on table test.table100_innodb_int; Cannot delete or update a parent row: a foreign key constraint fails (`test`.`u`, CONSTRAINT `u_ibfk_1` FOREIGN KEY (`col_char_12_key`) REFERENCES `table100_innodb_int` (`col_char_12_key`) ON DELETE NO ACTION), Error_code: 1451; handler error HA_ERR_ROW_IS_REFERENCED; the event's master log FIRST, end_log_pos 234, Error_code: 1451
2013-12-11 16:08:07 25141 [Warning] WSREP: RBR event 3 Delete_rows apply warning: 152, 101
2013-12-11 16:08:07 25141 [Warning] WSREP: Failed to apply app buffer: seqno: 101, status: 1
  at galera/src/trx_handle.cpp:apply():340
Retrying 4th time
2013-12-11 16:08:07 25141 [ERROR] Slave SQL: Could not execute Delete_rows event on table test.table100_innodb_int; Cannot delete or update a parent row: a foreign key constraint fails (`test`.`u`, CONSTRAINT `u_ibfk_1` FOREIGN KEY (`col_char_12_key`) REFERENCES `table100_innodb_int` (`col_char_12_key`) ON DELETE NO ACTION), Error_code: 1451; handler error HA_ERR_ROW_IS_REFERENCED; the event's master log FIRST, end_log_pos 234, Error_code: 1451
2013-12-11 16:08:07 25141 [Warning] WSREP: RBR event 3 Delete_rows apply warning: 152, 101
2013-12-11 16:08:07 25141 [ERROR] WSREP: Failed to apply trx: source: 1f04d2aa-627e-11e3-9356-fbdcfc53c70d version: 3 local: 0 state: APPLYING flags: 129 conn_id: 13 trx_id: 2373 seqnos (l: 108, g: 101, s: 97, d: 100, ts: 16968201220275789)
2013-12-11 16:08:07 25141 [ERROR] WSREP: Failed to apply trx 101 4 times
2013-12-11 16:08:07 25141 [ERROR] WSREP: Node consistency compromized, aborting...

Full logs:
http://paste.wnohang.net/03031d - node0
http://paste.wnohang.net/62498a - node1

Revision history for this message
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

This may be due to ON DELETE NO ACTION which breaks FK:

2013-12-11 16:08:07 25141 [ERROR] Slave SQL: Could not execute Delete_rows event on table test.table100_innodb_int; Cannot delete or update a parent row: a foreign key constraint fails (`test`.`u`, CONSTRAINT `u_ibfk_1` FOREIGN KEY (`col_char_12_key`) REFERENCES `table100_innodb_int` (`col_char_12_key`) ON DELETE NO ACTION), Error_code: 1451; handler error HA_ERR_ROW_IS_REFERENCED; the event's master log FIRST, end_log_pos 234, Error_code: 1451

So, testing by removing it for now.

tags: added: i38269
Revision history for this message
markus_albe (markus-albe) wrote :

Attaching .cnf and error logs for recent case happening 2 nodes with 5.5.34-55-log Percona XtraDB Cluster (GPL), wsrep_25.9.r3928
Arbiter: Percona-XtraDB-Cluster-galera-2.x86_64 2.8-1.157.rhel6

Case happened near 8:41:31

Here are the tables:

CREATE TABLE `person_completed_activity` (
`person_completed_activity_key` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`person_key` bigint(20) unsigned NOT NULL,
`activity_key` bigint(20) unsigned NOT NULL,
`challenge_key` bigint(20) unsigned DEFAULT NULL,
`team_key` bigint(20) unsigned DEFAULT NULL,
`completed_date` datetime NOT NULL,
`source_key` bigint(20) unsigned DEFAULT NULL,
`description` varchar(300) DEFAULT NULL,
`lock_cnt` int(11) unsigned NOT NULL,
PRIMARY KEY (`person_completed_activity_key`),
KEY `person_key` (`person_key`),
KEY `activity_key` (`activity_key`),
KEY `challenge_key` (`challenge_key`),
KEY `team_key` (`team_key`),
KEY `person_completed_activity_source` (`source_key`),
CONSTRAINT `person_completed_activity_ibfk_1` FOREIGN KEY (`person_key`) REFERENCES `person` (`person_key`),
CONSTRAINT `person_completed_activity_ibfk_2` FOREIGN KEY (`activity_key`) REFERENCES `activity` (`activity_key`),
CONSTRAINT `person_completed_activity_ibfk_3` FOREIGN KEY (`challenge_key`) REFERENCES `challenge` (`challenge_key`),
CONSTRAINT `person_completed_activity_ibfk_4` FOREIGN KEY (`team_key`) REFERENCES `team` (`team_key`),
CONSTRAINT `person_completed_activity_source` FOREIGN KEY (`source_key`) REFERENCES `source` (`source_key`)
) ENGINE=InnoDB AUTO_INCREMENT=30160601 DEFAULT CHARSET=latin1

CREATE TABLE `statistic` (
`statistic_key` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`statistic_type_key` bigint(20) unsigned NOT NULL,
`value` int(11) NOT NULL,
`statistic_date` date DEFAULT NULL,
`modify_user` varchar(15) NOT NULL,
`modify_timestamp` datetime NOT NULL,
`lock_cnt` int(11) NOT NULL,
`person_completed_activity_key` bigint(20) unsigned NOT NULL,
PRIMARY KEY (`statistic_key`),
KEY `statistic_idx001` (`statistic_type_key`),
KEY `statistic_fk006` (`person_completed_activity_key`),
CONSTRAINT `statistic_fk001` FOREIGN KEY (`statistic_type_key`) REFERENCES `statistic_type` (`statistic_type_key`) ON DELETE NO ACTION ON UPDATE NO ACTION,
CONSTRAINT `statistic_fk006` FOREIGN KEY (`person_completed_activity_key`) REFERENCES `person_completed_activity` (`person_completed_activity_key`)
) ENGINE=InnoDB AUTO_INCREMENT=100847135 DEFAULT CHARSET=latin1

Przemek (pmalkowski)
tags: added: i41549
Kenny Gryp (gryp)
tags: added: i42394
Revision history for this message
Kenny Gryp (gryp) wrote :

Can you confirm that foreign_key_checks=0 wasn't enabled in any case? (we need to get slowlog). Then it's easy to reproduce.

Revision history for this message
Alex Yurchenko (ayurchen) wrote :

Ok, so what can we deduce from the last Markus' report:

1) There are serious performance issues on slave:
140414 8:40:38 [Warning] WSREP: last inactive check more than PT1.5S ago (PT4.51434S), skipping check
(probably a result of sync_binlog and innodb_flush_log_at_trx_commit=1)

As a result slave threads may work in a highly concurrent manner, so we may have some serious reces there and transactions that were executed serially on master may be executed in parallel on slave.

2) what can be the cause of failure:
a) Node inconsistency (child table data present on slave were not present on master)
b) foreign key checks turned off on master
c) a bug in mysqld/innodb/xtradb: not all parent keys were included into a writeset, so galera tried to apply two mutually dependent transactions out of order.
d) a bug in galera library so that it failed to correctly calculate dependencies.

- Recent Fred's comments rule out b)
- d) is highly unlikely. The code to compute the dependency is very simple and had it had a such a bug, we'd hear much bigger outcry.
- I imagine that a) is also rather unlikely there.

This leaves us with c). There is nothing special about the failed constraint and no reason why the primary key won't be added to a writeset. However there are many paths that may be invoked in innodb that touch the parent keys and need to be patched to add it to a writeset. Perhaps some sort of statement invoked such path (and that is in agreement with the initial Raghu's report). So if we could have a statement that deleted the child rows, that could have helped.

However a faster fix/workaround would be disabling foreign key checks on slaves (commits are still strictly ordered, so we should never get a situation with dangling parent references)

Revision history for this message
Frederic Descamps (lefred) wrote :

Just an extra comment:

It's very clear from the slow logs and binary logs that on the master a transaction deletes many rows in multiple tables (up to 1k rows or more) and after that a second transaction deletes only one row.

I suspect that on the "slave" as it fails only when wsrep_slave_threads > 1, the second transaction is executed and committed faster.

I don't know if the writeset issue with FK is related of the amount of rows or tables involved.

Revision history for this message
Kenny Gryp (gryp) wrote :

I tried to reproduce such behaviour with a similar schema/queries as with the customer Fred is talking about.

I even set up a node running on a USB stick, which makes it very slow. I wasn't able to get the 2nd transaction to start in parallel on that slow node

Revision history for this message
Ed Fisher (ed-m) wrote :
Download full text (3.8 KiB)

We ran in to this bug (or something similar) in production this morning. The total effect was to bring the entire cluster down, since (as you'll see) two nodes shut themselves down and the third got a Signal 11 2-3 seconds later. There weren't any signs of performance problems on any of the nodes at the time, and the cluster had been stable for quite some time until now.

My only thought here is that Jira could be setting foreign_key_checks=0 before doing some internal maintenance, but that doesn't seem likely. Jira was configured at the time to point to an haproxy listener that spreads writes between all three servers. It no longer is.

Two nodes threw this same error, or very similar:

2014-07-12 07:40:30 13362 [ERROR] Slave SQL: Could not execute Delete_rows event on table jira.AO_E8B6CC_MESSAGE; Cannot delete or update a parent row: a foreign key constraint fails (`jira`.`AO_E8B6CC_MESSAGE_TAG`, CONSTRAINT `fk_ao_e8b6cc_message_tag_message_id` FOREIGN KEY (`MESSAGE_ID`) REFERENCES `AO_E8B6CC_MESSAGE` (`ID`)), Error_code: 1451; handler error HA_ERR_ROW_IS_REFERENCED; the event's master log FIRST, end_log_pos 541, Error_code: 1451
2014-07-12 07:40:30 13362 [Warning] WSREP: RBR event 3 Delete_rows apply warning: 152, 115701779
2014-07-12 07:40:30 13362 [ERROR] WSREP: Failed to apply trx: source: ca04d333-fa74-11e3-94e2-7f5202c0e10c version: 3 local: 0 state: APPLYING flags: 1 conn_id: 24117450 trx_id: 77362570549 seqnos (l: 116300250, g: 115701779, s: 115701751, d: 115701766, ts: 2272383580846541)

One gave me:
https://gist.github.com/gleamicus/820f7b195c049720ea71

Another gave me:
https://gist.github.com/gleamicus/4f56bc7938e6b872ebf1

The third node, however, saw things happen like this:
https://gist.github.com/gleamicus/87cccd694312ce1286c5

-----------

Here's the last delete from AO_E8B6CC_MESSAGE_TAG that appears in the binary logs before the crashes:

# at 279619602
#140712 7:40:28 server id 3298096439 end_log_pos 279619650 CRC32 0xf39f5ed5 GTID [commit=yes]
SET @@SESSION.GTID_NEXT= 'b45609b2-0839-ee1c-76b7-60c776c75110:115691438'/*!*/;
# at 279619650
#140712 7:40:28 server id 3298096439 end_log_pos 279619727 CRC32 0x6b3f9354 Query thread_id=24117450 exec_time=0 error_code=0
SET TIMESTAMP=1405168828/*!*/;
SET @@session.sql_mode=539099136/*!*/;
SET @@session.auto_increment_increment=3, @@session.auto_increment_offset=2/*!*/;
/*!\C utf8 *//*!*/;
SET @@session.character_set_client=33,@@session.collation_connection=33,@@session.collation_server=192/*!*/;
BEGIN
/*!*/;
# at 279619727
#140712 7:40:28 server id 3298096439 end_log_pos 279619795 CRC32 0x2af4292a Table_map: `jira`.`AO_E8B6CC_MESSAGE_TAG` mapped to number 169
# at 279619795
#140712 7:40:28 server id 3298096439 end_log_pos 279619854 CRC32 0x13b86571 Delete_rows: table id 169 flags: STMT_END_F
### DELETE FROM `jira`.`AO_E8B6CC_MESSAGE_TAG`
### WHERE
### @1=70097
### @2=32804
### @3='audit-id-47893'
# at 279619854
#140712 7:40:28 server id 3298096439 end_log_pos 279619885 CRC32 0x73cd7fe6 Xid = 115701629
COMMIT/*!*/;
# at 279619885
#140712 7:40:28 server id 3298096439 end_log_pos 279619933 CRC32 0xb2a3ecec GTID [commit=yes]
SET @@SESSION.GTI...

Read more...

Revision history for this message
Jaykah (jaykah) wrote :

I'm having a similar issue with Jira on Mariadb 10.0.15/Galera latest

Retrying 4th time
141110 6:45:19 [ERROR] Slave SQL: Could not execute Delete_rows event on table jiradb.propertystring; Can't find record in 'propertystring', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 115, Error_code: 1032
141110 6:45:19 [Warning] WSREP: RBR event 2 Delete_rows apply warning: 120, 15129950
141110 6:45:19 [ERROR] WSREP: Failed to apply trx: source: 7e9d4fd3-689d-11e4-95e4-fa96291be89e version: 3 local: 0 state: APPLYING flags: 1 conn_id: 24393 trx_id: 44477282 seqnos (l: 123589, g: 15129950, s: 15129949, d: 15129949, ts: 300451621960720)
141110 6:45:19 [ERROR] WSREP: Failed to apply trx 15129950 4 times
141110 6:45:19 [ERROR] WSREP: Node consistency compromized, aborting...
141110 6:45:19 [Note] WSREP: Closing send monitor...
141110 6:45:19 [Note] WSREP: Closed send monitor.
141110 6:45:19 [Note] WSREP: gcomm: terminating thread
141110 6:45:19 [Note] WSREP: gcomm: joining thread
141110 6:45:19 [Note] WSREP: gcomm: closing backend

Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-1093

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.