I want to start off by saying that there are a lot of causes for the error in the title of this article, most of which are easy-to-fix conditions centering on misconfiguration. However, none of them applied to our situation here.
After weeks/months of troubleshooting and isolating on a customer’s (unfortunately production) servers, we believe we’ve identified a key bug in MySQL’s row-based replication (RBR) which produces errors like this on the “slave” server (we’re using GTID multi-master, so both servers are slaved to each other):
Could not execute Update_rows event on table customer1.table1; Can't find record in 'table1', Error_code: 1032; handler error HA_ERR_END_OF_FILE; the event's master log db1-bin.000001, end_log_pos 344884
Could not execute Update_rows event on table customer1.table1; Can't find record in 'table1', Error_code: 1032; handler error HA_ERR_END_OF_FILE; the event's master log db1-bin.000001, end_log_pos 9685352
Could not execute Update_rows event on table customer2.table1; Can't find record in 'table1', Error_code: 1032; handler error HA_ERR_END_OF_FILE; the event's master log db1-bin.000001, end_log_pos 28653123
Could not execute Delete_rows event on table customer3.table2; Can't find record in 'table2', Error_code: 1032; handler error HA_ERR_END_OF_FILE; the event's master log db1-bin.000001, end_log_pos 179801477
Our eventual conclusion was that these tables did not have any UNIQUE indexes, which apparently causes big problems for RBR. Once we added a UNIQUE index, the problem vanished. In our case we created an extra column specifically for this and indexed that, so it didn’t interfere with the existing data layout.
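A rough sketch of that kind of change, with made-up table and column names (not the customer’s actual schema):

    -- Sketch only: add a dedicated auto-increment column plus a UNIQUE index on it
    -- so RBR can identify each row unambiguously. Assumes the table has no existing
    -- AUTO_INCREMENT column; all names here are placeholders.
    ALTER TABLE customer1.table1
      ADD COLUMN row_uid BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
      ADD UNIQUE KEY uk_row_uid (row_uid);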
We’ve opened a MySQL bug report for this but have yet to see any movement on it:
MySQL Bugs: #78061 GTID replication halts with Error_code: 1032; handler error HA_ERR_END_OF_FILE
We also created a script which identifies all tables in all databases that meet this condition:
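The original script isn’t reproduced here, but a query along these lines against information_schema (ignoring the system schemas) will list base tables that have neither a PRIMARY KEY nor any other UNIQUE index:

    -- Sketch: a unique index (including a PRIMARY KEY) shows up in
    -- information_schema.STATISTICS with NON_UNIQUE = 0, so a table with no such
    -- row has no unique key at all.
    SELECT t.TABLE_SCHEMA, t.TABLE_NAME
    FROM information_schema.TABLES AS t
    LEFT JOIN information_schema.STATISTICS AS s
           ON s.TABLE_SCHEMA = t.TABLE_SCHEMA
          AND s.TABLE_NAME   = t.TABLE_NAME
          AND s.NON_UNIQUE   = 0
    WHERE t.TABLE_TYPE = 'BASE TABLE'
      AND t.TABLE_SCHEMA NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys')
      AND s.INDEX_NAME IS NULL;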
Hopefully this helps others become aware of this and work around this issue.
Very interesting. I just checked MySQL Bug #78061 and there’s no answer yet. Good thing you addressed and fixed your issue.
We have the same problem; however, the workaround that fixed your problem doesn’t work for us. All of our tables have primary keys set to auto-increment.
Now, we don’t know how to fix this problem anymore.
We’ve got the same problem here. After setting SQL_SLAVE_SKIP_COUNTER a couple of times and restarting the slave server, replication starts again. But this is not ideal, as we need to constantly monitor the status.
Do you think that switching from RBR to Mixed would be a good idea?
Also, has anyone tested that with MariaDB?
If you’re having to skip transactions, in my view that isn’t a workable solution. Not only will this result in corruption or differences in the data between the two peers, it also indicates there is still an underlying problem. I’ve never had a situation where we “just had to skip one transaction and everything was fine”; there will always be more. You’re also risking replication stopping randomly (possibly unnoticed if you’re not monitoring it closely) and then having to fail over to a peer with old data.
For the customer that caused this issue to come up for us, we tested RBR, Mixed and Statement extensively, and for them RBR was the best fit. That was because a lot of non-deterministic stored procedures were producing different data on each peer, which later caused replication failures due to collisions.
For your situation this might not make sense. Generally I prefer Mixed or Statement replication unless you, like in our situation above, absolutely have to use RBR.
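If you do want to experiment with the format, the relevant server variable is binlog_format; the snippet below just illustrates the mechanics and isn’t a recommendation for any particular workload:

    -- Illustrative only: binlog_format can be changed dynamically (it only affects
    -- new sessions), but the value in my.cnf is what survives a restart. Change it
    -- with care on a live replication topology.
    SET GLOBAL binlog_format = 'MIXED';   -- ROW / STATEMENT / MIXED
    SHOW VARIABLES LIKE 'binlog_format';  -- confirm the active value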