翼度科技»论坛 编程开发 mysql 查看内容

故障解析丨Clone节点导致主从故障

6

主题

6

帖子

18

积分

新手上路

Rank: 1

积分
18
1.背景概述

在一次主从复制架构中,由于主节点binlog损坏,导致从节点无法正常同步数据,只能重做从节点;因此使用MySQL 8.0.17开始提供的clone技术进行恢复,恢复后的2天都发生了主从报错数据冲突。
通过解析binlog发现,同一时刻主从节点都在执行同一条语句,因此询问业务是否在主从节点都执行了定时任务,业务回复定时任务只在主节点执行。
最后排查发现,克隆后的从节点的定时任务也会是开启的状态,因此同一时刻,主从节点同时执行定时任务,导致主从报错,最终将从节点的定时任务关闭后解决此问题。
2.问题复现

本次测试基于 GreatSQL 8.0.32-24
  1. greatsql> SELECT VERSION();
  2. +-----------+
  3. | VERSION() |
  4. +-----------+
  5. | 8.0.32-24 |
  6. +-----------+
  7. 1 row in set (0.00 sec)
复制代码
1.搭建一套主从架构


2.创建event
  1. greatsql> create database test;
  2. greatsql> use test;
  3. greatsql> CREATE TABLE `test` (
  4. `id` int(11) unsigned NOT NULL AUTO_INCREMENT COMMENT 'ID',
  5. `now` datetime DEFAULT NULL COMMENT '时间',
  6. PRIMARY KEY (`id`)
  7. );
  8. greatsql> CREATE EVENT event_test
  9. ON SCHEDULE EVERY 1 MINUTE
  10. ON COMPLETION PRESERVE
  11. ENABLE
  12. COMMENT '每隔1分钟向test表插入记录'
  13. DO INSERT INTO test VALUES(NULL, now());
复制代码
3.查看event状态

主节点,默认情况下event状态为 ENABLED
  1. greatsql> show events;
  2. +------+------------+---------+-----------+-----------+------------+----------------+----------------+---------------------+------+---------+------------+----------------------+----------------------+--------------------+
  3. | Db  | Name    | Definer | Time zone | Type    | Execute at | Interval value | Interval field | Starts        | Ends | Status  | Originator | character_set_client | collation_connection | Database Collation |
  4. +------+------------+---------+-----------+-----------+------------+----------------+----------------+---------------------+------+---------+------------+----------------------+----------------------+--------------------+
  5. | test | event_test | root@%  | SYSTEM   | RECURRING | NULL    | 1        | MINUTE     | 2023-10-12 17:11:14 | NULL | ENABLED |      1 | utf8mb4        | utf8mb4_unicode_ci  | utf8mb4_unicode_ci |
  6. +------+------------+---------+-----------+-----------+------------+----------------+----------------+---------------------+------+---------+------------+----------------------+----------------------+--------------------+
  7. 1 row in set (0.00 sec)
复制代码
从节点,默认情况下event状态为 SLAVESIDE_DISABLED
  1. greatsql> show events;
  2. +------+------------+---------+-----------+-----------+------------+----------------+----------------+---------------------+------+--------------------+------------+----------------------+----------------------+--------------------+
  3. | Db  | Name    | Definer | Time zone | Type    | Execute at | Interval value | Interval field | Starts        | Ends | Status       | Originator | character_set_client | collation_connection | Database Collation |
  4. +------+------------+---------+-----------+-----------+------------+----------------+----------------+---------------------+------+--------------------+------------+----------------------+----------------------+--------------------+
  5. | test | event_test | root@%  | SYSTEM   | RECURRING | NULL    | 1        | MINUTE     | 2023-10-12 17:11:14 | NULL | SLAVESIDE_DISABLED |      1 | utf8mb4        | utf8mb4_unicode_ci  | utf8mb4_unicode_ci |
  6. +------+------------+---------+-----------+-----------+------------+----------------+----------------+---------------------+------+--------------------+------------+----------------------+----------------------+--------------------+
  7. 1 row in set (0.00 sec)
复制代码
4.查看数据
  1. greatsql> select * from test.test;
  2. +----+---------------------+
  3. | id | now                 |
  4. +----+---------------------+
  5. |  1 | 2023-08-08 16:00:39 |
  6. |  2 | 2023-08-08 16:01:39 |
  7. |  3 | 2023-08-08 16:02:39 |
  8. +----+---------------------+
  9. 3 rows in set (0.00 sec)
复制代码
5.从节点进行克隆

# 安装克隆插件,主从节点都需要
  1. greatsql> install plugin clone soname 'mysql_clone.so';
复制代码
# 从节点进行clone
  1. greatsql> set global clone_valid_donor_list='172.17.137.162:6001';
  2. greatsql> clone instance from root@'172.17.137.162':6001 identified by 'greatsql';
复制代码
6.重新建立主从复制
  1. greatsql> change master to master_user='root',master_password='greatsql',master_host='172.17.137.162',master_port=6001,master_auto_position=1;
  2. Query OK, 0 rows affected, 7 warnings (0.04 sec)
  3. greatsql> start slave;
  4. Query OK, 0 rows affected, 1 warning (0.04 sec)
复制代码
7.查看主从状态
  1. greatsql> show slave status\G
  2. *************************** 1. row ***************************
  3.                Slave_IO_State: Waiting for source to send event
  4.                   Master_Host: 172.17.137.162
  5.                   Master_User: root
  6.                   Master_Port: 6001
  7.                 Connect_Retry: 60
  8.               Master_Log_File: binlog.000001
  9.           Read_Master_Log_Pos: 2959
  10.                Relay_Log_File: relaylog.000002
  11.                 Relay_Log_Pos: 395
  12.         Relay_Master_Log_File: binlog.000001
  13.              Slave_IO_Running: Yes
  14.             Slave_SQL_Running: No
  15.               Replicate_Do_DB:
  16.           Replicate_Ignore_DB:
  17.            Replicate_Do_Table:
  18.        Replicate_Ignore_Table:
  19.       Replicate_Wild_Do_Table:
  20.   Replicate_Wild_Ignore_Table:
  21.                    Last_Errno: 1062
  22.                    Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction 'e8bf88f9-2acd-11ee-a98a-00163e605c74:8' at master log binlog.000001, end_log_pos 2606. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
  23.                  Skip_Counter: 0
  24.           Exec_Master_Log_Pos: 2307
  25.               Relay_Log_Space: 1242
  26.               Until_Condition: None
  27.                Until_Log_File:
  28. greatsql> select * from performance_schema.replication_applier_status_by_worker limit 1\G
  29. *************************** 1. row ***************************
  30.                                            CHANNEL_NAME:
  31.                                               WORKER_ID: 1
  32.                                               THREAD_ID: NULL
  33.                                           SERVICE_STATE: OFF
  34.                                       LAST_ERROR_NUMBER: 1062
  35.                                      LAST_ERROR_MESSAGE: Worker 1 failed executing transaction 'e8bf88f9-2acd-11ee-a98a-00163e605c74:8' at master log binlog.000001, end_log_pos 2606; Could not execute Write_rows event on table test.test; Duplicate entry '5' for key 'test.PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 2606
  36.                                    LAST_ERROR_TIMESTAMP: 2023-08-08 16:03:39.033240
  37.                                LAST_APPLIED_TRANSACTION:
  38.      LAST_APPLIED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP: 0000-00-00 00:00:00.000000
  39.     LAST_APPLIED_TRANSACTION_IMMEDIATE_COMMIT_TIMESTAMP: 0000-00-00 00:00:00.000000
  40.          LAST_APPLIED_TRANSACTION_START_APPLY_TIMESTAMP: 0000-00-00 00:00:00.000000
  41.            LAST_APPLIED_TRANSACTION_END_APPLY_TIMESTAMP: 0000-00-00 00:00:00.000000
  42.                                    APPLYING_TRANSACTION: e8bf88f9-2acd-11ee-a98a-00163e605c74:8
  43.          APPLYING_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP: 2023-08-08 16:02:45.795753
  44.         APPLYING_TRANSACTION_IMMEDIATE_COMMIT_TIMESTAMP: 2023-08-08 16:02:45.795753
  45.              APPLYING_TRANSACTION_START_APPLY_TIMESTAMP: 2023-08-08 16:03:39.032510
  46.                  LAST_APPLIED_TRANSACTION_RETRIES_COUNT: 0
  47.    LAST_APPLIED_TRANSACTION_LAST_TRANSIENT_ERROR_NUMBER: 0
  48.   LAST_APPLIED_TRANSACTION_LAST_TRANSIENT_ERROR_MESSAGE:
  49. LAST_APPLIED_TRANSACTION_LAST_TRANSIENT_ERROR_TIMESTAMP: 0000-00-00 00:00:00.000000
  50.                      APPLYING_TRANSACTION_RETRIES_COUNT: 0
  51.        APPLYING_TRANSACTION_LAST_TRANSIENT_ERROR_NUMBER: 0
  52.       APPLYING_TRANSACTION_LAST_TRANSIENT_ERROR_MESSAGE:
  53.     APPLYING_TRANSACTION_LAST_TRANSIENT_ERROR_TIMESTAMP: 0000-00-00 00:00:00.000000
  54. 1 row in set (0.00 sec)
复制代码
可以看到从节点报错发生了主键冲突。
8.查看从节点定时任务状态

当前从节点定时任务状态为 ENABLED
  1. greatsql> show events;
  2. +------+------------+---------+-----------+-----------+------------+----------------+----------------+---------------------+------+---------+------------+----------------------+----------------------+--------------------+
  3. | Db  | Name    | Definer | Time zone | Type    | Execute at | Interval value | Interval field | Starts        | Ends | Status  | Originator | character_set_client | collation_connection | Database Collation |
  4. +------+------------+---------+-----------+-----------+------------+----------------+----------------+---------------------+------+---------+------------+----------------------+----------------------+--------------------+
  5. | test | event_test | root@%  | SYSTEM   | RECURRING | NULL    | 1        | MINUTE     | 2023-08-08 15:58:45 | NULL | ENABLED |      1 | utf8mb4        | utf8mb4_unicode_ci  | utf8mb4_unicode_ci |
  6. +------+------------+---------+-----------+-----------+------------+----------------+----------------+---------------------+------+---------+------------+---------------------+----------------------+--------------------+
  7. 1 row in set (0.00 sec)
复制代码
可以看到由于从节点的定时任务也执行了,从节点写入数据,导致主键冲突。
9.故障解决
  1. greatsql> alter event event_test DISABLE;
  2. Query OK, 0 rows affected (0.01 sec)
复制代码
关闭从节点的定时任务event,然后跳过主键冲突的报错,最后重新启动主从复制。
3.总结

1.如果主库有定时任务,通过clone的方式搭建从库,在从库恢复之后需要关闭定时任务,避免主从同时执行定时任务导致主从故障。
2.克隆时,如果捐赠节点有主从复制信息,则克隆后的接收节点也会克隆此复制信息,并在克隆完成自动重启实例后,自动启动复制;避免此问题可以在接收节点的配置文件中增加 skip-slave-start,避免节点重启后自动启动复制。

Enjoy GreatSQL
来源:https://www.cnblogs.com/greatsql/p/17786467.html
免责声明:由于采集信息均来自互联网,如果侵犯了您的权益,请联系我们【E-Mail:cb@itdo.tech】 我们会及时删除侵权内容,谢谢合作!

举报 回复 使用道具