翼度科技»论坛 编程开发 mysql 查看内容

GreatSQL一个关于主从复制的限制描述与规避

5

主题

5

帖子

15

积分

新手上路

Rank: 1

积分
15
一、背景

分享一个在项目运维中遇到的一个主从复制限制的一个坑,项目的架构为主集群+灾备集群,每个集群为一主两从模式。主集群到灾备集群的同步为主从复制的方式,根据业务需求灾备集群需要忽略系统库跟某些配置表,所以才会触发此限制,而这个限制如果我们之前没有遇到过,那么排查起来也是相对不易的。
二、限制描述

1、主从同步出现报错
  1. greatsql> show slave status\G
  2. *************************** 1. row ***************************
  3.                Slave_IO_State: Waiting for master to send event
  4.                   Master_Host: 192.168.xxx.xxx
  5.                   Master_User: repl
  6.                   Master_Port: 3306
  7.                 Connect_Retry: 60
  8.               Master_Log_File: greatsql-bin.000990
  9.           Read_Master_Log_Pos: 92274290
  10.                Relay_Log_File: greatsql-relay.002963     -----
  11.                 Relay_Log_Pos: 701548899
  12.         Relay_Master_Log_File: greatsql-bin.000988
  13.              Slave_IO_Running: Yes
  14.             Slave_SQL_Running: No
  15.               Replicate_Do_DB:
  16.           Replicate_Ignore_DB: mysql,dbscale,dbscale_tmp,information_schema,performance_schema,sys
  17.            Replicate_Do_Table:
  18.        Replicate_Ignore_Table:
  19.       Replicate_Wild_Do_Table:
  20.   Replicate_Wild_Ignore_Table: A.ab,B.bc
  21.                    Last_Errno: 1146
  22.                    Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '9e668a93-2618-11ee-93ee-bc16954181bb:47508257' at master log greatdb-bin.000988, end_log_pos 701570116. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
  23.                  Skip_Counter: 0
  24.           Exec_Master_Log_Pos: 701548690
  25.               Relay_Log_Space: 2246320360
  26.               Until_Condition: None
  27.                Until_Log_File:
  28.                 Until_Log_Pos: 0
  29.            Master_SSL_Allowed: No
  30.            Master_SSL_CA_File:
  31.            Master_SSL_CA_Path:
  32.               Master_SSL_Cert:
  33.             Master_SSL_Cipher:
  34.                Master_SSL_Key:
  35.         Seconds_Behind_Master: NULL
  36. Master_SSL_Verify_Server_Cert: No
  37.                 Last_IO_Errno: 0
  38.                 Last_IO_Error:
  39.                Last_SQL_Errno: 1146
  40.                Last_SQL_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '9e668a93-2618-11ee-93ee-bc16954181bb:47508257' at master log greatdb-bin.000988, end_log_pos 701570116. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
  41.   Replicate_Ignore_Server_Ids:
  42.              Master_Server_Id: 1943306
  43.                   Master_UUID: 9e668a93-2618-11ee-93ee-bc16954181bb
  44.              Master_Info_File: mysql.slave_master_info
  45.                     SQL_Delay: 0
  46.           SQL_Remaining_Delay: NULL
  47.       Slave_SQL_Running_State:
  48.            Master_Retry_Count: 86400
  49.                   Master_Bind:
  50.       Last_IO_Error_Timestamp:
  51.      Last_SQL_Error_Timestamp: 230822 14:14:18
  52.                Master_SSL_Crl:
  53.            Master_SSL_Crlpath:
  54.            Retrieved_Gtid_Set: 9e668a93-2618-11ee-93ee-bc16954181bb:2-47565802
  55.             Executed_Gtid_Set: 30873cfe-8750-11ed-b56f-744aa4073024:1-270,
  56. 9e668a93-2618-11ee-93ee-bc16954181bb:1-47508256
  57.                 Auto_Position: 1
  58.          Replicate_Rewrite_DB:  
  59.                  Channel_Name:
  60.            Master_TLS_Version:
  61. 1 row in set (0.00 sec)
复制代码
根据slave status状态信息可以看出

  • 报错的GTID为:'9e668a93-2618-11ee-93ee-bc16954181bb:47508257'
  • 应用的主集群的binlog为:greatsql-bin.000988
  • 灾备集群的relay log为:greatsql-relay.002963
详细信息查看performance_schema.replication_applier_status_by_worker表
2、查看错误的详细信息
  1. greatsql> select * from performance_schema.replication_applier_status_by_worker\G
  2. *************************** 1. row ***************************
  3.          CHANNEL_NAME:
  4.             WORKER_ID: 1
  5.             THREAD_ID: NULL
  6.         SERVICE_STATE: OFF
  7. LAST_SEEN_TRANSACTION: 9e668a93-2618-11ee-93ee-bc16954181bb:47508257
  8.     LAST_ERROR_NUMBER: 1146
  9.    LAST_ERROR_MESSAGE: Worker 1 failed executing transaction
  10. '9e668a93-2618-11ee-93ee-bc16954181bb:47508257' at master log greatdb-bin.000988,
  11. end_log_pos 701570116; Error executing row event: 'Table 'abs_xxx.tmp_xxx_info' doesn't exist'
  12. LAST_ERROR_TIMESTAMP: 2023-08-22 14:14:18
复制代码
上述信息说明根据performance_schema.replication_applier_status_by_worker表中的详细错误信息可以发现为灾备集群abs_xxx.tmp_xxx_info表不存在,导致同步报错
3、问题分析

3.1、确认灾备集群中目标表是否存在
  1. greatsql> show create table abs_xxx.tmp_xxx_info;
  2. ERROR 1146 (42S02): Table 'abs_xxx.tmp_xxx_info' doesn't exist
  3. greatsql> desc abs_xxx.tmp_xxx_info;
  4. ERROR 1146 (42S02): Table 'abs_xxx.tmp_xxx_info' doesn't exist
复制代码
结论:灾备集群中目标表的确不存在
3.2、根据主从报错信息解析主集群binlog,报错的SQL
解析主集群binlog
  1. SET @@SESSION.GTID_NEXT= '9e668a93-2618-11ee-93ee-bc16954181bb:47508257'/*!*/;
  2. ……
  3. #230822 14:14:18 server id 1943306  end_log_pos 701570000         Table_map: `abs_xxx`.`tmp_xxx_info` mapped to number 1595
  4. # at 701570000
  5. #230822 14:14:18 server id 1943306  end_log_pos 701570116         Write_rows: table id 1595 flags: STMT_END_F
  6. ### INSERT INTO `abs_xxx`.`tmp_xxx_info`
  7. ### SET
  8. ###   @1=2
  9. ###   @2='自动化'
  10. ###   @3='2300121212120000'
  11. ###   @4='90000000'
  12. ###   @5='1'
  13. ###   @6='202001290231001'
  14. ###   @7='2021-01-31 00:00:00'
  15. # at 701570116
  16. #230822 14:14:18 server id 1943306  end_log_pos 701570143         Xid = 800998400
  17. COMMIT/*!*/;
  18. # at 701570143
  19. #230822 14:14:18 server id 1943306  end_log_pos 701570204         GTID        last_committed=26491        sequence_number=26521        rbr_only=yes
  20. /*!50718 SET TRANSACTION ISOLATION LEVEL READ COMMITTED*//*!*/;
复制代码
结论:根据复制的报错信息得知具体的GTID号以及主集群的binlog文件,解析binlog得知此事务为一条INSERT语句,语句中的目标表与performance_schema.replication_applier_status_by_worker表中信息一致
3.3、寻找主集群目标表binlog中是否有建表语句
在同一binlog日志中寻找建表语句
  1. SET TIMESTAMP=1692684495/*!*/;
  2. CREATE DATABASE IF NOT EXISTS `abs_xxx` /*!40100 DEFAULT CHARACTER SET utf8 COLLATE utf8_bin */
  3. /*!*/;
  4. ……
  5. use `information_schema`/*!*/;
  6. SET TIMESTAMP=1692684495/*!*/;
  7. CREATE TABLE `abs_xxx`.`tmp_xxx_info` (
  8.   `ID` int(64) NOT NULL AUTO_INCREMENT,
  9.   `CUSTOMER_NAME` varchar(256) COLLATE utf8mb4_bin DEFAULT NULL,
  10.   `CUSTOMER_NUM` varchar(256) COLLATE utf8mb4_bin DEFAULT NULL,
  11.   `PRODIST_SKU_NUM` varchar(64) COLLATE utf8mb4_bin DEFAULT NULL,
  12.   `STATUS` varchar(64) COLLATE utf8mb4_bin DEFAULT NULL,
  13.   `AGREEMENT_NUM` varchar(256) COLLATE utf8mb4_bin DEFAULT NULL,
  14.   `END_DATE` datetime DEFAULT NULL,
  15.   PRIMARY KEY (`ID`),
  16.   KEY `PK_PRODIST_SKU_NUM_AGREEMENT` (`PRODIST_SKU_NUM`) USING BTREE,
  17.   KEY `IDX_AGREEMENT_NUM` (`AGREEMENT_NUM`) USING BTREE
  18. ) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin ROW_FORMAT=DYNAMIC
  19. /*!*/;
  20. # at 475864451
复制代码
结论:在主集群的binlog日志中找到了目标表的建表语句,说明主集群执行DDL时并没有关闭binlog日志,那么继续查看在灾备集群的中继日志中是否存在DDL语句
3.4、解析灾备集群的中继日志,确认是否拉取到灾备集群
  1. #230822 14:08:15 server id 1943306  end_log_pos 475863662         GTID        last_committed=16341        sequence_number=16342        rbr_only=no
  2. SET @@SESSION.GTID_NEXT= '9e668a93-2618-11ee-93ee-bc16954181bb:47498079'/*!*/;
  3. ……
  4. use `information_schema`/*!*/;
  5. SET TIMESTAMP=1692684495/*!*/;
  6. CREATE TABLE `abs_xxx`.`tmp_xxx_info` (
  7.   `ID` int(64) NOT NULL AUTO_INCREMENT,
  8.   `CUSTOMER_NAME` varchar(256) COLLATE utf8mb4_bin DEFAULT NULL,
  9.   `CUSTOMER_NUM` varchar(256) COLLATE utf8mb4_bin DEFAULT NULL,
  10.   `PRODIST_SKU_NUM` varchar(64) COLLATE utf8mb4_bin DEFAULT NULL,
  11.   `STATUS` varchar(64) COLLATE utf8mb4_bin DEFAULT NULL,
  12.   `AGREEMENT_NUM` varchar(256) COLLATE utf8mb4_bin DEFAULT NULL,
  13.   `END_DATE` datetime DEFAULT NULL,
  14.   PRIMARY KEY (`ID`),
  15.   KEY `PK_PRODIST_SKU_NUM_AGREEMENT` (`PRODIST_SKU_NUM`) USING BTREE,
  16.   KEY `IDX_AGREEMENT_NUM` (`AGREEMENT_NUM`) USING BTREE
  17. ) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin ROW_FORMAT=DYNAMIC
  18. /*!*/;
  19. # at 475864660
  20. #230822 14:08:15 server id 1943306  end_log_pos 475864512         GTID        last_committed=16342        sequence_number=16343        rbr_only=yes
  21. /*!50718 SET TRANSACTION ISOLATION LEVEL READ COMMITTED*//*!*/;
复制代码
结论:灾备集群的中继日志中存在DDL建表语句,说明并不是IO线程出了问题
3.5、排查复制配置的忽略库表
  1. Replicate_Ignore_DB: mysql,dbscale,dbscale_tmp,information_schema,performance_schema,sys
  2. Replicate_Wild_Ignore_Table: A.ab,B.bc
复制代码
结论:忽略库表中并不包含目标表,但是根据以上解析日志发现,在主集群binlog日志中建表语句之前有个use information_schema/!/; 的语句,此库为同步忽略的系统库,因此触发了GreatSQL的规范限制,在忽略库下对未忽略进行操作Statement模式下记录语句默认不起作用 (详情:https://dev.mysql.com/doc/refman/5.7/en/replication-options-replica.html#option_mysqld_replicate-do-db)
4、解决同步报错

在灾备集群创建目标表
  1. greatsql> CREATE TABLE `abs_xxx`.`tmp_xxx_info` (
  2.   `ID` int(64) NOT NULL AUTO_INCREMENT,
  3.   `CUSTOMER_NAME` varchar(256) COLLATE utf8mb4_bin DEFAULT NULL,
  4.   `CUSTOMER_NUM` varchar(256) COLLATE utf8mb4_bin DEFAULT NULL,
  5.   `PRODIST_SKU_NUM` varchar(64) COLLATE utf8mb4_bin DEFAULT NULL,
  6.   `STATUS` varchar(64) COLLATE utf8mb4_bin DEFAULT NULL,
  7.   `AGREEMENT_NUM` varchar(256) COLLATE utf8mb4_bin DEFAULT NULL,
  8.   `END_DATE` datetime DEFAULT NULL,
  9.   PRIMARY KEY (`ID`),
  10.   KEY `PK_PRODIST_SKU_NUM_AGREEMENT` (`PRODIST_SKU_NUM`) USING BTREE,
  11.   KEY `IDX_AGREEMENT_NUM` (`AGREEMENT_NUM`) USING BTREE
  12. ) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin ROW_FORMAT=DYNAMIC;
  13. greatsql> stop slave;
  14. greatsql> start slave;
复制代码
结论:在灾备集群创建目标表后重启复制恢复成功
三、限制规避

1、第一种规避方式

执行DDL时进入目标库
  1. greatsql> use abs_cust
  2. greatsql> DDL 语句(CREATE\DROP\ALTER)
复制代码
说明:在应用连接数据库时有可能默认就是information_schema库,而此环境将系统库全部忽略,所以为了规避类似的问题,请在执行SQL语句时请先use到目标表的目标库。
2、第二种规避方式

修改主从复制配置,以下步骤为测试环境
关闭灾备集群在复制同步
  1. greatsql> stop slave;
  2. Query OK, 0 rows affected, 1 warning (0.03 sec)
复制代码
修改忽略库
  1. greatsql> change replication filter Replicate_Ignore_DB=();
复制代码
修改忽略表
  1. greatsql> change replication filter replicate_wild_ignore_table =('mysql.%','information_schema.%','sys.%','performance_schema.%');
复制代码
启动同步
  1. greatsql> start slave;
  2. Query OK, 0 rows affected, 1 warning (0.37 sec)
复制代码
测试验证
主集群:
  1. greatsql> use mysql
  2. Reading table information for completion of table and column names
  3. You can turn off this feature to get a quicker startup with -A
  4. Database changed
  5. greatsql> create table test111.test111(id int primary key);
  6. Query OK, 0 rows affected (0.06 sec)
  7. greatsql> show tables;
  8. +-------------------+
  9. | Tables_in_test111 |
  10. +-------------------+
  11. | test111           |
  12. +-------------------+
  13. 1 row in set (0.00 sec)
复制代码
灾备集群:
  1. greatsql> use test111
  2. Reading table information for completion of table and column names
  3. You can turn off this feature to get a quicker startup with -A
  4. Database changed
  5. greatsql> show tables;
  6. +-------------------+
  7. | Tables_in_test111 |
  8. +-------------------+
  9. | test111           |
  10. +-------------------+
  11. 1 row in set (0.00 sec)
复制代码
说明:复制配置中参数Replicate_Ignore_DB设置为空,将replicate_wild_ignore_table参数设置为shema_name.%的方式也可以规避类似的问题
四、特别说明


  • 在MySQL 5.7跟8.0版本也存在此限制

Enjoy GreatSQL
来源:https://www.cnblogs.com/greatsql/p/17731957.html
免责声明:由于采集信息均来自互联网,如果侵犯了您的权益,请联系我们【E-Mail:cb@itdo.tech】 我们会及时删除侵权内容,谢谢合作!

举报 回复 使用道具