
Tracing Slow DROP PARTITION in GreatSQL


Background

A business system drops one partition of a day-partitioned table every day in the early morning. The operation has gradually slowed down, from about 30 seconds at first to over a minute, disrupting normal transaction processing. We reproduced the slow DROP PARTITION in a test environment. The tests below were run on GreatSQL 8.0.25-17; the same problem exists in upstream MySQL.
Test environment
```
$ mysql -h127.0.0.1 -P8025 -uroot -p
mysql: [Warning] Using a password on the command line interface can be insecure.
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 10
Server version: 8.0.25-17 GreatSQL, Release 17, Revision 4733775f703

Copyright (c) 2000, 2021, Oracle and/or its affiliates.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

greatsql> select version();
+-----------+
| version() |
+-----------+
| 8.0.25-17 |
+-----------+
1 row in set (0.00 sec)

greatsql> show variables like 'autocommit';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| autocommit    | ON    |
+---------------+-------+
1 row in set (0.01 sec)

greatsql> show variables like 'innodb_buffer_pool_size';
+-------------------------+------------+
| Variable_name           | Value      |
+-------------------------+------------+
| innodb_buffer_pool_size | 4294967296 |
+-------------------------+------------+
1 row in set (0.01 sec)

greatsql> show variables like 'innodb_flush_log_at_trx_commit';
+--------------------------------+-------+
| Variable_name                  | Value |
+--------------------------------+-------+
| innodb_flush_log_at_trx_commit | 1     |
+--------------------------------+-------+
1 row in set (0.00 sec)

greatsql> show variables like 'sync_binlog';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| sync_binlog   | 1     |
+---------------+-------+
1 row in set (0.00 sec)
```
Create the table
```sql
CREATE TABLE `t_partition` (
  `id` bigint NOT NULL AUTO_INCREMENT,
  `ua` varchar(100) DEFAULT NULL,
  `start_time` datetime NOT NULL,
  PRIMARY KEY (`id`,`start_time`)
) PARTITION BY RANGE (to_days(`start_time`))
(PARTITION P20230129 VALUES LESS THAN (738915) ENGINE = InnoDB,
PARTITION P20230130 VALUES LESS THAN (738916) ENGINE = InnoDB,
PARTITION P20230131 VALUES LESS THAN (738917) ENGINE = InnoDB,
PARTITION P20230201 VALUES LESS THAN (738918) ENGINE = InnoDB,
PARTITION P20230202 VALUES LESS THAN (738919) ENGINE = InnoDB,
PARTITION P20230203 VALUES LESS THAN (738920) ENGINE = InnoDB,
PARTITION P20230204 VALUES LESS THAN (738921) ENGINE = InnoDB,
PARTITION P20230302 VALUES LESS THAN (738947) ENGINE = InnoDB,
PARTITION P20230303 VALUES LESS THAN (738948) ENGINE = InnoDB,
PARTITION P20230304 VALUES LESS THAN (738949) ENGINE = InnoDB,
PARTITION P20230305 VALUES LESS THAN (738950) ENGINE = InnoDB,
PARTITION P20230306 VALUES LESS THAN (738951) ENGINE = InnoDB,
PARTITION P20230307 VALUES LESS THAN (738952) ENGINE = InnoDB,
PARTITION P20230308 VALUES LESS THAN (738953) ENGINE = InnoDB,
PARTITION P20230309 VALUES LESS THAN (738954) ENGINE = InnoDB,
PARTITION P20230310 VALUES LESS THAN (738955) ENGINE = InnoDB,
PARTITION P20230311 VALUES LESS THAN (738956) ENGINE = InnoDB,
PARTITION P20230312 VALUES LESS THAN (738957) ENGINE = InnoDB,
PARTITION P20230313 VALUES LESS THAN (738958) ENGINE = InnoDB,
PARTITION P20230314 VALUES LESS THAN (738959) ENGINE = InnoDB,
PARTITION P20230315 VALUES LESS THAN (738960) ENGINE = InnoDB,
PARTITION P20230316 VALUES LESS THAN (738961) ENGINE = InnoDB,
PARTITION P20230317 VALUES LESS THAN (738962) ENGINE = InnoDB,
PARTITION P20230318 VALUES LESS THAN (738963) ENGINE = InnoDB,
PARTITION P20230319 VALUES LESS THAN (738964) ENGINE = InnoDB,
PARTITION P20230320 VALUES LESS THAN (738965) ENGINE = InnoDB,
PARTITION P20230321 VALUES LESS THAN (738966) ENGINE = InnoDB,
PARTITION P20230322 VALUES LESS THAN (738967) ENGINE = InnoDB,
PARTITION P20230323 VALUES LESS THAN (738968) ENGINE = InnoDB,
PARTITION P20230324 VALUES LESS THAN (738969) ENGINE = InnoDB,
PARTITION P20230325 VALUES LESS THAN (738970) ENGINE = InnoDB,
PARTITION P20230326 VALUES LESS THAN (738971) ENGINE = InnoDB,
PARTITION P20230327 VALUES LESS THAN (738972) ENGINE = InnoDB,
PARTITION P20230328 VALUES LESS THAN (738973) ENGINE = InnoDB,
PARTITION p20230329 VALUES LESS THAN (738974) ENGINE = InnoDB,
PARTITION p20230330 VALUES LESS THAN (738975) ENGINE = InnoDB,
PARTITION p20230331 VALUES LESS THAN (738976) ENGINE = InnoDB,
PARTITION p20230401 VALUES LESS THAN (738977) ENGINE = InnoDB,
PARTITION p20230402 VALUES LESS THAN (738978) ENGINE = InnoDB,
PARTITION p20230403 VALUES LESS THAN (738979) ENGINE = InnoDB);
```
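The `VALUES LESS THAN` bounds are `TO_DAYS()` values. For dates in the Gregorian era, `TO_DAYS()` can be reproduced from Python's `date.toordinal()` plus a fixed offset of 365 (a quick sanity check against the bounds above; the offset reflects that `toordinal()` counts from 0001-01-01 while `TO_DAYS()` counts from year 0):

```python
from datetime import date

def to_days(d: date) -> int:
    """MySQL TO_DAYS() for Gregorian-era dates: days since year 0."""
    return d.toordinal() + 365

# Partition P20230310 is VALUES LESS THAN (738955), so rows for
# 2023-03-10 must fold to 738954 (and land in that partition):
assert to_days(date(2023, 3, 10)) == 738954
assert to_days(date(2023, 1, 29)) == 738914  # P20230129: LESS THAN (738915)
```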
Insert data
```
greatsql> insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),'2023-03-10';
Query OK, 1 row affected (0.01 sec)
Records: 1  Duplicates: 0  Warnings: 0

greatsql> insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),'2023-03-10';
Query OK, 1 row affected (0.00 sec)
Records: 1  Duplicates: 0  Warnings: 0

greatsql> insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),'2023-03-10';
Query OK, 1 row affected (0.01 sec)
Records: 1  Duplicates: 0  Warnings: 0

greatsql> insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),start_time from t_partition;
Query OK, 3 rows affected (0.01 sec)
Records: 3  Duplicates: 0  Warnings: 0

greatsql> insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),start_time from t_partition;
Query OK, 6 rows affected (0.00 sec)
Records: 6  Duplicates: 0  Warnings: 0

...

greatsql> insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),start_time from t_partition;
Query OK, 3145728 rows affected (35.68 sec)
Records: 3145728  Duplicates: 0  Warnings: 0

greatsql> insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),start_time from t_partition;
Query OK, 6291456 rows affected (1 min 11.51 sec)
Records: 6291456  Duplicates: 0  Warnings: 0

greatsql> insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),start_time from t_partition;
Query OK, 12582912 rows affected (2 min 26.74 sec)
Records: 12582912  Duplicates: 0  Warnings: 0

greatsql> select count(*) from t_partition;
+----------+
| count(*) |
+----------+
| 25165824 |
+----------+
1 row in set (0.50 sec)

greatsql> select count(*) from t_partition partition(P20230310);
+----------+
| count(*) |
+----------+
| 25165824 |
+----------+
```
Populate the other partitions
```
insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),'2023-03-11' from t_partition partition(P20230310);
insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),'2023-03-12' from t_partition partition(P20230310);
insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),'2023-03-13' from t_partition partition(P20230310);
insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),'2023-03-14' from t_partition partition(P20230310);
insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),'2023-03-15' from t_partition partition(P20230310);
insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),'2023-03-16' from t_partition partition(P20230310);
insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),'2023-03-17' from t_partition partition(P20230310);
insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),'2023-03-18' from t_partition partition(P20230310);
insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),'2023-03-19' from t_partition partition(P20230310);
insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),'2023-03-20' from t_partition partition(P20230310);
insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),'2023-03-21' from t_partition partition(P20230310);
...

greatsql> insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),'2023-03-19' from t_partition partition(P20230310);
Query OK, 25165824 rows affected (5 min 17.92 sec)
Records: 25165824  Duplicates: 0  Warnings: 0

greatsql> insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),'2023-03-20' from t_partition partition(P20230310);
Query OK, 25165824 rows affected (5 min 19.56 sec)
Records: 25165824  Duplicates: 0  Warnings: 0

greatsql> insert into t_partition(ua,start_time) select substring(md5(rand()),1,20),'2023-03-21' from t_partition partition(P20230310);
Query OK, 25165824 rows affected (5 min 27.27 sec)
Records: 25165824  Duplicates: 0  Warnings: 0
```
Update data
```
greatsql> update t_partition set ua=concat(ua,'abc') where start_time='2023-03-19';
Query OK, 25165824 rows affected (12 min 55.53 sec)
Rows matched: 25165824  Changed: 25165824  Warnings: 0
```
Drop partitions
```
greatsql> alter table t_partition drop partition P20230311;
Query OK, 0 rows affected (13.68 sec)
Records: 0  Duplicates: 0  Warnings: 0

greatsql> alter table t_partition drop partition P20230312;
Query OK, 0 rows affected (0.07 sec)
Records: 0  Duplicates: 0  Warnings: 0
```
The two partitions hold the same amount of data, yet dropping the first one takes far longer. Profiling the server with perf during the slow drop shows where the time goes:
```
$ perf record -ag -p 11222 -o /mysqldb/perf_drop_part_mysql2.data
Warning:
PID/TID switch overriding SYSTEM
^C[ perf record: Woken up 41 times to write data ]
[ perf record: Captured and wrote 10.610 MB /mysqldb/perf_drop_part_mysql2.data (54351 samples) ]

  Children      Self  Command  Shared Object        Symbol
+   99.17%     0.00%  mysqld   libpthread-2.17.so   [.] start_thread
+   99.17%     0.00%  mysqld   mysqld               [.] pfs_spawn_thread
+   99.17%     0.00%  mysqld   mysqld               [.] handle_connection
+   99.17%     0.00%  mysqld   mysqld               [.] do_command
+   99.17%     0.00%  mysqld   mysqld               [.] dispatch_command
+   99.17%     0.00%  mysqld   mysqld               [.] dispatch_sql_command
+   99.16%     0.00%  mysqld   mysqld               [.] mysql_execute_command
+   99.16%     0.00%  mysqld   mysqld               [.] Sql_cmd_alter_table::execute
+   99.16%     0.00%  mysqld   mysqld               [.] mysql_alter_table
+   99.09%     0.00%  mysqld   mysqld               [.] mysql_inplace_alter_table
+   98.56%     0.00%  mysqld   mysqld               [.] ha_innopart::commit_inplace_alter_partition
+   98.54%     0.00%  mysqld   mysqld               [.] alter_parts::prepare_or_commit_for_new
+   98.54%     0.00%  mysqld   mysqld               [.] alter_part_normal::try_commit
+   98.53%     0.00%  mysqld   mysqld               [.] btr_drop_ahi_for_table
+   98.52%     1.30%  mysqld   mysqld               [.] btr_drop_next_batch
+   97.20%     0.03%  mysqld   mysqld               [.] btr_search_drop_page_hash_when_freed
+   96.34%     2.03%  mysqld   mysqld               [.] btr_search_drop_page_hash_index
+   86.52%    52.25%  mysqld   mysqld               [.] ha_remove_all_nodes_to_page
+   34.21%    34.15%  mysqld   mysqld               [.] ha_delete_hash_node
+    4.27%     2.68%  mysqld   mysqld               [.] rec_get_offsets_func
+    2.11%     2.10%  mysqld   mysqld               [.] ut_fold_binary
+    1.58%     1.58%  mysqld   mysqld               [.] rec_init_offsets
+    1.30%     1.30%  mysqld   mysqld               [.] rec_fold
+    0.57%     0.03%  mysqld   mysqld               [.] buf_page_get_gen
+    0.57%     0.00%  mysqld   mysqld               [.] execute_native_thread_routine
+    0.56%     0.01%  mysqld   [kernel.kallsyms]    [k] system_call_fastpath
```
The call stacks are dominated by adaptive hash index (AHI) related functions.
Restart with the adaptive hash index disabled
```
greatsql> show variables like '%hash%';
+----------------------------------+-------+
| Variable_name                    | Value |
+----------------------------------+-------+
| innodb_adaptive_hash_index       | OFF   |
| innodb_adaptive_hash_index_parts | 8     |
+----------------------------------+-------+
```
Disable the adaptive hash index in the configuration file, restart, and repeat the whole procedure above:
```
greatsql> alter table t_partition drop partition P20230311;
Query OK, 0 rows affected (0.08 sec)
Records: 0  Duplicates: 0  Warnings: 0

greatsql> alter table t_partition drop partition P20230312;
Query OK, 0 rows affected (0.08 sec)
Records: 0  Duplicates: 0  Warnings: 0
```
With the adaptive hash index disabled, the same workload drops the first partition much faster, and every partition now takes essentially the same time to drop.
Note: innodb_adaptive_hash_index is a global variable and can be changed dynamically, without restarting the database.
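Since the variable is dynamic, the AHI can also be turned off without a restart (a sketch; as noted in the mitigation section below, verify query performance first, since some read workloads benefit from the AHI):

```sql
-- Takes effect immediately; lost on restart unless also set in my.cnf
SET GLOBAL innodb_adaptive_hash_index = OFF;

-- Or persist across restarts (MySQL 8.0+)
SET PERSIST innodb_adaptive_hash_index = OFF;
```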
Summary of test results

| Configuration | First DROP PARTITION | Second DROP PARTITION |
|---|---|---|
| innodb_buffer_pool_instances=8 & innodb_adaptive_hash_index=on | 13.68 sec | 0.07 sec |
| innodb_buffer_pool_instances=8 & innodb_adaptive_hash_index=off | 0.08 sec | 0.08 sec |

Source code analysis
```cpp
// btr_drop_ahi_for_table
void btr_drop_ahi_for_table(dict_table_t *table) {
  const ulint len = UT_LIST_GET_LEN(table->indexes);

  if (len == 0) {
    return;
  }

  const dict_index_t *indexes[MAX_INDEXES];
  const page_size_t page_size(dict_table_page_size(table));

  for (;;) {
    ulint ref_count = 0;
    const dict_index_t **end = indexes;

    for (dict_index_t *index = table->first_index(); index != nullptr;
         index = index->next()) {
      if (ulint n_refs = index->search_info->ref_count) {
        ut_ad(!index->disable_ahi);
        ut_ad(index->is_committed());
        ref_count += n_refs;
        ut_ad(indexes + len > end);
        *end++ = index;
      }
    }

    ut_ad((indexes == end) == (ref_count == 0));

    if (ref_count == 0) {
      return;
    }

    btr_drop_next_batch(page_size, indexes, end);  // breakpoint

    std::this_thread::yield();
  }
}

// btr_drop_next_batch
static void btr_drop_next_batch(const page_size_t &page_size,
                                const dict_index_t **first,
                                const dict_index_t **last) {
  static constexpr unsigned batch_size = 1024;
  std::vector<page_id_t> to_drop;
  to_drop.reserve(batch_size);

  for (ulint i = 0; i < srv_buf_pool_instances; ++i) {
    to_drop.clear();
    buf_pool_t *buf_pool = buf_pool_from_array(i);
    mutex_enter(&buf_pool->LRU_list_mutex);
    const buf_page_t *prev;

    for (const buf_page_t *bpage = UT_LIST_GET_LAST(buf_pool->LRU);
         bpage != nullptr; bpage = prev) {
      prev = UT_LIST_GET_PREV(LRU, bpage);
      ut_a(buf_page_in_file(bpage));

      if (buf_page_get_state(bpage) != BUF_BLOCK_FILE_PAGE ||
          bpage->buf_fix_count > 0) {
        continue;
      }

      const dict_index_t *block_index =
          reinterpret_cast<const buf_block_t *>(bpage)->ahi.index;

      /* index == nullptr means the page is no longer in AHI, so no need to
      attempt freeing it */
      if (block_index == nullptr) {
        continue;
      }
      /* pages IO fixed for read have index == nullptr */
      ut_ad(!bpage->was_io_fix_read());

      if (std::find(first, last, block_index) != last) {
        to_drop.emplace_back(bpage->id);
        if (to_drop.size() == batch_size) {  // batch_size = 1024
          break;
        }
      }
    }

    mutex_exit(&buf_pool->LRU_list_mutex);

    for (const page_id_t &page_id : to_drop) {
      btr_search_drop_page_hash_when_freed(page_id, page_size);  // breakpoint
    }
  }
}

// btr_search_drop_page_hash_when_freed
void btr_search_drop_page_hash_when_freed(const page_id_t &page_id,
                                          const page_size_t &page_size) {
  buf_block_t *block;
  mtr_t mtr;

  ut_d(export_vars.innodb_ahi_drop_lookups++);

  mtr_start(&mtr);

  /* If the caller has a latch on the page, then the caller must
  have a x-latch on the page and it must have already dropped
  the hash index for the page. Because of the x-latch that we
  are possibly holding, we cannot s-latch the page, but must
  (recursively) x-latch it, even though we are only reading. */

  block = buf_page_get_gen(page_id, page_size, RW_X_LATCH, nullptr,
                           Page_fetch::PEEK_IF_IN_POOL, UT_LOCATION_HERE, &mtr);

  if (block) {
    /* If AHI is still valid, page can't be in free state.
    AHI is dropped when page is freed. */
    ut_ad(!block->page.file_page_was_freed);

    buf_block_dbg_add_level(block, SYNC_TREE_NODE_FROM_HASH);

    dict_index_t *index = block->ahi.index;
    if (index != nullptr) {
      /* In all our callers, the table handle should
      be open, or we should be in the process of
      dropping the table (preventing eviction). */
      ut_ad(index->table->n_ref_count > 0 || dict_sys_mutex_own());
      btr_search_drop_page_hash_index(block);
    }
  }

  mtr_commit(&mtr);
}
```
Function logic:
Both DROP PARTITION and ADD PARTITION purge the AHI entries of every partition. The most expensive path is:

- For each partition, call btr_drop_ahi_for_table:
  - Loop over every index of the table (or partition); if no index has any AHI references, return.
  - Loop over every InnoDB buffer pool instance, walking each page along its LRU list:
    - If the page holds AHI entries and they belong to an index of the table (or partition) being dropped, add the page to the to_drop vector.
    - If the page holds no AHI entries, skip it.
    - Once the to_drop vector holds 1024 pages, flush the batch: for each page, call btr_search_drop_page_hash_index:
      - Compute the AHI slot the page belongs to, locating the corresponding hash_table_t.
      - Loop over every row in the page:
        - Loop over the indexed fields of the access pattern, computing fold values into a fold[] array. Field offsets are obtained here via rec_get_offsets, a CPU-heavy function.
      - Loop over the fold[] array (one fold per row), calling ha_remove_all_nodes_to_page, a CPU-heavy function:
        - ha_chain_get_first locates the hash cell from the fold value.
        - Walk the chain in that cell; whenever a record's address lies on the page being dropped, call ha_delete_hash_node, another CPU-heavy function, to maintain the chain and the hash structure.
- After each batch of 1024 pages, the thread yields the CPU to avoid monopolizing it, then resumes.
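The batching logic above can be sketched as a toy Python model (a simplified simulation with invented Page/index names, not the server's actual data structures; the real code also re-checks index reference counts between passes):

```python
from dataclasses import dataclass
from typing import List, Optional, Set

BATCH_SIZE = 1024  # mirrors batch_size in btr_drop_next_batch

@dataclass
class Page:
    page_id: int
    ahi_index: Optional[str]  # index hashed on this page; None if not in AHI

def drop_ahi_for_indexes(buffer_pools: List[List[Page]],
                         targets: Set[str]) -> int:
    """Model of btr_drop_ahi_for_table -> btr_drop_next_batch: for each
    buffer pool instance, walk the LRU list from the tail, collect up to
    BATCH_SIZE pages whose AHI entries belong to a target index, purge the
    batch, and repeat until no matching page remains. Cost is proportional
    to LRU length times the number of passes, which is why the first
    DROP PARTITION on a hot table is slow."""
    purged = 0
    for lru in buffer_pools:              # one LRU list per instance
        while True:
            to_drop = []
            for page in reversed(lru):    # scan from the LRU tail
                if page.ahi_index is None:
                    continue              # page not in the AHI: skip
                if page.ahi_index in targets:
                    to_drop.append(page)
                    if len(to_drop) == BATCH_SIZE:
                        break             # purge this batch, then rescan
            if not to_drop:
                break                     # nothing left for the targets
            for page in to_drop:          # btr_search_drop_page_hash_index
                page.ahi_index = None
                purged += 1
    return purged
```

Running the model twice mirrors the observed timings: the first call does all the purging, the second finds nothing left to do.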
Both DROP PARTITION and ADD PARTITION loop over every partition and call btr_drop_ahi_for_table / btr_search_drop_page_hash_index to purge the AHI entries of all partitions and their indexes. Dropping the first partition purges the AHI for the whole table, so when the second partition is dropped the buffer pool no longer contains any AHI entries for it, and the operation completes quickly.
Mitigation

Given the cause above, online DROP PARTITION operations should follow these rules:
1. Disable the AHI, giving up its query speedup. Test the workload in a test environment first to confirm that business performance is not affected.
2. Dropping the first partition of a table purges that table's pages from every buffer pool instance, which takes a long time; dropping the remaining partitions is then cheap. Consider replacing the daily single-partition drop with a weekly or monthly batch drop, executed during off-peak hours.
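A batched cleanup can also drop several partitions in one statement, so the expensive AHI purge is paid once per run rather than once per partition (a sketch using the partition names from the table above):

```sql
-- One ALTER pays the AHI purge once for the whole batch
ALTER TABLE t_partition
    DROP PARTITION P20230311, P20230312, P20230313;
```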

Enjoy GreatSQL
来源:https://www.cnblogs.com/greatsql/p/17462356.html
