Semi-Join Subquery优化策略

刘思雨 · 发表于 2023-5-6 19:30:08

Semi-Join Subquery优化策略

Semi-Join Subquery(半连接子查询)：对应IN或EXISTS子查询，仅需要检查"外表记录"在"子查询结果集"中是否存在匹配记录，不需要计算"子查询结果集"中记录匹配次数，也不需要返回"子查询结果集"中匹配记录内容
在MariaDB(MySQL)中，常用优化Semi-Join(半连接)的策略有：

First Match
Table Pullout
Semi-join Materialization
Loose Scan
Duplicate Weedout

First Match策略

当循环"外部查询结果集"的每条记录去"子查询中"确认"是否匹配"时，只需要找到第一条匹配记录(First Match)既可跳出子查询。
如下面查询：

SELECT * FROM Country
WHERE Country.code IN (
SELECT City.Country
FROM City
WHERE City.Population > 1*1000*1000
)
AND Country.continent='Europe'

复制代码

如果不使用First Match策略，当处理到Country表上满足" Country.continent='Europe' "条件的德国(Deu)记录时，会扫描City表上满足" City.Population > 110001000 AND City.Country='DEU' "的所有记录，再根据匹配记录总数返回"是否匹配"结果：

如果使用First Match策略，当处理到Country表上满足" Country.continent='Europe' "条件的德国(Deu)记录时，会扫描City表上满足" City.Population > 110001000 AND City.Country='DEU' "的第一条记录"Berlin"后，立即返回"是否匹配"结果：

在MariaDB上使用First Match策略的查询的执行计划为：

MariaDB [world]> explain select * from Country where Country.code IN (select City.Country from City where City.Population > 1*1000*1000) and Country.continent='Europe';
+----+-------------+---------+------+--------------------+-----------+---------+--------------------+------+----------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------+------+--------------------+-----------+---------+--------------------+------+----------------------------------+
| 1 | PRIMARY | Country | ref | PRIMARY,continent | continent | 17 | const | 60 | Using index condition |
| 1 | PRIMARY | City | ref | Population,Country | Country | 3 | world.Country.Code | 18 | Using where; FirstMatch(Country) |
+----+-------------+---------+------+--------------------+-----------+---------+--------------------+------+----------------------------------+
2 rows in set (0.00 sec)

复制代码

MariaDB的执行计划中会有明显的FirstMatch标识。
在MySQL上使用First Match策略的查询的执行计划为：

MySQL [world]> explain select * from Country where Country.code IN (select City.Country from City where City.Population > 1*1000*1000) and Country.continent='Europe';
+----+--------------------+---------+----------------+--------------------+-----------+---------+-------+------+------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+---------+----------------+--------------------+-----------+---------+-------+------+------------------------------------+
| 1 | PRIMARY | Country | ref | continent | continent | 17 | const | 60 | Using index condition; Using where |
| 2 | DEPENDENT SUBQUERY | City | index_subquery | Population,Country | Country | 3 | func | 18 | Using where |
+----+--------------------+---------+----------------+--------------------+-----------+---------+-------+------+------------------------------------+
2 rows in set (0.01 sec)

复制代码

MariaDB的执行计划中仅显示为依赖子查询(DEPENDENT SUBQUERY)
First Match策略和将IN子查询转换为EXISTS依赖子查询很相似，但两者还是存在明显差异，并非所有EXISTS操作都能使用First Match策略，如子查询中使用GROUP BY相关的聚合函数时，需要先完成GROUP BY操作才能确认"是否匹配"。
Table Pullout策略

当子查询的查询列表项只有主键或唯一索引键时，能推算出"子查询结果集"不存在重复记录，因此可以将子查询改为关联查询，即将子查询中的表上提到关联查询。
对于查询：

SELECT *
FROM City
WHERE City.Country IN (
SELECT Country.Code
FROM Country
WHERE Country.Population < 100*1000
);

复制代码

在MariaDB 5.2 和MySQL 5.6版本及之前版本上，执行计划为：

MySQL [world]> explain select * from City where City.Country in (select Country.Code from Country where Country.Population < 100*1000);
+----+--------------------+---------+-----------------+--------------------+---------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+---------+-----------------+--------------------+---------+---------+------+------+-------------+
| 1 | PRIMARY | City | ALL | NULL | NULL | NULL | NULL | 4079 | Using where |
| 2 | DEPENDENT SUBQUERY | Country | unique_subquery | PRIMARY,Population | PRIMARY | 3 | func | 1 | Using where |
+----+--------------------+---------+-----------------+--------------------+---------+---------+------+------+-------------+
2 rows in set (0.00 sec)

复制代码

如果Country.Code是主键或唯一索引，则可以将SQL改写为：

SELECT City.*
FROM City
INNER JOIN Country
ON City.Country=Country.Code
WHERE Country.Population < 100*1000;

复制代码

改为关联查询后，可以根据两张关联表的统计数据来选择驱动表和被驱动表，因此在MariaDB 5.3或MySQL 5.7版本，执行计划为：

MariaDB [world]> explain select * from City where City.Country in (select Country.Code from Country where Country.Population < 100*1000);
+----+-------------+---------+-------+--------------------+------------+---------+--------------------+------+-----------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------+-------+--------------------+------------+---------+--------------------+------+-----------------------+
| 1 | PRIMARY | Country | range | PRIMARY,Population | Population | 4 | NULL | 37 | Using index condition |
| 1 | PRIMARY | City | ref | Country | Country | 3 | world.Country.Code | 18 | |
+----+-------------+---------+-------+--------------------+------------+---------+--------------------+------+-----------------------+
2 rows in set (0.00 sec)

复制代码

Materialization策略

在使用Table Pullout策略时，需要能明确推算出"子查询结果集"不存在重复记录时才能将"子查询"改为"关联查询"，如果将"子查询结果集"通过临时表去重固化后消除重复记录，则可以将子查询转换为"关联查询"，即Materialization策略。
如对于查询：

SELECT * FROM Country
WHERE Country.code IN (
SELECT City.Country
FROM City
WHERE City.Population > 1*1000*1000
)
AND Country.continent='Europe'

复制代码

在转换为"关联查询"后，按照"关联查询"中临时表是否为"驱动表"可以将Semi-join Materialization策略细分为：

Materialization/scan 策略，将临时表作为"驱动表"，遍历临时表中每条记录去另外关联表中查找匹配记录。
Materialization/lookup 策略，将临时表作为"被驱动表"，遍历另外的关联表在临时表中查询匹配记录。

使用Materialization/scan 策略时，MariaDB 查询计划为：

MariaDB [world]> explain select * from Country where Country.code IN (select City.Country from City where City.Population > 7*1000*1000);
+----+--------------+-------------+--------+--------------------+------------+---------+--------------------+------+-----------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------+-------------+--------+--------------------+------------+---------+--------------------+------+-----------------------+
| 1 | PRIMARY | <subquery2> | ALL | distinct_key | NULL | NULL | NULL | 15 | |
| 1 | PRIMARY | Country | eq_ref | PRIMARY | PRIMARY | 3 | world.City.Country | 1 | |
| 2 | MATERIALIZED | City | range | Population,Country | Population | 4 | NULL | 15 | Using index condition |
+----+--------------+-------------+--------+--------------------+------------+---------+--------------------+------+-----------------------+
3 rows in set (0.01 sec)

复制代码

使用Materialization/lookup 策略时，MariaDB 查询计划为：

MariaDB [world]> explain select * from Country where Country.code IN (select City.Country from City where City.Population > 1*1000*1000) ;
+----+--------------+-------------+--------+--------------------+--------------+---------+------+------+-----------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------+-------------+--------+--------------------+--------------+---------+------+------+-----------------------+
| 1 | PRIMARY | Country | ALL | PRIMARY | NULL | NULL | NULL | 239 | |
| 1 | PRIMARY | <subquery2> | eq_ref | distinct_key | distinct_key | 3 | func | 1 | |
| 2 | MATERIALIZED | City | range | Population,Country | Population | 4 | NULL | 238 | Using index condition |
+----+--------------+-------------+--------+--------------------+--------------+---------+------+------+-----------------------+
3 rows in set (0.00 sec)

复制代码

Loose Scan策略

在Materialization/scan 策略时，需要先将"子查询结果集"移除重复记录并固化到临时表，再作为驱动表进行关联查询。MySQL特性Index Loose Scan能在一次扫描中得跳过重复索引键得到"没有重复记录的临时结果集"，Loose Scan策略基于Index Loose Scan特性保证关联查询不会出现"重复关联问题"。
如对于查询：

SELECT * FROM Country
WHERE Country.code IN (
SELECT country_code FROM Satellite
)

复制代码

如果Satellite.country_code 存在索引，基于Index Loose Scan特性则能快速获得"SELECT DISTINCT country_code FROM Satellite"的效果，如图所示：

使用Loose Scan 策略时，MariaDB 查询计划为：

MariaDB [world]> explain select * from Country where Country.code in (select country_code from Satellite);
+----+-------------+-----------+--------+---------------+--------------+---------+------------------------------+------+-------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+--------+---------------+--------------+---------+------------------------------+------+-------------------------------------+
| 1 | PRIMARY | Satellite | index | country_code | country_code | 9 | NULL | 932 | Using where; Using index; LooseScan |
| 1 | PRIMARY | Country | eq_ref | PRIMARY | PRIMARY | 3 | world.Satellite.country_code | 1 | Using index condition |
+----+-------------+-----------+--------+---------------+--------------+---------+------------------------------+------+-------------------------------------+

复制代码

Loose Scan 策略和Materialization/scan 策略区别：

Materialization/scan 策略：先将子查询的查询结果固化去重后，再作为驱动表与外部表进行关联查询，查询使用到临时表。
Loose Scan 策略：在对子查询的表进行Index Loose Scan操作过程中，直接将遍历到的记录与与外部表进行关联查询，查询未使用到临时表。

Duplicate Weedout策略

当无法根据表结构信息推算出"子查询结果集"不存在重复记录时，如果将子查询改写为关联查询，则会导致"外表记录"被关联匹配多次而产生重复记录，可以通过将关联结果集插入到"带有唯一索引的临时表"的方式来移除重复记录，保证最终查询结果的准确性。
对于查询：

SELECT *
FROM Country
WHERE Country.code IN (
SELECT City.Country
FROM City
WHERE City.Population > 0.33 * Country.Population
AND City.Population > 1*1000*1000
);

复制代码

可以改写为：

CREATE tmp_Country LIKE Country;
INSERT IGNORE INTO tmp_Country
SELECT Country.*
FROM Country
INNER JOIN City
ON Country.code = City.Country
WHERE City.Population > 0.33 * Country.Population
AND City.Population > 1*1000*1000
SELECT * FROM tmp_Country;

复制代码

如图所示：

使用Duplicate Weedout 策略时，MariaDB 查询计划为：

explain select * from Country where Country.code IN (select City.Country from City where City.Population > 0.33 * Country.Population and City.Population > 1*1000*1000)\G
*************************** 1. row ***************************
id: 1
select_type: PRIMARY
table: City
type: range
possible_keys: Population,Country
key: Population
key_len: 4
ref: NULL
rows: 238
Extra: Using index condition; Start temporary
*************************** 2. row ***************************
id: 1
select_type: PRIMARY
table: Country
type: eq_ref
possible_keys: PRIMARY
key: PRIMARY
key_len: 3
ref: world.City.Country
rows: 1
Extra: Using where; End temporary
2 rows in set (0.00 sec)

复制代码

学习总结

First Match策略通过"找到第一条匹配记录即返回"的方式来跳过无效子查询扫描。
除First Match策略外都是子查询转换为关联查询来优化提升查询效率，按照不同查询场景采用不同策略来"避免重复记录"：

Table Pullout策略，通过唯一索引和主键索引逻辑来确认"子查询结果集"中重复记录。
Materialization策略，通过临时表来移除"子查询结果集"中重复记录。
Loose Scan策略，通过Index Loose Scan特性来跳过"子查询结果集"中重复记录。
Duplicate Weedout策略，通过临时表来将移除"关联查询结果集"中重复记录。

参考资料

Semi-join Subquery Optimizations

来源:https://www.cnblogs.com/gaogao67/p/17378214.html
免责声明：由于采集信息均来自互联网，如果侵犯了您的权益，请联系我们【E-Mail:cb@itdo.tech】我们会及时删除侵权内容，谢谢合作！

Semi-Join Subquery优化策略

本帖子中包含更多资源