常用窗口函数
1.前言又到了年底,最近在做年报数据拉取的需求,在这期间有一些数据的计算需要用到窗口函数,就去了解了一下常用的窗口函数的用法。
1.1.什么是窗口函数
窗口函数是 SQL 中的一种特殊函数,它允许你在查询结果中的某个窗口或窗口范围上执行计算,而不是单独针对整个结果集进行计算。
窗口函数通常与 OVER() 子句结合使用,用于定义窗口的范围。这个范围可以是整个结果集、分组的行、特定的物理行数范围,或者是基于值范围的窗口。
窗口函数通常在查询的最后阶段执行,即在结果集形成之后对结果集中的行应用窗口函数。窗口函数计算的结果是基于最终结果集的行,而不影响原始数据的分组,也就是说窗口函数的结果会添加到原查询结果的每一行上,这是个人觉得这是OVEE(PARTITION BY COLUMN_NAME ORDER BY COLUMN_NAME) 与 传统的 GROUP BY COLUMN ORDER BY COLUMN 差别最大的地方。
2.常用窗口函数
可以使用以下sql进行初始化
CREATE TABLE `video_play` (
`id` int NOT NULL AUTO_INCREMENT,
`user_name` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL COMMENT '用户名',
`video_type` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL COMMENT '视频类型',
`duration` int DEFAULT NULL COMMENT '播放时间',
`date` datetime DEFAULT NULL COMMENT '开始播放时间',
PRIMARY KEY (`id`)
);
INSERT INTO `video_play` (`id`, `user_name`, `video_type`, `duration`, `date`) VALUES (1, 'zhangsan', 'cartoon', 126, '2023-12-20 10:23:23');
INSERT INTO `video_play` (`id`, `user_name`, `video_type`, `duration`, `date`) VALUES (2, 'zhangsan', 'movie', 38, '2023-12-20 12:40:32');
INSERT INTO `video_play` (`id`, `user_name`, `video_type`, `duration`, `date`) VALUES (3, 'zhangsan', 'teleplay', 98, '2023-12-20 13:25:53');
INSERT INTO `video_play` (`id`, `user_name`, `video_type`, `duration`, `date`) VALUES (4, 'zhangsan', 'news', 45, '2023-12-20 15:24:43');
INSERT INTO `video_play` (`id`, `user_name`, `video_type`, `duration`, `date`) VALUES (5, 'lisi', 'cartoon', 83, '2023-12-20 09:28:17');
INSERT INTO `video_play` (`id`, `user_name`, `video_type`, `duration`, `date`) VALUES (6, 'lisi', 'movie', 56, '2023-12-20 12:11:11');
INSERT INTO `video_play` (`id`, `user_name`, `video_type`, `duration`, `date`) VALUES (7, 'lisi', 'teleplay', 79, '2023-12-20 13:56:23');
INSERT INTO `video_play` (`id`, `user_name`, `video_type`, `duration`, `date`) VALUES (8, 'wangwu', 'cartoon', 156, '2023-12-20 08:17:14');
INSERT INTO `video_play` (`id`, `user_name`, `video_type`, `duration`, `date`) VALUES (9, 'wangwu', 'movie', 89, '2023-12-20 09:25:35');
INSERT INTO `video_play` (`id`, `user_name`, `video_type`, `duration`, `date`) VALUES (10, 'wangwu', 'teleplay', 986, '2023-12-20 11:45:11');
INSERT INTO `video_play` (`id`, `user_name`, `video_type`, `duration`, `date`) VALUES (11, 'wangwu', 'news', 124, '2023-12-20 16:56:01');
INSERT INTO `video_play` (`id`, `user_name`, `video_type`, `duration`, `date`) VALUES (12, 'xiaoming', 'cartoon', 123, '2023-12-20 07:08:34');
INSERT INTO `video_play` (`id`, `user_name`, `video_type`, `duration`, `date`) VALUES (13, 'xiaoming', 'movie', 963, '2023-12-20 09:13:23');
2.1.SUM()
用法:SUM(COLUMN_NAME) OVER( );
如果存在PARTITION BY COLUMN_NAME, 则代表查询结果根据COLUMN_NAME进行分组求和。
如果存在ORDER BY COLUMN_NAME, 则代表查询结果则根据COLUMN_NAME排序后, 按照排序的结果进行累积求和。这里的累积求和是指在当前窗口内,当前行的值加上前行的值得到的结果。如果不加ORDER BY同一窗口内每一行的值都相同,为当前窗口内所有值的和。
sql举例:
SELECT
*,
SUM(duration) OVER() AS exp1,
SUM(duration) OVER(PARTITION BY user_name) AS exp2,
SUM(duration) OVER(ORDER BY `date` ASC) AS exp3,
SUM(duration) OVER(PARTITION BY user_name ORDER BY `date` ASC) AS exp4
FROM
video_play;
2.2.COUNT()
用法:COUNT( COLUMN_NAME) OVER( )
整体用法跟SUM()函数类似,只是由求和变成了计数,并且支持DISTINCT去重,目前已知的是Mysql不支持。
sql举例:
SELECT
*,
COUNT(duration) OVER() AS exp1,
COUNT(duration) OVER(PARTITION BY video_type) AS exp2,
COUNT(duration) OVER(ORDER BY duration) AS exp3,
COUNT(duration) OVER(PARTITION BY video_type ORDER BY duration ASC) AS exp4
FROM
video_play;
2.3.AVG()
用法:AVG(COLUMN_NAME) OVER( );
整体用法跟SUM()函数类似,只是由求和变成了求平均值。
sql举例:
SELECT
*,
AVG(duration) OVER() AS exp1,
AVG(duration) OVER(PARTITION BY user_name) AS exp2,
AVG(duration) OVER(ORDER BY `date` ASC) AS exp3,
AVG(duration) OVER(PARTITION BY user_name ORDER BY `date` ASC) AS exp4
FROM
video_play;
2.4.ROW_NUMBER()
用法:ROW_NUMBER() OVER( )
这个函数的功能是求结果的行号, 可以使用ORDER BY 进行内部排序之后再去求行号, 行号是连续的不会出现空缺
sql举例:
SELECT
*,
ROW_NUMBER() OVER() AS exp1,
ROW_NUMBER() OVER(PARTITION BY user_name) AS exp2,
ROW_NUMBER() OVER(ORDER BY `date` ASC) AS exp3,
ROW_NUMBER() OVER(PARTITION BY user_name ORDER BY `date` ASC) AS exp4
FROM
video_play;
这个函数是这次年报中用的比较多的函数,简化版的需求就是求每个用户播放时长TOPN的类型及播放时间。实现思路是现根据用户分组,再使用该函数对时长进行排序并生成行号,最后取每个用户的前N行数据,下面是实现的sql及结果
WITH user_video_play_index AS ( SELECT user_name, video_type, duration, ROW_NUMBER() OVER(PARTITION BY user_name ORDER BY duration DESC, date ASC) AS row_num FROM video_play) SELECT user_name, video_type, duration, row_numFROM user_video_play_indexWHERE row_num
页:
[1]