【SQL】恶意取消订单用户

zjk 发布于 2024-03-29 106 次阅读


问题

下面是某电商网站的订单数据,包括order_id,user_id,order_status和operate_time四个字段,我们需要找出所有恶意购买的用户。恶意购买的用户定义是:同一个用户,在任意半小时内(含),取消订单次数>=2次的就被视为恶意买家。比如该样例数据中c用户就是恶意买家。

order_id   user_id   order_status    operate_time
1101         a         已支付        2023-01-01 10:00:00
1102         a         已取消        2023-01-01 10:10:00
1103         a         待支付        2023-01-01 10:20:00
1104         b         已取消        2023-01-01 10:30:00
1105         a         待确认        2023-01-01 10:50:00
1106         a         已取消        2023-01-01 11:00:00
1107         b         已取消        2023-01-01 11:40:00
1108         b         已取消        2023-01-01 11:50:00
1109         b         已支付        2023-01-01 12:00:00
1110         b         已取消        2023-01-01 12:11:00
1111         c         已取消        2023-01-01 12:20:00
1112         c         已取消        2023-01-01 12:30:00
1113         c         已取消        2023-01-01 12:55:00
1114         c         已取消        2023-01-01 13:00:00

解答

典型的滑动窗口的场景。Hive中也是有滑动窗口的功能的(按数据范围开窗,range between and )。针对这个例子,窗口大小就是半小时,然后按每条数据进行滑动,在窗口内判断该条数据对应的用户是否是恶意用户。

with temp  as (
select  1101 as order_id  ,'a'  as user_id  ,"已支付" as order_status,  "2023-01-01 10:00:00"  as operate_time
union all 
select  1102 as order_id  ,'a'  as user_id  ,"已取消" as order_status,  "2023-01-01 10:10:00"  as operate_time
union all 
select  1103 as order_id  ,'a'  as user_id  ,"待支付" as order_status,  "2023-01-01 10:20:00"  as operate_time
union all 
select  1104 as order_id  ,'b'  as user_id  ,"已取消" as order_status,  "2023-01-01 10:30:00"  as operate_time
union all 
select  1105 as order_id  ,'a'  as user_id  ,"待确认" as order_status,  "2023-01-01 10:50:00"  as operate_time
union all 
select  1106 as order_id  ,'a'  as user_id  ,"已取消" as order_status,  "2023-01-01 11:00:00"  as operate_time
union all 
select  1107 as order_id  ,'b'  as user_id  ,"已取消" as order_status,  "2023-01-01 11:40:00"  as operate_time
union all 
select  1108 as order_id  ,'b'  as user_id  ,"已取消" as order_status,  "2023-01-01 11:50:00"  as operate_time
union all 
select  1109 as order_id  ,'b'  as user_id  ,"已支付" as order_status,  "2023-01-01 12:00:00"  as operate_time
union all 
select  1110 as order_id  ,'b'  as user_id  ,"已取消" as order_status,  "2023-01-01 12:11:00"  as operate_time
union all 
select  1111 as order_id  ,'c'  as user_id  ,"已取消" as order_status,  "2023-01-01 12:20:00"  as operate_time
union all 
select  1112 as order_id  ,'c'  as user_id  ,"已取消" as order_status,  "2023-01-01 12:30:00"  as operate_time
union all 
select  1113 as order_id  ,'c'  as user_id  ,"已取消" as order_status,  "2023-01-01 12:55:00"  as operate_time
union all 
select  1114 as order_id  ,'c'  as user_id  ,"已取消" as order_status,  "2023-01-01 13:00:00"  as operate_time
)
---这里设定tmp表中放的就是上面提供的样例数据
select
distinct user_id
from
(
    select
    order_id
    ,user_id
    ,order_status
    --通过range between以当前行为锚点,圈定数据范围为operate_time为近30分钟内,然后算该范围内的取消订单数
    ,count(case when order_status='已取消' then order_id end) over(partition by user_id order by operate_time range between 1800 preceding and current row) as cancel_order_cnt
    from
    (
        select
        order_id
        ,user_id
        ,order_status
        ,unix_timestamp(operate_time) as operate_time  ---由于range...between只能整数比较,这里先转换为秒
        from temp
    )t1
)t1
where cancel_order_cnt>=2
;
----结果
user_id
b
c