2022-01-24 | 标签 Hadoop Hive

UDF：一进一出 //普通函数 UDAF：多进一出 //聚合函数 UDTF：一进多出 //炸裂函数一，多指的是输入数据的行数

1. 系统内置函数

1）查看系统自带的函数

hive (hive)> show functions;
Time taken: 0.23 seconds, Fetched: 289 row(s)

2）显示自带的函数的用法

hive (hive)> desc function upper;
OK
tab_name
upper(str) - Returns str with all characters changed to uppercase

3）详细显示自带的函数的用法

hive (hive)> desc function extended  upper;
OK
tab_name
upper(str) - Returns str with all characters changed to uppercase
Synonyms: ucase
Example:
  > SELECT upper('Facebook') FROM src LIMIT 1;
  'FACEBOOK'
Function class:org.apache.hadoop.hive.ql.udf.generic.GenericUDFUpper
Function type:BUILTIN

2. 常用内置函数

2.1 空字段赋值

1）函数说明

NVL：给值为 NULL 的数据赋值，它的格式是 NVL( value，default_value)。它的功能是如果 value 为 NULL，则 NVL 函数返回 default_value 的值，否则返回 value 的值，如果两个参数都为 NULL ，则返回 NULL。

2）数据准备：采用员工表

3）查询：如果员工的 comm 为 NULL，则用-1 代替

hive (hive)> select comm,nvl(comm, -1) from emp;
OK
comm    _c1
NULL    -1.0
300.0   300.0
500.0   500.0
NULL    -1.0
1400.0  1400.0
NULL    -1.0
NULL    -1.0
NULL    -1.0
NULL    -1.0
0.0     0.0
NULL    -1.0
NULL    -1.0
NULL    -1.0
NULL    -1.0
NULL    -1.0

4）查询：如果员工的 comm 为 NULL，则用领导 id 代替

hive (hive)> select comm, nvl(comm,mgr) from emp;
OK
comm    _c1
NULL    7902.0
300.0   300.0
500.0   500.0
NULL    7839.0
1400.0  1400.0
NULL    7839.0
NULL    7839.0
NULL    7566.0
NULL    NULL
0.0     0.0
NULL    7788.0
NULL    7698.0
NULL    7566.0
NULL    7782.0
NULL    7782.0

2.2 CASE WHEN THEN ELSE END

1）数据准备

| name | dept_id | sex | | — | — | — | | 悟空 | A | 男 | | 大海 | A | 男 | | 宋宋 | B | 男 | | 凤姐 | A | 女 | | 婷姐 | B | 女 | | 婷婷 | B | 女 |

2）需求：

求出不同部门男女各多少人。结果如下： dept_Id 男女 A 2 1 B 1 2

3）创建本地 emp_sex.txt，导入数据

悟空	A	男
大海 	A	男
宋宋 	B	男
凤姐 	A	女
婷姐	B	女
婷婷	B	女

4）创建 hive 表并导入数据

create table emp_sex(
name string, 
dept_id string, 
sex string) 
row format delimited fields terminated by "\t";

load data local inpath '/opt/module/datas/emp_sex.txt' into table emp_sex;

5）按需求查询数据

select 
    dept_id,
    sum(case sex when '男' then 1 else 0 end) maleCount,
    sum(case sex when '女' then 1 else 0 end) femaleCount
from emp_sex
group by dept_id;

dept_id malecount       femalecount
A       2       1
B       1       2

if法

IF(expr1,expr2,expr3) - If expr1 is TRUE (expr1 <> 0 and expr1 <> NULL) 
then IF() returns expr2; otherwise it returns expr3. 
IF() returns a numeric or string value, depending on the context in which it is used.

select 
    dept_id,
    sum(if (sex ='男',1,0)) maleCount,
    sum(if (sex ='女',1,0)) femaleCount
from emp_sex
group by dept_id;

dept_id malecount       femalecount
A       2       1
B       1       2

2.3 行转列

1）相关函数说明

CONCAT(string A/col, string B/col…)：返回输入字符串连接后的结果，支持任意个输入字符串; CONCAT_WS(separator, str1, str2,…)：它是一个特殊形式的CONCAT()。第一个参数剩余参数间的分隔符。分隔符可以是与剩余参数一样的字符串。如果分隔符是 NULL，返回值也将为NULL。这个函数会跳过分隔符参数后的任何 NULL 和空字符串。分隔符将被加到被连接的字符串之间; 注意: CONCAT_WS must be “string or array COLLECT_SET(col)：函数只接受基本数据类型，它的主要作用是将某字段的值进行去重汇总，产生 Array 类型字段。

2）示例

hive (hive)> select concat(deptno,'-',dname,'-',loc) from dept;
OK
_c0
10-ACCOUNTING-1700
20-RESEARCH-1800
30-SALES-1900
40-OPERATIONS-1700

hive (hive)> select concat_ws('-','mhk','like','jooye');
OK
_c0
mhk-like-jooye

hive (default)> select * from test1;
OK
test1.name      test1.friends   test1.children  test1.address
songsong        ["bingbing","lili"]     {"xiao song":18,"xiaoxiao song":19}     {"street":"hui long guan","city":"beijing"}
yangyang        ["caicai","susu"]       {"xiao yang":18,"xiaoxiao yang":19}     {"street":"chao yang","city":"beijing"}
Time taken: 0.295 seconds, Fetched: 2 row(s)
hive (default)> select concat_ws('-',friends) from test1;
OK
_c0
bingbing-lili
caicai-susu

hive (default)> select concat_ws('-',children) from test1;
FAILED: SemanticException [Error 10016]: 
Line 1:21 Argument type mismatch 'children': 
Argument 2 of function CONCAT_WS must be "string or array<string>", 
but "map<string,int>" was found.

3）数据准备

孙悟空	白羊座	A
大海	射手座	A
宋宋	白羊座	B
猪八戒	白羊座	A
凤姐	射手座	A
煤球王	白羊座	B

4）建立hive表并导入数据

create table person_info(
name string, 
constellation string, 
blood_type string) 
row format delimited fields terminated by "\t";

load data local inpath "/opt/module/datas/person_info.txt" into table person_info;

5）需求

把星座和血型一样的人归类到一起。结果如下：

射手座,A 	大海|凤姐
白羊座,A 	孙悟空|猪八戒
白羊座,B 	宋宋|煤球王

1、将星座和血型拼接在一起

select 
    concat(constellation,',',blood_type) con_blood,
    name
from person_info;t1 

2.聚合相同星座血型的人的姓名

select
    con_blood,
    concat_ws('|',collect_set(name))
from
(select 
    concat(constellation,',',blood_type) con_blood,
    name
from person_info)t1
group by con_blood;

结果：

con_blood       _c1
射手座,A        大海|凤姐
白羊座,A        孙悟空|猪八戒
白羊座,B        宋宋|煤球王

2.4 列转行

1）函数说明

EXPLODE(col)：将 hive 一列中复杂的 Array 或者 Map 结构拆分成多行。 LATERAL VIEW 用法：LATERAL VIEW udtf(expression) tableAlias AS columnAlias 解释：用于和 split, explode 等 UDTF 一起使用，它能够将一列数据拆成多行数据，在此基础上可以对拆分后的数据进行聚合。

2）数据准备

3）需求

将电影分类中的数组数据展开。结果如下：

《疑犯追踪》	悬疑
《疑犯追踪》	动作
《疑犯追踪》	科幻
《疑犯追踪》	剧情
《Lie to me》	悬疑
《Lie to me》	警匪
《Lie to me》	动作
《Lie to me》	心理
《Lie to me》	剧情
《战狼 2》	战争
《战狼 2》	动作
《战狼 2》	灾难

4）创建本地 movie.txt，导入数据

《疑犯追踪》		悬疑,动作,科幻,剧情
《Lie to me》	悬疑,警匪,动作,心理,剧情
《战狼 2》		战争,动作,灾难

5）创建 hive 表并导入数据

create table movie_info(
movie string,
category string)
row format delimited fields terminated by "\t";

load data local inpath "/opt/module/datas/movie.txt" into table movie_info;

6）按需求查询数据

hive (hive)> select split(category,',') from movie_info;
OK
_c0
["悬疑","动作","科幻","剧情"]
["悬疑","警匪","动作","心理","剧情"]
["战争","动作","灾难"]
Time taken: 0.349 seconds, Fetched: 3 row(s)

hive (hive)> desc function explode;
OK
tab_name
explode(a) - separates the elements of array a into multiple rows, 
or the elements of a map into multiple rows and columns 

hive (hive)> select explode(split(category,',')) from movie_info;--把string类型的category转成数组--
OK
col
悬疑
动作
科幻
剧情
悬疑
警匪
动作
心理
剧情
战争
动作
灾难

//侧写表
select
    movie,
    category_info
from movie_info
lateral VIEW 
    explode(split(category,',')) movie_info_tmp as category_info;

movie   category_info
《疑犯追踪》    悬疑
《疑犯追踪》    动作
《疑犯追踪》    科幻
《疑犯追踪》    剧情
《Lie to me》   悬疑
《Lie to me》   警匪
《Lie to me》   动作
《Lie to me》   心理
《Lie to me》   剧情
《战狼 2》      战争
《战狼 2》      动作
《战狼 2》      灾难

2.5 窗口函数（开窗函数）

1）相关函数说明

OVER()：指定分析函数工作的数据窗口大小，这个数据窗口大小可能会随着行的变而变化。 CURRENT ROW：当前行 n PRECEDING：往前 n 行数据 n FOLLOWING：往后 n 行数据 UNBOUNDED：起点， UNBOUNDED PRECEDING 表示从前面的起点， UNBOUNDED FOLLOWING 表示到后面的终点 LAG(col,n,default_val)：往前第 n 行数据 LEAD(col,n, default_val)：往后第 n 行数据 NTILE(n)：把有序窗口的行分发到指定数据的组中，各个组有编号，编号从 1 开始，对于每一行，NTILE 返回此行所属的组的编号。注意：n 必须为 int 类型。

2）数据准备：name，orderdate，cost

jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-04,29
jack,2017-01-05,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,94

3）需求

（1）查询在 2017 年 4 月份购买过的顾客及总人数（2）查询顾客的购买明细及月购买总额（3）上述的场景, 将每个顾客的 cost 按照日期进行累加（4）查询每个顾客上次的购买时间（5）查询前 20%时间的订单信息

4）创建本地 business.txt，导入数据

5）创建 hive 表并导入数据

create table business(
name string,
orderdate string,
cost int
) row format delimited fields terminated by ',';

load data local inpath "/opt/module/datas/business.txt" into table business;

6）按需求查询数据

（1）查询在 2017 年 4 月份购买过的顾客及总人数

select * from business where substring(orderdate,0,7)='2017-04';
OK
business.name   business.orderdate      business.cost
jack    2017-04-06      42
mart    2017-04-08      62
mart    2017-04-09      68
mart    2017-04-11      75
mart    2017-04-13      94

select distinct(name) from business where substring(orderdate,1,7)='2017-04';--名字去重--
jack
mart
select name from business where substring(orderdate,0,7)='2017-04' group by name;
jack
mart

select count(distinct(name)) from business where substring(orderdate,1,7)='2017-04';
2

--这是统计了两个用户在4月购买的次数--
select 
    name,
    count(*)
from business
where substring(orderdate,0,7)='2017-04'
group by name;
jack	1
mart	4

count(*)，对我定义的这个窗口里面的东西求和，可我并没有在窗口里写东西，那就对所有行求和，所有行就两行，
select 
    name,
    count(*) over()
from business
where substring(orderdate,0,7)='2017-04'
group by name;

jack	2
mart	2

over函数的作用

hive (hive)> select name,count(*) from business;
行数不一致，不让走
FAILED: SemanticException [Error 10025]: Line 1:7 Expression not in GROUP BY key 'name'

hive (hive)> select name,count(*) from business group by name;
name    _c1
jack    5
mart    4
neil    2
tony    3

hive (hive)> select name,count(*) over() from business;
name    count_window_0
mart    14
neil    14
mart    14
neil    14
mart    14
mart    14
jack    14
tony    14
jack    14
jack    14
tony    14
jack    14
tony    14
jack    14

over对每一行开窗计算，这和group by相反，group by相同的数据只有一个组 over一行数据对应一个组，只不过这个组的数据是相同的，而group by是多行数据对应一个组所以over相当于group by 是在支持分组聚合后，冗余保存了更多的信息

（2）查询顾客的购买明细及月购买总额

select name,orderdate,cost,sum(cost) over(partition by name) from business;
name    orderdate       cost    sum_window_0
jack    2017-01-05      46      176
jack    2017-01-08      55      176
jack    2017-01-01      10      176
jack    2017-04-06      42      176
jack    2017-02-03      23      176
mart    2017-04-13      94      299
mart    2017-04-11      75      299
mart    2017-04-09      68      299
mart    2017-04-08      62      299
neil    2017-05-10      12      92
neil    2017-06-12      80      92
tony    2017-01-04      29      94
tony    2017-01-02      15      94
tony    2017-01-07      50      94

select name,orderdate,cost,sum(cost) over(partition by name,month(orderdate)) from business;
name    orderdate       cost    sum_window_0
jack    2017-01-05      46      111
jack    2017-01-08      55      111
jack    2017-01-01      10      111
jack    2017-02-03      23      23
jack    2017-04-06      42      42
mart    2017-04-13      94      299
mart    2017-04-11      75      299
mart    2017-04-09      68      299
mart    2017-04-08      62      299
neil    2017-05-10      12      12
neil    2017-06-12      80      80
tony    2017-01-04      29      94
tony    2017-01-02      15      94
tony    2017-01-07      50      94

查询顾客的购买明细及所有顾客月购买总额
select name,orderdate,cost,sum(cost) over(partition by month(orderdate)) from business;
name    orderdate       cost    sum_window_0
jack    2017-01-01      10      205
jack    2017-01-08      55      205
tony    2017-01-07      50      205
jack    2017-01-05      46      205
tony    2017-01-04      29      205
tony    2017-01-02      15      205
jack    2017-02-03      23      23
mart    2017-04-13      94      341
jack    2017-04-06      42      341
mart    2017-04-11      75      341
mart    2017-04-09      68      341
mart    2017-04-08      62      341
neil    2017-05-10      12      12
neil    2017-06-12      80      80

（3）将每个顾客的 cost 按照日期进行累加

select name,orderdate,cost,sum(cost) over(partition by name order by orderdate)
from business;
name    orderdate       cost    sum_window_0
jack    2017-01-01      10      10
jack    2017-01-05      46      56
jack    2017-01-08      55      111
jack    2017-02-03      23      134
jack    2017-04-06      42      176
mart    2017-04-08      62      62
mart    2017-04-09      68      130
mart    2017-04-11      75      205
mart    2017-04-13      94      299
neil    2017-05-10      12      12
neil    2017-06-12      80      92
tony    2017-01-02      15      15
tony    2017-01-04      29      44
tony    2017-01-07      50      94
//也可以
select name,orderdate,cost,sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row)
from business;
name    orderdate       cost    sum_window_0
jack    2017-01-01      10      10
jack    2017-01-05      46      56
jack    2017-01-08      55      111
jack    2017-02-03      23      134
jack    2017-04-06      42      176
mart    2017-04-08      62      62
mart    2017-04-09      68      130
mart    2017-04-11      75      205
mart    2017-04-13      94      299
neil    2017-05-10      12      12
neil    2017-06-12      80      92
tony    2017-01-02      15      15
tony    2017-01-04      29      44
tony    2017-01-07      50      94

官方文档：
When ORDER BY is specified with missing WINDOW clause, 
the WINDOW specification defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.


将每个顾客的 cost 按照日期,累加前一天今天和后一天
select name,orderdate,cost,sum(cost) over(partition by name order by orderdate rows between 1 preceding and 1 following)
from business; 
name    orderdate       cost    sum_window_0
jack    2017-01-01      10      56
jack    2017-01-05      46      111
jack    2017-01-08      55      124
jack    2017-02-03      23      120
jack    2017-04-06      42      65
mart    2017-04-08      62      130
mart    2017-04-09      68      205
mart    2017-04-11      75      237
mart    2017-04-13      94      169
neil    2017-05-10      12      92
neil    2017-06-12      80      92
tony    2017-01-02      15      44
tony    2017-01-04      29      94
tony    2017-01-07      50      79

将cost按照日期进行累计
select name,orderdate,cost,sum(cost) over(order by orderdate)
from business;
name    orderdate       cost    sum_window_0
jack    2017-01-01      10      10
tony    2017-01-02      15      25
tony    2017-01-04      29      54
jack    2017-01-05      46      100
tony    2017-01-07      50      150
jack    2017-01-08      55      205
jack    2017-02-03      23      228
jack    2017-04-06      42      270
mart    2017-04-08      62      332
mart    2017-04-09      68      400
mart    2017-04-11      75      475
mart    2017-04-13      94      569
neil    2017-05-10      12      581
neil    2017-06-12      80      661

累加相同字段情况说明

数据：num表
num.id
1
2
3
3
4
5

select id sum(id) over(order by id) from num;
1		1
2		3
3		9
3		9
4		13
5		18

对于1，开窗是1，累加一条数据，对于2，开窗1到2，累加两条数据，对于3来说，由于有两个三，开窗累加就是4条数据

（4）查询每位顾客上次购买的时间（把时间下移一位）

select
    name,orderdate,lag(orderdate,1) over(partition by name order by orderdate)
from business;
name    orderdate       lag_window_0
jack    2017-01-01      NULL
jack    2017-01-05      2017-01-01
jack    2017-01-08      2017-01-05
jack    2017-02-03      2017-01-08
jack    2017-04-06      2017-02-03
mart    2017-04-08      NULL
mart    2017-04-09      2017-04-08
mart    2017-04-11      2017-04-09
mart    2017-04-13      2017-04-11
neil    2017-05-10      NULL
neil    2017-06-12      2017-05-10
tony    2017-01-02      NULL
tony    2017-01-04      2017-01-02
tony    2017-01-07      2017-01-04

//带参数
select
    name,orderdate,lag(orderdate,1,'1970-01-01') over(partition by name order by orderdate)
from business;
name    orderdate       lag_window_0
jack    2017-01-01      1970-01-01
jack    2017-01-05      2017-01-01
jack    2017-01-08      2017-01-05
jack    2017-02-03      2017-01-08
jack    2017-04-06      2017-02-03
mart    2017-04-08      1970-01-01
mart    2017-04-09      2017-04-08
mart    2017-04-11      2017-04-09
mart    2017-04-13      2017-04-11
neil    2017-05-10      1970-01-01
neil    2017-06-12      2017-05-10
tony    2017-01-02      1970-01-01
tony    2017-01-04      2017-01-02
tony    2017-01-07      2017-01-04


//第一行没数据，给自己
select
    name,orderdate,lag(orderdate,1,orderdate) over(partition by name order by orderdate)
from business;
name    orderdate       lag_window_0
jack    2017-01-01      2017-01-01
jack    2017-01-05      2017-01-01
jack    2017-01-08      2017-01-05
jack    2017-02-03      2017-01-08
jack    2017-04-06      2017-02-03
mart    2017-04-08      2017-04-08
mart    2017-04-09      2017-04-08
mart    2017-04-11      2017-04-09
mart    2017-04-13      2017-04-11
neil    2017-05-10      2017-05-10
neil    2017-06-12      2017-05-10
tony    2017-01-02      2017-01-02
tony    2017-01-04      2017-01-02
tony    2017-01-07      2017-01-04


//后一行，拿上去，lead
select
    name,orderdate,lead(orderdate,1,orderdate) over(partition by name order by orderdate)
from business;
name    orderdate       lead_window_0
jack    2017-01-01      2017-01-05
jack    2017-01-05      2017-01-08
jack    2017-01-08      2017-02-03
jack    2017-02-03      2017-04-06
jack    2017-04-06      2017-04-06
mart    2017-04-08      2017-04-09
mart    2017-04-09      2017-04-11
mart    2017-04-11      2017-04-13
mart    2017-04-13      2017-04-13
neil    2017-05-10      2017-06-12
neil    2017-06-12      2017-06-12
tony    2017-01-02      2017-01-04
tony    2017-01-04      2017-01-07
tony    2017-01-07      2017-01-07

（5）.查询前20%时间的订单信息分5个组，第一个组就是前20%

select
    name,orderdate,cost,ntile(5) over(order by orderdate) groupId
from business;t1 
name    orderdate       cost    groupid
jack    2017-01-01      10      1
tony    2017-01-02      15      1
tony    2017-01-04      29      1
jack    2017-01-05      46      2
tony    2017-01-07      50      2
jack    2017-01-08      55      2
jack    2017-02-03      23      3
jack    2017-04-06      42      3
mart    2017-04-08      62      3
mart    2017-04-09      68      4
mart    2017-04-11      75      4
mart    2017-04-13      94      4
neil    2017-05-10      12      5
neil    2017-06-12      80      5


select
    name,orderdate,cost
from (
    select
    name,orderdate,cost,ntile(5) over(order by orderdate) groupId
from business)t1
where groupId = 1;
name    orderdate       cost
jack    2017-01-01      10
tony    2017-01-02      15
tony    2017-01-04      29

2.6 Rank

1）函数说明

RANK() 排序相同时会重复，总数不会变 DENSE_RANK() 排序相同时会重复，总数会减少 ROW_NUMBER() 会根据顺序计算

2）数据准备

[mhk@hadoop102 datas]$ cat score.txt 
孙悟空  语文    87
孙悟空  数学    95
孙悟空  英语    68
大海    语文    94
大海    数学    56
大海    英语    84
宋宋    语文    64
宋宋    数学    86
宋宋    英语    84
婷婷    语文    65
婷婷    数学    85
婷婷    英语    78

3）需求

计算每门学科成绩排名

4）创建 hive 表并导入数据

create table score(
name string,
subject string, 
score int) 
row format delimited fields terminated by "\t";

load data local inpath '/opt/module/datas/score.txt' into table score;

5）按需求查询数据

//全局排
select *,rank() over(order by score) from score;
score.name      score.subject   score.score     rank_window_0
大海    数学    56      1
宋宋    语文    64      2
婷婷    语文    65      3
孙悟空  英语    68      4
婷婷    英语    78      5
宋宋    英语    84      6
大海    英语    84      6
婷婷    数学    85      8
宋宋    数学    86      9
孙悟空  语文    87      10
大海    语文    94      11
孙悟空  数学    95      12

select *,dense_rank() over(order by score) from score;
score.name      score.subject   score.score     dense_rank_window_0
大海    数学    56      1
宋宋    语文    64      2
婷婷    语文    65      3
孙悟空  英语    68      4
婷婷    英语    78      5
宋宋    英语    84      6
大海    英语    84      6
婷婷    数学    85      7
宋宋    数学    86      8
孙悟空  语文    87      9
大海    语文    94      10
孙悟空  数学    95      11

select *,ROW_NUMBER() over(order by score) from score;
score.name      score.subject   score.score     ROW_NUMBER_window_0
大海    数学    56      1
宋宋    语文    64      2
婷婷    语文    65      3
孙悟空  英语    68      4
婷婷    英语    78      5
宋宋    英语    84      6
大海    英语    84      7
婷婷    数学    85      8
宋宋    数学    86      9
孙悟空  语文    87      10
大海    语文    94      11
孙悟空  数学    95      12
//宋宋，大海英语成绩一样，但宋宋一直在前，而文件顺序是大海在前。这是因为当前hive是MR引擎
会经过环形缓冲区，环形缓冲区会反向溢写到磁盘
如果只有map阶段没有reduce阶段，不会走环形缓冲区，顺序也不会反着来

//学科内部进行排序
select *,rank() over(partition by subject order by score) from score;
score.name      score.subject   score.score     rank_window_0
大海    数学    56      1
婷婷    数学    85      2
宋宋    数学    86      3
孙悟空  数学    95      4
孙悟空  英语    68      1
婷婷    英语    78      2
宋宋    英语    84      3
大海    英语    84      3
宋宋    语文    64      1
婷婷    语文    65      2
孙悟空  语文    87      3
大海    语文    94      4

//各个学科前三名
select 
    *,
    rank() over(partition by subject order by score desc) rk
    from score;t1 

select
    name,
    subject,
    score
from (select 
    *,
    rank() over(partition by subject order by score desc) rk
    from score)t1
where rk<=3;
name    subject score
孙悟空  数学    95
宋宋    数学    86
婷婷    数学    85
大海    英语    84
宋宋    英语    84
婷婷    英语    78
大海    语文    94
孙悟空  语文    87
婷婷    语文    65

3. 其他常用函数

unix_timestamp:返回当前或指定时间的时间戳
hive (hive)> desc function unix_timestamp;
OK
tab_name
unix_timestamp(date[, pattern]) - Converts the time to a number

hive (hive)> select unix_timestamp();
unix_timestamp(void) is deprecated. Use current_timestamp instead.
unix_timestamp(void) is deprecated. Use current_timestamp instead.
OK
_c0
1642837012
hive (hive)> select unix_timestamp('2022-01-20','yyyy-MM-dd');
OK
_c0
1642636800


current_timestamp:当前日期加时间
current_date:当前日期
hive (hive)> select current_date;
OK
_c0
2022-01-22
hive (hive)> select current_timestamp;
OK
_c0
2022-01-22 15:36:03.213


from_unixtime:将时间戳转为日期格式
hive (hive)> desc function from_unixtime;
OK
tab_name
from_unixtime(unix_time, format) - returns unix_time in the specified format

hive (hive)> select from_unixtime(1642837012);
OK
_c0
2022-01-22 07:36:52

hive (hive)> select from_unixtime(1642837012,'yyyy-MM-dd');
OK
_c0
2022-01-22


to_date:抽取日期部分
hive (hive)> desc function to_date;
OK
tab_name
to_date(expr) - Extracts the date part of the date or datetime expression expr

hive (hive)> select to_date('2022-01-20 12:12:12');
OK
_c0
2022-01-20

year:获取年
month:获取月
day:获取日
hour:获取时
minute:获取分
second:获取秒
hive (hive)> select year('2022-01-20 12:12:12');
OK
_c0
2022

hive (hive)> select month('2022-01-20 12:12:12');
OK
_c0
1

hive (hive)> select day('2022-01-20 12:12:12');
OK
_c0
20

hive (hive)> select hour('2022-01-20 12:12:12');
OK
_c0
12

hive (hive)> select minute('2022-01-20 12:12:12');
OK
_c0
12

hive (hive)> select second('2022-01-20 12:12:12');
OK
_c0
12

weekofyear:当前时间是一年中的第几周
hive (hive)> select weekofyear('2022-01-22');
OK
_c0
3

dayofmonth:hive会默认10月35日为11月4日
hive (hive)> select dayofmonth('2000-10-35');
OK
_c0
4

months_between:两个日期间的月份
hive (hive)> desc function months_between;
OK
tab_name
months_between(date1, date2, roundOff) - returns number of months between dates date1 and date2

hive (hive)> select months_between('2022-11-11','2020-10-01');
OK
_c0
25.32258065

add_months:加月份
hive (hive)> select add_months('2022-11-11',4);
OK
_c0
2023-03-11

hive (hive)> select add_months('2022-11-11',-3);
OK
_c0
2022-08-11

datediff:两个日期相差的天数
hive (hive)> select datediff('2022-02-02','2000-02-02');
OK
_c0
8036

date_add:加天数
hive (hive)> select date_add('2022-11-11',5);
OK
_c0
2022-11-16

last_day:当前日期月份的最后一天
hive (hive)> select last_day('2022-11-11');
OK
_c0
2022-11-30
hive (hive)> select last_day('2022-02-30');
OK
_c0
2022-03-31

date_format:格式化时间
hive (hive)> desc function date_format;
OK
tab_name
date_format(date/timestamp/string, fmt) - converts a date/timestamp/string to a value of string in the format specified by the date format fmt.

hive (hive)> select date_format('2022-02-02 11:11:11','yyyy-MM-dd');
OK
_c0
2022-02-02
hive (hive)> select date_format('2022-02-02 11:11:11','yyyy/MM/dd HH:mm:ss');
OK
_c0
2022/02/02 11:11:11

//常用取整函数
round:四舍五入
hive (hive)> select round(3.14);
OK
_c0
3
hive (hive)> select round(3.54);
OK
_c0
4

ceil:向上取整
hive (hive)> select ceil(3.14);
OK
_c0
4

floor:向下取整
hive (hive)> select floor(3.14);
OK
_c0
3


//常用字符串函数操作
upper:转大写
hive (hive)> select upper('low');
OK
_c0
LOW
lower:转小写
hive (hive)> select lower('LOW');
OK
_c0
low

length:取长度
hive (hive)> select length('asfsdgfszfdas');
OK
_c0
13

trim:前后去空格
hive (hive)> select trim(' mhk  ');
OK
_c0
mhk

lpad:向左补齐，到指定长度
hive (hive)> select lpad('mhk',4,'m');
OK
_c0
mmhk

rpad:向右补齐，到指定长度
hive (hive)> select rpad('mhk',4,'m');
OK
_c0
mhkm

regexp_replace:使用正则表达式匹配目标字符串，匹配成功后替换
hive (hive)> select regexp_replace('2022-11-11','-','/');
OK
_c0
2022/11/11


//集合操作
size:返回集合中元素个数
hive (default)> select * from test1;
OK
test1.name      test1.friends   test1.children  test1.address
songsong        ["bingbing","lili"]     {"xiao song":18,"xiaoxiao song":19}     {"street":"hui long guan","city":"beijing"}
yangyang        ["caicai","susu"]       {"xiao yang":18,"xiaoxiao yang":19}     {"street":"chao yang","city":"beijing"}

hive (default)> select size(friends) from test1;
OK
_c0
2
2

map_keys: 返回map中的key
map_values:返回map中的value
hive (default)>  select map_keys(children) from test1;
OK
_c0
["xiao song","xiaoxiao song"]
["xiao yang","xiaoxiao yang"]
Time taken: 0.343 seconds, Fetched: 2 row(s)
hive (default)>  select map_values(children) from test1;
OK
_c0
[18,19]
[18,19]

array_contains:判断array中是否包含某个元素
hive (default)> select array_contains(friends,'lili') from test1;
OK
_c0
true
false
sort_array: 将array中的元素排序
hive (default)> select sort_array(friends) from test1;
OK
_c0
["bingbing","lili"]
["caicai","susu"]

grouping sets:多维分析

1001,zhangsan,male,10
1002,lisi,female,10
1003,wangwu,male,20
1004,zhaoliu,female,30
1005,jooye,female,40
1006,hongkong,male,40
1007,jisoo,female,20
1008,rosie,male,30

hive (hive)> create table staff(id string,name string,gender string,deptid string) row format delimited fields terminated by ',';
OK

hive (hive)> load data local inpath '/opt/module/datas/staff.txt' into table staff;
Loading data to table hive.staff
OK

hive (hive)> select * from staff;
OK
staff.id        staff.name      staff.gender    staff.deptid
1001    zhangsan        male    10
1002    lisi    female  10
1003    wangwu  male    20
1004    zhaoliu female  30
1005    jooye   female  40
1006    hongkong        male    40
1007    jisoo   female  20
1008    rosie   male    30

select deptid,gender,count(*) from staff group by deptid,gender grouping sets((deptid,gender),deptid,gender,());

deptid  gender  _c2
NULL    NULL    8       -- deptid,gender都为null，求的是总人数 --    
10      NULL    2       -- gender为null，是按照deptid求的各个部门的总人数 -- 
20      NULL    2
30      NULL    2
40      NULL    2
NULL    female  4       -- deptid为null，是按照gender求的男女各多少人 --
10      female  1       -- deptid和gender都不为null，是按照deptid和gender求的各个部门男女各多少人 --
20      female  1
30      female  1
40      female  1
NULL    male    4
10      male    1
20      male    1
30      male    1
40      male    1


//用HQL做wordcount
hello,mhk
hello,hive,hadoop
hello,zookeeper,spark,flink
hive,spark

create table input(word string);

load data local inpath '/opt/module/datas/input.txt' into table input;

hive (hive)> select * from input;
OK
input.word
hello,mhk
hello,hive,hadoop
hello,zookeeper,spark,flink
hive,spark

hive (hive)> select explode(split(word,',')) from input;
OK
col
hello
mhk
hello
hive
hadoop
hello
zookeeper
spark
flink
hive
spark

select word,count(*) from (select explode(split(word,',')) word from input)t1 
group by word;

word    _c1
flink   1
hadoop  1
hello   3
hive    2
mhk     1
spark   2
zookeeper       1