2022-01-23 | 标签 Hadoop Hive

1. 分区表

分区表实际上就是对应一个 HDFS 文件系统上的独立的文件夹，该文件夹下是该分区所有的数据文件。Hive 中的分区就是分目录，把一个大的数据集根据业务需要分割成小的数据集。在查询时通过 WHERE 子句中的表达式选择查询所需要的指定的分区，这样的查询效率会提高很多。

1.1 分区表的基本操作

1）引入分区表（需要根据日期对日志进行管理, 通过部门信息模拟）

-rw-rw-r--. 1 mhk mhk  36 1月  20 15:40 dept_0118.log
-rw-rw-r--. 1 mhk mhk  33 1月  20 15:41 dept_0119.log
-rw-rw-r--. 1 mhk mhk  25 1月  20 15:42 dept_0120.log

2）创建分区表语法

create table dept_partition( deptno int, dname string, loc string)
partitioned by (day string)
row format delimited fields terminated by '\t';

注意：分区字段不能是表中已经存在的数据，可以将分区字段看作表的伪列。

3）加载数据到分区表中

load data local inpath 
'/opt/module/datas/dept_0118.log' into table dept_partition 
partition(day='2022-01-18');

load data local inpath 
'/opt/module/datas/dept_0119.log' into table dept_partition 
partition(day='2022-01-19');

load data local inpath 
'/opt/module/datas/dept_0120.log' into table dept_partition 
partition(day='2022-01-20');

注意：分区表加载数据时，必须指定分区截屏2022-01-23 下午5.40.20.png

4）查询分区表中数据

单分区查询

hive (hive)> select * from dept_partition where day='2022-01-20';
OK
dept_partition.deptno   dept_partition.dname    dept_partition.loc      dept_partition.day
50      TEST    2000    2022-01-20
60      DEV     1900    2022-01-20
hive (hive)> select * from dept_partition where deptno=50 or deptno=60;
OK
dept_partition.deptno   dept_partition.dname    dept_partition.loc      dept_partition.day
50      TEST    2000    2022-01-20
60      DEV     1900    2022-01-20

两种查询，第一种效率更高，第二种要遍历整张表，在生产环境下数据量很大的时候，第二种效率很低

多分区联合查询

select * from dept_partition where day='2022-01-18'
union
select * from dept_partition where day='2022-01-19'
union
select * from dept_partition where day='2022-01-20';

_u2.deptno      _u2.dname       _u2.loc _u2.day
40      OPERATIONS      1700    2022-01-19
50      TEST    2000    2022-01-20
60      DEV     1900    2022-01-20
20      RESEARCH        1800    2022-01-18
10      ACCOUNTING      1700    2022-01-18
30      SALES   1900    2022-01-19
或
select * from dept_partition where day='2022-01-18' or day='2022-01-19' or day='2022-01-20';

dept_partition.deptno   dept_partition.dname    dept_partition.loc      dept_partition.day
10      ACCOUNTING      1700    2022-01-18
20      RESEARCH        1800    2022-01-18
30      SALES   1900    2022-01-19
40      OPERATIONS      1700    2022-01-19
50      TEST    2000    2022-01-20
60      DEV     1900    2022-01-20

经测试第二种写法不走MR，运行更快

5）增加分区

hive (hive)> alter table dept_partition add partition(day='2022-01-21') partition(day='2022-01-22') 
partition(day='2022-01-23');

6）删除分区

hive (hive)> alter table dept_partition drop partition(day='2022-01-22') , partition(day='2022-01-23');
Dropped the partition day=2022-01-22
Dropped the partition day=2022-01-23

增加分区，分区之间是空格；删除分区，分区之间是逗号

7）查看分区表有多少分区

hive (hive)> show partitions dept_partition;
OK
partition
day=2022-01-18
day=2022-01-19
day=2022-01-20
day=2022-01-21
day=2022-01-22
day=2022-01-23
day=2022-01-24

8）查看分区表结构

hive (hive)> desc formatted dept_partition;
OK
col_name        data_type       comment
# col_name              data_type               comment             
deptno                  int                                         
dname                   string                                      
loc                     string                                      
                 
# Partition Information          
# col_name              data_type               comment             
day                     string                                      
...

1.2 二级分区

思考: 如何一天的日志数据量也很大，如何再将数据拆分?

1）创建二级分区表

create table dept_partition2(deptno int, dname string, loc string)
partitioned by (day string, hour string)
row format delimited fields terminated by '\t';

2）正常的加载数据

（1）加载数据到二级分区表中

load data local inpath 
'/opt/module/datas/dept_0118.log' into table
dept_partition2 partition(day='2022-01-18', hour='12');

（2）查询分区数据

select * from dept_partition2 where day='2022-01-18' and hour='12';

dept_partition2.deptno  dept_partition2.dname   dept_partition2.loc     dept_partition2.day     dept_partition2.hour
10      ACCOUNTING      1700    2022-01-18      12
20      RESEARCH        1800    2022-01-18      12

3）把数据直接上传到分区目录上，让分区表和数据产生关联的三种方式

hadoop fs -mkdir /user/hive/warehouse/hive.db/dept_partition/day=2022-01-22
hadoop fs -put /opt/module/datas/dept_0120.log  /user/hive/warehouse/hive.db/dept_partition/day=2022-01-22

查询数据（查询不到刚上传的数据）

select * from dept_partition where day='2022-01-22';

（1）方式一：上传数据后修复

hive (hive)> msck repair table dept_partition;
OK
Partitions not in metastore:    dept_partition:day=2022-01-22
Repair: Added partition to metastore dept_partition:day=2022-01-22

显示把我们自己创建的 day=2022-01-22 文件夹添加了进去，添加到了元数据库，可以在Navicat里查看PATITIONS表中查看,也可以在hive中用show partitions dept_partition显示分区信息查看

（2）方式二：上传数据后添加分区

hadoop fs -mkdir /user/hive/warehouse/hive.db/dept_partition/day=2022-01-23
hadoop fs -put /opt/module/datas/dept_0120.log  /user/hive/warehouse/hive.db/dept_partition/day=2022-01-23

执行添加分区

alter table dept_partition add partition(day='2022-01-23');

select * from dept_partition where day='2022-01-23';
OK
dept_partition.deptno   dept_partition.dname    dept_partition.loc      dept_partition.day
50      TEST    2000    2022-01-23
60      DEV     1900    2022-01-23

（3）方式三：创建文件夹后 load 数据到分区 load命令不光简简单单的把数据加载进去，也会修改元数据

hadoop fs -mkdir /user/hive/warehouse/hive.db/dept_partition/day=2022-01-24

load data local inpath 
'/opt/module/datas/dept_0120.log' into table
dept_partition partition(day='2022-01-24');

select * from dept_partition where day='2022-01-24';
OK
dept_partition.deptno   dept_partition.dname    dept_partition.loc      dept_partition.day
50      TEST    2000    2022-01-24
60      DEV     1900    2022-01-24

1.3 动态分区调整

关系型数据库中，对分区表 Insert 数据时候，数据库自动会根据分区字段的值，将数据插入到相应的分区中，Hive 中也提供了类似的机制，即动态分区(Dynamic Partition)，只不过，使用 Hive 的动态分区，需要进行相应的配置。

1）开启动态分区参数设置

（1）开启动态分区功能（默认 true，开启）

hive.exec.dynamic.partition=true 

（2）设置为非严格模式（动态分区的模式，默认 strict，表示必须指定至少一个分区为静态分区，nonstrict 模式表示允许所有的分区字段都可以使用动态分区。）

hive.exec.dynamic.partition.mode=nonstrict 

（3）在所有执行 MR 的节点上，最大一共可以创建多少个动态分区。默认 1000

hive.exec.max.dynamic.partitions=1000 

（4）在每个执行 MR 的节点上，最大可以创建多少个动态分区。该参数需要根据实际的数据来设定。比如：源数据中包含了一年的数据，即 day 字段有 365 个值，那么该参数就需要设置成大于 365，如果使用默认值 100，则会报错。

hive.exec.max.dynamic.partitions.pernode=100 

（5）整个 MR Job 中，最大可以创建多少个 HDFS 文件。默认 100000

hive.exec.max.created.files=100000 

（6）当有空分区生成时，是否抛出异常。一般不需要设置。默认 false

hive.error.on.empty.partition=false 	

2）案例实操

需求：将 dept 表中的数据按照部门编号（deptno 字段），插入到目标表 dept_partition 的相应分区中。（1）创建目标分区表

create table dept_no_partition(dname string,loc string)
partitioned by (deptno int)
row format delimited fields terminated by '\t';

如果这样指定的话，就是静态分区

insert into table dept_no_partition partition(deptno='70') --静态分区--
select dname,loc from dept;

（2）设置动态分区

insert into table dept_no_partition partition(deptno) 
select dname,loc,deptno from dept;
FAILED: SemanticException [Error 10096]: Dynamic partition strict mode requires at least 
one static partition column. 
To turn this off set hive.exec.dynamic.partition.mode=nonstrict

set hive.exec.dynamic.partition.mode=nonstrict;

（3）查看目标分区表的分区情况

select * from dept_no_partition;

dept_no_partition.dname dept_no_partition.loc   dept_no_partition.deptno
ACCOUNTING      1700    10
RESEARCH        1800    20
SALES   1900    30
OPERATIONS      1700    40

截屏2022-01-23 下午6.32.57.png

思考：目标分区表是如何匹配到分区字段的？因为 dept_no_partition 表中只有两个字段，所以当我们查询了三个字段时（多了deptno字段），所以系统默认以最后一个字段deptno为分区名，因为分区表的分区字段默认也是该表中的字段，且依次排在表中字段的最后面。所以分区需要分区的字段只能放在后面，不能把顺序弄错。如果我们查询了四个字段的话，则会报错，因为该表加上分区字段也才三个。要注意系统是根据查询字段的位置推断分区名的，而不是字段名称。

3）hive3 动态分区新特性

严格模式下也可以

create table dept_no_partition2(dname string,loc string)
partitioned by (deptno int)
row format delimited fields terminated by '\t';

insert into table dept_no_partition --不写分区字段--
select dname,loc,deptno from dept;

2. 分桶表

分区提供一个隔离数据和优化查询的便利方式。不过，并非所有的数据集都可形成合理的分区。对于一张表或者分区，Hive 可以进一步组织成桶，也就是更为细粒度的数据范围划分。分桶是将数据集分解成更容易管理的若干部分的另一个技术。分区针对的是数据的存储路径；分桶针对的是数据文件。

1）先创建分桶表

（1）数据准备

[mhk@hadoop102 datas]$ cat stu.txt 
  ss1
  ss2
  ss3
  ss4
  ss5
  ss6
  ss7
  ss8
  ss9
  ss10
  ss11
  ss12
  ss13
  ss14
  ss15
  ss16

（2）创建分桶表

create table stu_buck(id int, name string)
clustered by(id) 
into 4 buckets
row format delimited fields terminated by '\t';

（3）查看表结构

hive (hive)> desc formatted stu_buck;
Num Buckets:            4                        
Bucket Columns:         [id]  

（4）导入数据到分桶表中，load 的方式

load data local inpath 'opt/module/datas/stu.txt' into table stu_buck;

（5）查看创建的分桶表中是否分成 4 个桶截屏2022-01-23 下午6.41.11.png （6）查询分桶的数据

hive (hive)> select * from stu_buck;
OK
stu_buck.id     stu_buck.name
  ss16
  ss12
  ss8
  ss4
  ss9
  ss5
  ss1
  ss13
  ss10
  ss2
  ss6
  ss14
  ss3
  ss11
  ss7
  ss15

（7）分桶规则：根据结果可知：Hive 的分桶采用对分桶字段的值进行哈希，然后除以桶的个数求余的方式决定该条记录存放在哪个桶当中

2）分桶表操作需要注意的事项:

（1）reduce 的个数设置为-1,让 Job 自行决定需要用多少个 reduce 或者将 reduce 的个数设置为大于等于分桶表的桶数（2）从 hdfs 中 load 数据到分桶表中，避免本地文件找不到问题（3）不要使用本地模式

3）insert 方式将数据导入分桶表

insert into table stu_buck select * from student_insert; 	

3. 抽样查询

对于非常大的数据集，有时用户需要使用的是一个具有代表性的查询结果而不是全部结果。Hive 可以通过对表进行抽样来满足这个需求。语法: TABLESAMPLE(BUCKET x OUT OF y) 查询表 stu_buck 中的数据。

hive (hive)> select * from stu_buck tablesample(bucket 1 out of 4 on id);
OK
stu_buck.id     stu_buck.name
1016    ss16
1004    ss4
1009    ss9
1002    ss2
1003    ss3

x表示从哪个bucket开始抽取。例如，table总bucket数为4，bucket 4 out of 4，表示总共抽取（4/4=）1个bucket的数据，抽取第4个bucket的数据。

hive根据y的大小，决定抽样的比例。例如，table总共分了4份（4个bucket），当y=2时，抽取(4/2=)2个bucket的数据，当y=8时，抽取(4/8=)1/2个bucket的数据。

注意：x 的值必须小于等于 y 的值，否则

FAILED: SemanticException [Error 10061]: 
Numerator should not be bigger than denominator in sample clause for table stu_buck