Hadoop Comprehensive Tuning

Tags: Hadoop, HDFS, MapReduce, YARN

1. Small File Optimization in Hadoop

1.1 Drawbacks of Small Files in Hadoop

Every file in HDFS has a corresponding metadata entry on the NameNode, roughly 150 bytes each. With many small files this metadata adds up: it consumes a large amount of NameNode memory, and the sheer number of entries slows down address lookups. Too many small files also hurt MapReduce: they produce too many input splits, so too many MapTasks are launched. Each MapTask then processes so little data that its processing time is shorter than its startup time, wasting resources.

1.2 Solutions for Small Files

  • At data ingestion time, merge small files or small batches of data into larger files before uploading them to HDFS (fix at the data source).

  • Hadoop Archive (storage side): an archiving tool that efficiently packs small files into HDFS blocks. It bundles many small files into a single HAR file, reducing NameNode memory usage.

  • CombineTextInputFormat (computation side): CombineTextInputFormat groups multiple small files into a single split, or a small number of splits, during input splitting (a driver sketch is shown after the uber-mode walkthrough below).

  • Enable uber mode for JVM reuse (computation side): by default every Task starts its own JVM. When each Task processes only a small amount of data, the Tasks of one Job can share a single JVM instead of starting a new JVM per Task.

    (1) With uber mode disabled, upload several small files to /input and run the wordcount program:

    [mhk@hadoop102 hadoop-3.1.3]$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount /input /output2

    (2) Observe the console output:

    2021-02-14 16:13:50,607 INFO mapreduce.Job: Job job_1613281510851_0002 running in uber mode : false

(3) Observe http://hadoop103:8088/cluster

[Screenshots of the YARN web UI]

(4) Enable uber mode by adding the following to mapred-site.xml:

<!-- Enable uber mode; disabled by default -->
<property>
  <name>mapreduce.job.ubertask.enable</name>
  <value>true</value>
</property>

<!-- Maximum number of MapTasks allowed in uber mode; may only be lowered -->
<property>
  <name>mapreduce.job.ubertask.maxmaps</name>
  <value>9</value>
</property>

<!-- Maximum number of ReduceTasks allowed in uber mode; may only be lowered -->
<property>
  <name>mapreduce.job.ubertask.maxreduces</name>
  <value>1</value>
</property>

<!-- Maximum input size allowed in uber mode; defaults to the value of dfs.blocksize; may only be lowered -->
<property>
  <name>mapreduce.job.ubertask.maxbytes</name>
  <value></value>
</property>

(5) Distribute the configuration:

[mhk@hadoop102 hadoop]$ xsync mapred-site.xml

(6) Run the wordcount program again:

[mhk@hadoop102 hadoop-3.1.3]$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount /input /output2

(7) Observe the console output:

2021-02-14 16:28:36,198 INFO mapreduce.Job: Job job_1613281510851_0003 running in uber mode : true

(8) Observe http://hadoop103:8088/cluster

[Screenshot of the YARN web UI]
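For the CombineTextInputFormat solution listed in 1.2, a minimal driver sketch is shown below. The class name, input/output paths, the 4 MB split cap, and the WordCountMapper/WordCountReducer classes are illustrative assumptions, not part of the original walkthrough; only the CombineTextInputFormat calls are the standard API.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative wordcount driver that packs many small input files into few splits.
public class CombineWordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount-combine");
        job.setJarByClass(CombineWordCountDriver.class);

        // Switch from the default TextInputFormat to CombineTextInputFormat
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Cap each combined split at 4 MB (tune this to the size of your small files)
        CombineTextInputFormat.setMaxInputSplitSize(job, 4 * 1024 * 1024);
        CombineTextInputFormat.setInputPaths(job, new Path(args[0]));

        job.setMapperClass(WordCountMapper.class);    // assumed wordcount mapper
        job.setReducerClass(WordCountReducer.class);  // assumed wordcount reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}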

2. HDFS Parameter Tuning

2.1 Modify hadoop-env.sh

hadoop-env.sh describes Hadoop's heap size as dynamically allocated by default:

# The maximum amount of heap to use (Java -Xmx).  If no unit
# is provided, it will be converted to MB.  Daemons will
# prefer any Xmx setting in their respective _OPT variable.
# There is no default; the JVM will autoscale based upon machine
# memory size.
# export HADOOP_HEAPSIZE_MAX=

# The minimum amount of heap to use (Java -Xms).  If no unit
# is provided, it will be converted to MB.  Daemons will
# prefer any Xms setting in their respective _OPT variable.
# There is no default; the JVM will autoscale based upon machine
# memory size.
# export HADOOP_HEAPSIZE_MIN=
HADOOP_NAMENODE_OPTS=-Xmx102400m

Check the memory used by the NameNode and the DataNode:

[mhk@hadoop102 ~]$ jps
3088 NodeManager
2611 NameNode
3271 JobHistoryServer
2744 DataNode
3579 Jps
[mhk@hadoop102 ~]$ jmap -heap 2611
Heap Configuration:
   MaxHeapSize              = 1031798784 (984.0MB)
[mhk@hadoop102 ~]$ jmap -heap 2744
Heap Configuration:
   MaxHeapSize              = 1031798784 (984.0MB)

We can see that the NameNode and DataNode on hadoop102 both have auto-sized heaps, and the two values are equal (about 984 MB each), which is not a reasonable allocation.

Concrete changes in hadoop-env.sh:

export HDFS_NAMENODE_OPTS="-Dhadoop.security.logger=INFO,RFAS -Xmx1024m"

export HDFS_DATANODE_OPTS="-Dhadoop.security.logger=ERROR,RFAS -Xmx1024m"

2.2 Modify hdfs-site.xml

<!-- The NameNode has a pool of worker threads; the default size is 10 -->
<!-- Rule of thumb: dfs.namenode.handler.count = 20 * ln(cluster size); for a cluster of 3 DataNodes, set it to 21 -->
<property>
    <name>dfs.namenode.handler.count</name>
    <value>21</value>
</property>
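A quick check of the rule of thumb above (a throwaway sketch; the class name is arbitrary):

// Evaluate the dfs.namenode.handler.count rule of thumb: 20 * ln(cluster size).
public class NameNodeHandlerCountRule {
    public static void main(String[] args) {
        int dataNodes = 3;                               // number of DataNodes in the cluster
        int handlers = (int) (20 * Math.log(dataNodes)); // 20 * ln(3) ≈ 21.97, truncated to 21
        System.out.println("dfs.namenode.handler.count = " + handlers);
    }
}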

2.3 Modify core-site.xml

Enabling the trash feature lets a deleted file be recovered as long as its retention period has not expired, guarding against accidental deletion and providing a simple form of backup.

<!-- Keep deleted files in the trash for 60 minutes -->
<property>
    <name>fs.trash.interval</name>
    <value>60</value>
</property>
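One caveat worth a hedged example: files removed with hadoop fs -rm go through the trash, but deletes issued through the HDFS Java API do not. If a program should honor fs.trash.interval, it has to move the file into the trash itself with org.apache.hadoop.fs.Trash; the path below is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Trash;

// Illustrative sketch: move a file into the HDFS trash from Java so that
// fs.trash.interval still protects programmatic deletes.
public class TrashDeleteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path target = new Path("/input/old-data.txt");  // hypothetical file to delete

        Trash trash = new Trash(fs, conf);
        boolean movedToTrash = trash.moveToTrash(target);
        if (!movedToTrash) {
            // Trash is disabled (fs.trash.interval = 0), so fall back to a hard delete.
            fs.delete(target, true);
        }
    }
}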

3. MapReduce Parameter Tuning

MapReduce performance bottlenecks come from two areas:

  • Machine performance: CPU, memory, disk, network
  • I/O behavior: (1) data skew, (2) Map tasks that run so long that Reduce tasks wait too long, (3) too many small files

3.1 Map Phase Optimization

  • Use a custom partitioner to reduce data skew: define a class that extends Partitioner and overrides getPartition (see the example in 3.4.2).

  • Reduce the number of spills

    mapreduce.task.io.sort.mb

    Size of the shuffle circular buffer; default 100m, can be raised to 200m.

    mapreduce.map.sort.spill.percent

    Spill threshold of the circular buffer; default 80%, can be raised to 90%.

  • Increase the merge factor

    mapreduce.task.io.sort.factor

    Number of spill files merged in one pass; default 10, can be raised to 20.

  • Apply a Combiner early, as long as it does not change the business result

    job.setCombinerClass(xxxReducer.class);

  • Use Snappy or LZO compression on the map output to reduce disk I/O

    conf.setBoolean("mapreduce.map.output.compress", true);
    conf.setClass("mapreduce.map.output.compress.codec", SnappyCodec.class, CompressionCodec.class);

  • mapreduce.map.memory.mb

    Upper limit on MapTask memory, 1024 MB by default. Raise it following the rule of thumb of 1 GB of memory per 128 MB of data.

  • mapreduce.map.java.opts

    Controls the MapTask heap size. (If the heap is too small, the task fails with java.lang.OutOfMemoryError.)

  • mapreduce.map.cpu.vcores

    Number of CPU cores per MapTask, 1 by default. Increase it for CPU-bound jobs.

  • Retries on failure

    mapreduce.map.maxattempts

    Maximum number of attempts for each MapTask; once exceeded, the MapTask is considered failed. Default: 4. Raise it as appropriate for your machines.
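The Map-side knobs above are usually set cluster-wide in mapred-site.xml (see 3.3), but they can also be overridden per job in the driver. A minimal sketch; the memory, heap, and vcore figures are illustrative choices following the suggestions above, not defaults.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;

// Illustrative per-job overrides of the Map-side parameters discussed above.
public class MapSideTuning {
    public static Job newTunedJob() throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.task.io.sort.mb", 200);             // circular buffer: 100m -> 200m
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.90f);  // spill threshold: 80% -> 90%
        conf.setInt("mapreduce.task.io.sort.factor", 20);          // merge factor: 10 -> 20
        conf.setInt("mapreduce.map.memory.mb", 2048);              // MapTask container memory (illustrative)
        conf.set("mapreduce.map.java.opts", "-Xmx1600m");          // heap, kept below memory.mb (illustrative)
        conf.setInt("mapreduce.map.cpu.vcores", 2);                // more vcores for CPU-bound maps
        conf.setBoolean("mapreduce.map.output.compress", true);    // compress map output
        conf.setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);
        return Job.getInstance(conf, "map-side-tuned-job");
    }
}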

3.2 Reduce Phase Optimization

  • mapreduce.reduce.shuffle.parallelcopies

    Number of parallel copiers each ReduceTask uses to fetch map output; default 5, can be raised to 10.

  • mapreduce.reduce.shuffle.input.buffer.percent

    Fraction of the ReduceTask's memory used as the shuffle buffer; default 0.7, can be raised to 0.8.

  • mapreduce.reduce.shuffle.merge.percent

    Fraction of the buffer at which data starts being written to disk; default 0.66, can be raised to 0.75.

  • mapreduce.reduce.memory.mb

    Upper limit on ReduceTask memory, 1024 MB by default; following the 1 GB per 128 MB of data rule of thumb, raise it to 4-6 GB as appropriate.

  • mapreduce.reduce.java.opts

    Controls the ReduceTask heap size. (If the heap is too small, the task fails with java.lang.OutOfMemoryError.)

  • mapreduce.reduce.cpu.vcores

    Number of CPU cores per ReduceTask, 1 by default; can be raised to 2-4.

  • mapreduce.reduce.maxattempts

    Maximum number of attempts for each ReduceTask; once exceeded, the ReduceTask is considered failed. Default: 4.

  • mapreduce.job.reduce.slowstart.completedmaps

    Resources for ReduceTasks are requested only after this fraction of MapTasks has completed. Default 0.05.

  • mapreduce.task.timeout

    If a Task makes no progress within this period (it neither reads new input nor produces output), it is considered blocked and possibly hung forever. To keep a user program from blocking without ever exiting, the framework enforces this timeout (in milliseconds); the default is 600000 (10 minutes). If your program legitimately spends a long time on each input record, increase this value.

  • If the job can do without a Reduce phase, avoid using one.
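As on the Map side, these Reduce-side parameters can be overridden per job before falling back on the cluster-wide values in mapred-site.xml shown in 3.3 below. A hedged sketch with the suggested (not default) values; the heap figure is illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Illustrative per-job overrides of the Reduce-side parameters discussed above.
public class ReduceSideTuning {
    public static Job newTunedJob() throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 10);             // copiers: 5 -> 10
        conf.setFloat("mapreduce.reduce.shuffle.input.buffer.percent", 0.80f);  // 0.70 -> 0.80
        conf.setFloat("mapreduce.reduce.shuffle.merge.percent", 0.75f);         // 0.66 -> 0.75
        conf.setInt("mapreduce.reduce.memory.mb", 4096);                        // ReduceTask container memory
        conf.set("mapreduce.reduce.java.opts", "-Xmx3200m");                    // heap, kept below memory.mb (illustrative)
        conf.setInt("mapreduce.reduce.cpu.vcores", 2);                          // vcores per ReduceTask
        conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.05f);   // slow-start threshold
        conf.setLong("mapreduce.task.timeout", 600000L);                        // progress timeout (ms)
        return Job.getInstance(conf, "reduce-side-tuned-job");
    }
}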

3.3 Modify mapred-site.xml

<!-- Circular buffer size, default 100m -->
<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>100</value>
</property>

<!-- Circular buffer spill threshold, default 0.8 -->
<property>
  <name>mapreduce.map.sort.spill.percent</name>
  <value>0.80</value>
</property>

<!-- Merge factor, default 10 -->
<property>
  <name>mapreduce.task.io.sort.factor</name>
  <value>10</value>
</property>

<!-- MapTask memory, default 1 GB; the MapTask heap size (mapreduce.map.java.opts) tracks this value by default -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>-1</value>
  <description>The amount of memory to request from the scheduler for each    map task. If this is not specified or is non-positive, it is inferred from mapreduce.map.java.opts and mapreduce.job.heap.memory-mb.ratio. If java-opts are also not specified, we set it to 1024.
  </description>
</property>

<!-- MapTask CPU vcores, default 1 -->
<property>
  <name>mapreduce.map.cpu.vcores</name>
  <value>1</value>
</property>

<!-- MapTask retry attempts on failure, default 4 -->
<property>
  <name>mapreduce.map.maxattempts</name>
  <value>4</value>
</property>

<!-- Number of parallel copiers each Reduce uses to fetch map output; default 5 -->
<property>
  <name>mapreduce.reduce.shuffle.parallelcopies</name>
  <value>5</value>
</property>

<!-- Fraction of Reduce memory used as the shuffle buffer, default 0.7 -->
<property>
  <name>mapreduce.reduce.shuffle.input.buffer.percent</name>
  <value>0.70</value>
</property>

<!-- Fraction of the buffer at which data starts spilling to disk, default 0.66 -->
<property>
  <name>mapreduce.reduce.shuffle.merge.percent</name>
  <value>0.66</value>
</property>

<!-- ReduceTask memory, default 1 GB; the ReduceTask heap size (mapreduce.reduce.java.opts) tracks this value by default -->
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>-1</value>
  <description>The amount of memory to request from the scheduler for each    reduce task. If this is not specified or is non-positive, it is inferred
    from mapreduce.reduce.java.opts and mapreduce.job.heap.memory-mb.ratio.
    If java-opts are also not specified, we set it to 1024.
  </description>
</property>

<!-- ReduceTask CPU vcores, default 1 -->
<property>
  <name>mapreduce.reduce.cpu.vcores</name>
  <value>2</value>
</property>

<!-- ReduceTask retry attempts on failure, default 4 -->
<property>
  <name>mapreduce.reduce.maxattempts</name>
  <value>4</value>
</property>

<!-- Resources for ReduceTasks are requested only after this fraction of MapTasks has completed; default 0.05 -->
<property>
  <name>mapreduce.job.reduce.slowstart.completedmaps</name>
  <value>0.05</value>
</property>

<!-- If the task makes no progress within the timeout (default 10 minutes), it is forcibly failed -->
<property>
  <name>mapreduce.task.timeout</name>
  <value>600000</value>
</property>

3.4 MapReduce Data Skew

3.4.1 Symptoms of Data Skew

Frequency skew: one partition receives far more records than the others.

Size skew: some records are far larger than the average.

3.4.2 Ways to Reduce Data Skew

  • First check whether the skew is caused by an excess of null or empty keys.

    In production you can simply filter out the empty values; if they must be kept, use a custom partitioner that salts the empty keys with a random number to scatter them, then aggregate a second time (a sketch follows this list).

  • Whatever can be handled early in the Map phase should be handled there, e.g. with a Combiner or a map-side join (MapJoin).

  • Increase the number of ReduceTasks.
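A minimal sketch of the salting idea from the first bullet: a custom partitioner (extending org.apache.hadoop.mapreduce.Partitioner, as noted in 3.1) that scatters empty keys across all reducers instead of sending them to one. The class name and key/value types are illustrative, and a second aggregation pass is still needed afterwards.

import java.util.Random;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative partitioner: empty keys are scattered randomly so that a single
// reducer is not overloaded by them; normal keys use hash partitioning.
public class NullScatterPartitioner extends Partitioner<Text, IntWritable> {
    private final Random random = new Random();

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (key == null || key.toString().isEmpty()) {
            return random.nextInt(numPartitions);  // spread the "empty" keys around
        }
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

Wire it up in the driver with job.setPartitionerClass(NullScatterPartitioner.class), together with job.setNumReduceTasks(n) for the last bullet.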

4. YARN Parameter Tuning

4.1 ResourceManager Parameters

  • yarn.resourcemanager.scheduler.class

    Selects the scheduler; the default is the Capacity Scheduler.

  • yarn.resourcemanager.scheduler.client.thread-count

    Number of threads the ResourceManager uses to handle scheduler requests; default 50.

4.2 NodeManager Parameters

  • yarn.nodemanager.resource.detect-hardware-capabilities

    Whether YARN detects the node's hardware and configures resources automatically; default false.

  • yarn.nodemanager.resource.count-logical-processors-as-cores

    Whether logical processors (hyperthreads) are counted as CPU cores; default false.

  • yarn.nodemanager.resource.memory-mb

    Memory the NodeManager can allocate to containers; default 8 GB.

  • yarn.nodemanager.resource.system-reserved-memory-mb

    Memory the NodeManager reserves for the system.

    Configure only one of the two parameters above.

  • yarn.nodemanager.resource.pcores-vcores-multiplier

    Multiplier from physical cores to vcores; for example, with 4 cores and 8 threads it should be set to 2. Default 1.0.

  • yarn.nodemanager.resource.cpu-vcores

    Number of CPU cores the NodeManager can allocate; default 8.

  • yarn.nodemanager.pmem-check-enabled

    Whether the physical-memory check that limits containers is enabled; on by default.

  • yarn.nodemanager.vmem-check-enabled

    Whether the virtual-memory check that limits containers is enabled; on by default.

  • yarn.nodemanager.vmem-pmem-ratio

    Ratio of virtual memory to physical memory; default 2.1.

4.3 Container Parameters

  • yarn.scheduler.minimum-allocation-mb

    Minimum container memory; default 1 GB.

  • yarn.scheduler.maximum-allocation-mb

    Maximum container memory; default 8 GB.

  • yarn.scheduler.minimum-allocation-vcores

    Minimum container CPU vcores; default 1.

  • yarn.scheduler.maximum-allocation-vcores

    Maximum container CPU vcores; default 4.
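The yarn-site.xml changes in 4.4 take effect once the NodeManagers restart. Besides the web UI at http://hadoop103:8088/cluster, a hedged way to confirm the resources each NodeManager actually registered is to query them through YarnClient; the class name below is illustrative, while the API calls are the standard org.apache.hadoop.yarn.client.api ones (Resource.getMemorySize() is the Hadoop 3.x accessor).

import java.util.List;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Illustrative check: print the memory and vcores each running NodeManager
// reported to the ResourceManager.
public class PrintNodeCapacities {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();
        try {
            List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
            for (NodeReport node : nodes) {
                Resource capability = node.getCapability();
                System.out.println(node.getNodeId()
                        + " memoryMB=" + capability.getMemorySize()
                        + " vcores=" + capability.getVirtualCores());
            }
        } finally {
            yarnClient.stop();
        }
    }
}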

4.4 Modify yarn-site.xml

<!-- Select the scheduler; the default is the Capacity Scheduler -->
<property>
	<description>The class to use as the resource scheduler.</description>
	<name>yarn.resourcemanager.scheduler.class</name>
	<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>

<!-- Number of threads the ResourceManager uses to handle scheduler requests, default 50; if more than 50 jobs are submitted concurrently the value can be raised, but not beyond 3 nodes * 4 threads = 12 threads (in practice no more than 8 once other applications are accounted for) -->
<property>
	<description>Number of threads to handle scheduler interface.</description>
	<name>yarn.resourcemanager.scheduler.client.thread-count</name>
	<value>8</value>
</property>

<!-- Whether YARN auto-detects the hardware, default false; if the node runs many other applications, configure resources manually, otherwise auto-detection is fine -->
<property>
	<description>Enable auto-detection of node capabilities such as
	memory and CPU.
	</description>
	<name>yarn.nodemanager.resource.detect-hardware-capabilities</name>
	<value>false</value>
</property>

<!-- Whether logical processors are counted as CPU cores, default false (physical cores are used) -->
<property>
	<description>Flag to determine if logical processors(such as
	hyperthreads) should be counted as cores. Only applicable on Linux
	when yarn.nodemanager.resource.cpu-vcores is set to -1 and
	yarn.nodemanager.resource.detect-hardware-capabilities is true.
	</description>
	<name>yarn.nodemanager.resource.count-logical-processors-as-cores</name>
	<value>false</value>
</property>

<!-- Multiplier from physical cores to vcores, default 1.0 -->
<property>
	<description>Multiplier to determine how to convert phyiscal cores to
	vcores. This value is used if yarn.nodemanager.resource.cpu-vcores
	is set to -1(which implies auto-calculate vcores) and
	yarn.nodemanager.resource.detect-hardware-capabilities is set to true. The	number of vcores will be calculated as	number of CPUs * multiplier.
	</description>
	<name>yarn.nodemanager.resource.pcores-vcores-multiplier</name>
	<value>1.0</value>
</property>

<!-- Memory the NodeManager can allocate to containers, default 8 GB, changed here to 4 GB -->
<property>
	<description>Amount of physical memory, in MB, that can be allocated 
	for containers. If set to -1 and
	yarn.nodemanager.resource.detect-hardware-capabilities is true, it is
	automatically calculated(in case of Windows and Linux).
	In other cases, the default is 8192MB.
	</description>
	<name>yarn.nodemanager.resource.memory-mb</name>
	<value>4096</value>
</property>

<!-- NodeManager CPU vcores; defaults to 8 when not auto-detected from hardware, changed here to 4 -->
<property>
	<description>Number of vcores that can be allocated
	for containers. This is used by the RM scheduler when allocating
	resources for containers. This is not used to limit the number of
	CPUs used by YARN containers. If it is set to -1 and
	yarn.nodemanager.resource.detect-hardware-capabilities is true, it is
	automatically determined from the hardware in case of Windows and Linux.
	In other cases, number of vcores is 8 by default.</description>
	<name>yarn.nodemanager.resource.cpu-vcores</name>
	<value>4</value>
</property>

<!-- Minimum container memory, default 1 GB -->
<property>
	<description>The minimum allocation for every container request at the RM	in MBs. Memory requests lower than this will be set to the value of this	property. Additionally, a node manager that is configured to have less memory	than this value will be shut down by the resource manager.
	</description>
	<name>yarn.scheduler.minimum-allocation-mb</name>
	<value>1024</value>
</property>

<!-- Maximum container memory, default 8 GB, changed here to 2 GB -->
<property>
	<description>The maximum allocation for every container request at the RM	in MBs. Memory requests higher than this will throw an	InvalidResourceRequestException.
	</description>
	<name>yarn.scheduler.maximum-allocation-mb</name>
	<value>2048</value>
</property>

<!-- Minimum container CPU vcores, default 1 -->
<property>
	<description>The minimum allocation for every container request at the RM	in terms of virtual CPU cores. Requests lower than this will be set to the	value of this property. Additionally, a node manager that is configured to	have fewer virtual cores than this value will be shut down by the resource	manager.
	</description>
	<name>yarn.scheduler.minimum-allocation-vcores</name>
	<value>1</value>
</property>

<!-- Maximum container CPU vcores, default 4, changed here to 2 -->
<property>
	<description>The maximum allocation for every container request at the RM	in terms of virtual CPU cores. Requests higher than this will throw an
	InvalidResourceRequestException.</description>
	<name>yarn.scheduler.maximum-allocation-vcores</name>
	<value>2</value>
</property>

<!-- Virtual-memory check, enabled by default, disabled here -->
<property>
	<description>Whether virtual memory limits will be enforced for
	containers.</description>
	<name>yarn.nodemanager.vmem-check-enabled</name>
	<value>false</value>
</property>

<!-- Ratio of virtual to physical memory, default 2.1 -->
<property>
	<description>Ratio between virtual memory to physical memory when	setting memory limits for containers. Container allocations are	expressed in terms of physical memory, and virtual memory usage	is allowed to exceed this allocation by this ratio.
	</description>
	<name>yarn.nodemanager.vmem-pmem-ratio</name>
	<value>2.1</value>
</property>
