MapReduce v2 (YARN) is the compute framework intended to replace MapReduce v1; its design overcomes v1's bottlenecks on very large clusters. An introduction to YARN can be found here.
This article covers setting up a YARN cluster, assuming an HDFS cluster is already in place; here we use NameNode HA for higher reliability.
Note: this setup is based on CDH 4.3. Per Cloudera's official documentation, the YARN framework is not yet mature, and later releases may be incompatible with current or earlier versions; it is not recommended for production unless necessary. In our case we have no legacy baggage and are willing to try it and feel our way forward.
1. Installation
Configure the Yum repository (omitted; search this site for related articles).
On the ResourceManager node:
shell> yum install hadoop-yarn-resourcemanager -y
On each NodeManager node:
shell> yum install hadoop-yarn-nodemanager -y
2. Configuration
Common configuration (both RM and NM nodes):
To enable the YARN framework, add the following to mapred-site.xml:
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
To make application dependencies and services available on every node, add the following to yarn-site.xml:
<property>
  <name>yarn.application.classpath</name>
  <value>
    $HADOOP_CONF_DIR,
    $HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,
    $HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,
    $HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,
    $YARN_HOME/*,$YARN_HOME/lib/*
  </value>
  <description>CLASSPATH for YARN applications. A comma-separated list of CLASSPATH entries</description>
</property>
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce.shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<!-- RM scheduler interface address -->
<property>
  <name>yarn.resourcemanager.scheduler.address</name>
  <value>CHBM220:8030</value>
  <description>The address of the scheduler interface</description>
</property>
<!-- RM resource tracker interface address -->
<property>
  <name>yarn.resourcemanager.resource-tracker.address</name>
  <value>CHBM220:8031</value>
</property>
<!-- RM applications manager interface address -->
<property>
  <name>yarn.resourcemanager.address</name>
  <value>CHBM220:8032</value>
  <description>The address of the applications manager interface in the RM</description>
</property>
<!-- RM admin interface address -->
<property>
  <name>yarn.resourcemanager.admin.address</name>
  <value>CHBM220:8033</value>
  <description>The address of the RM admin interface</description>
</property>
<!-- YARN local data directories -->
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>file:///data/1/yarn/local,file:///data/2/yarn/local</value>
  <description>List of directories to store localized files in. An application's localized file directory will be found in ${yarn.nodemanager.local-dirs}/usercache/${user}/appcache/application_${appid}. Individual containers' work directories, called container_${contid}, will be subdirectories of this</description>
</property>
<property>
  <name>yarn.nodemanager.log-dirs</name>
  <value>file:///data/1/yarn/logs,file:///data/2/yarn/logs</value>
  <description>Where to store container logs. An application's localized log directory will be found in ${yarn.nodemanager.log-dirs}/application_${appid}. Individual containers' log directories will be below this, in directories named container_${contid}. Each container directory will contain the files stderr, stdin, and syslog generated by that container</description>
</property>
<property>
  <name>yarn.nodemanager.remote-app-log-dir</name>
  <value>/var/log/hadoop-yarn/apps</value>
  <description>Where to aggregate logs to (HDFS)</description>
</property>
<property>
  <name>yarn.app.mapreduce.am.staging-dir</name>
  <value>/user</value>
</property>
3. Create the local directories
Create the applications' local file directories (yarn.nodemanager.local-dirs):
Each application's data lives under ${yarn.nodemanager.local-dirs}/usercache/${user}/appcache/application_${appid}; each container gets its own working directory named container_${contid} beneath that.
shell> mkdir -p /data/1/yarn/local /data/2/yarn/local
Create the local container log directories (yarn.nodemanager.log-dirs), which will hold the stderr, stdin, and syslog output generated by containers:
shell> mkdir -p /data/1/yarn/logs /data/2/yarn/logs
Set ownership on the directories:
shell> chown -R yarn:yarn /data/1/yarn/local /data/2/yarn/local
shell> chown -R yarn:yarn /data/1/yarn/logs /data/2/yarn/logs
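With many NodeManagers sharing the same disk layout, the mkdir steps above can be wrapped in a small helper. A minimal sketch (the function name and the root argument are illustrative; the `chown -R yarn:yarn` step still needs to be run as above, since it requires root and the yarn user installed by the packages):

```shell
# make_yarn_dirs: create the NodeManager local and log directories for a
# node whose data disks are mounted at <root>/data/1 and <root>/data/2.
# Pass "/" on a real NM node. Ownership (chown -R yarn:yarn) is applied
# separately, as shown above.
make_yarn_dirs() {
  root="${1%/}"                 # strip a trailing slash, so "/" maps to "/data/..."
  for disk in 1 2; do
    mkdir -p "$root/data/$disk/yarn/local" "$root/data/$disk/yarn/logs"
  done
}
```

On a real NodeManager you would run `make_yarn_dirs /` followed by the chown commands above.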
4. Deploy the JobHistory Server
If you install the YARN framework, you should also install the JobHistory Server:
shell> yum install hadoop-mapreduce-historyserver
To configure it, add the following to mapred-site.xml:
mapreduce.jobhistory.address: the host:port the JobHistory Server listens on, e.g. historyserver.company.com:10020
mapreduce.jobhistory.webapp.address: the host:port of the JobHistory web UI, e.g. historyserver.company.com:19888
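In mapred-site.xml this looks like the following (the host names are the examples from the text, not fixed values):

```xml
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>historyserver.company.com:10020</value>
</property>
<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>historyserver.company.com:19888</value>
</property>
```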
5. Configure the YARN staging directory
The staging directory is where YARN stores temporary files while running jobs. By default it creates /tmp/hadoop-yarn/staging on HDFS, whose permissions can prevent users from running jobs. To avoid this, it is better to specify and create it manually:
<property>
  <name>yarn.app.mapreduce.am.staging-dir</name>
  <value>/user</value>
</property>
Once the HDFS cluster is up, create the /user directory and its history subdirectory:
shell> sudo -u hdfs hadoop fs -mkdir /user/history
shell> sudo -u hdfs hadoop fs -chmod -R 1777 /user/history
shell> sudo -u hdfs hadoop fs -chown yarn /user/history
Alternatively, you can:
1) Set mapreduce.jobhistory.intermediate-done-dir and mapreduce.jobhistory.done-dir in mapred-site.xml
2) Create these two directories
3) Set permissions 1777 on mapreduce.jobhistory.intermediate-done-dir
4) Set permissions 750 on mapreduce.jobhistory.done-dir
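A sketch of this alternative; the two HDFS paths are illustrative examples, not values mandated by the text:

```xml
<property>
  <name>mapreduce.jobhistory.intermediate-done-dir</name>
  <value>/user/history/done_intermediate</value> <!-- create on HDFS, chmod 1777 -->
</property>
<property>
  <name>mapreduce.jobhistory.done-dir</name>
  <value>/user/history/done</value> <!-- create on HDFS, chmod 750 -->
</property>
```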
6. Sync the configuration
Distribute all of the above configuration to every node.
7. Create the HDFS /tmp directory
If /tmp on HDFS is not created manually but auto-created by some program, its permissions can block other applications from using it, so it should be created by hand:
shell> sudo -u hdfs hadoop fs -mkdir /tmp
shell> sudo -u hdfs hadoop fs -chmod -R 1777 /tmp
8. Configure the log directory
Step 2 configured the log aggregation directory:
<property>
  <name>yarn.nodemanager.remote-app-log-dir</name>
  <value>/var/log/hadoop-yarn/apps</value>
  <description>Where to aggregate logs to (HDFS)</description>
</property>
So the /var/log/hadoop-yarn directory should be created on HDFS:
shell> sudo -u hdfs hadoop fs -mkdir /var/log/hadoop-yarn
shell> sudo -u hdfs hadoop fs -chown yarn:mapred /var/log/hadoop-yarn
9. Verify the HDFS directory structure
shell> sudo -u hdfs hadoop fs -ls -R /
drwxrwxrwt - hdfs supergroup 0 2012-04-19 14:31 /tmp
drwxr-xr-x - hdfs supergroup 0 2012-05-31 10:26 /user
drwxrwxrwt - yarn supergroup 0 2012-04-19 14:31 /user/history
drwxr-xr-x - hdfs supergroup 0 2012-05-31 15:31 /var
drwxr-xr-x - hdfs supergroup 0 2012-05-31 15:31 /var/log
drwxr-xr-x - yarn mapred 0 2012-05-31 15:31 /var/log/hadoop-yarn
10. Start the YARN cluster
Start the ResourceManager first:
shell> service hadoop-yarn-resourcemanager start
Then start each NodeManager.
Note: each NM node also needs: shell> yum install hadoop-mapreduce
Otherwise the NodeManager log will show an error like:
2013-11-05 20:48:39,577 FATAL org.apache.hadoop.yarn.server.nodemanager.NodeManager: Error starting NodeManager
java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.mapred.ShuffleHandler not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1649)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.init(AuxServices.java:90)
shell> service hadoop-yarn-nodemanager start
Finally, start the JobHistory Server:
shell> service hadoop-mapreduce-historyserver start
11. Set HADOOP_MAPRED_HOME
For users who submit jobs to YARN, and in environments running Pig, Hive, Sqoop, etc., set the HADOOP_MAPRED_HOME environment variable:
export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
For Hive, for example, add it to /etc/hive/conf/hive-env.sh.
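Concretely, the line to append to /etc/hive/conf/hive-env.sh (the same export works in a job-submitting user's shell profile):

```shell
# Point MapReduce clients (Hive, Pig, Sqoop, job submitters) at the
# YARN-based MapReduce install from the hadoop-mapreduce package.
export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
```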