Introduction to Hadoop Configuration Parameters

How To Configure Hadoop

Primary XML Files

Hadoop is configured with a set of files. The files are loaded in the order listed in the table below, with the lower files in the table overriding the higher ones:

Filename             Description
hadoop-default.xml   Generic default values
mapred-default.xml   Site-specific default values
job.xml              Configuration for a specific map/reduce job
hadoop-site.xml      Site-specific values that cannot be modified by the job
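
All of these files use the same XML layout: a <configuration> root element containing <property> elements, each with a <name> and a <value>. As a minimal sketch of the override order (the value shown is purely illustrative), an entry like the following in hadoop-site.xml would take precedence over whatever dfs.replication value hadoop-default.xml ships, because hadoop-site.xml is loaded last:

    <configuration>
      <!-- hypothetical hadoop-site.xml entry; overrides the dfs.replication
           setting from hadoop-default.xml because this file is loaded last -->
      <property>
        <name>dfs.replication</name>
        <value>2</value>
      </property>
    </configuration>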

Look up path

Configuration files are found via Java's classpath, and only the first instance of each file is used. The bin/hadoop script adds $HADOOP_CONF_DIR to the front of the classpath. When installing Hadoop on a cluster, it is best to use a conf directory outside of the distribution; that allows you to easily update the release on the cluster without changing your configuration by mistake.

hadoop-default.xml

This file has the default values for many of the configuration variables that are used by Hadoop. This file should never be in $HADOOP_CONF_DIR so that the version in the hadoop-*.jar is used. (Otherwise, if a variable is added to this file in a new release, you won't have it defined.)

mapred-default.xml

This file should contain the majority of your site's customizations of Hadoop. Although the file name is prefixed with mapred, it controls the default settings for user maps and reduces.

Some useful variables are:

Name                           Meaning
dfs.block.size                 Size in bytes of each data block in DFS
io.sort.factor                 Number of files input to each level of the merge sort
io.sort.mb                     Size in MB of the buffer used to sort the reduce inputs
io.file.buffer.size            Number of bytes used for buffering I/O files
mapred.reduce.parallel.copies  Number of threads fetching map outputs for each reduce
dfs.replication                Number of replicas for each DFS block
mapred.child.java.opts         Options passed to child task JVMs
mapred.min.split.size          Minimum number of bytes in a map input split
mapred.output.compress         Should the reduce outputs be compressed?
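
As a rough sketch, a site that wants a bigger sort buffer and more heap for task JVMs might put entries like these in mapred-default.xml (the values are illustrative only, not recommendations):

    <configuration>
      <property>
        <name>io.sort.mb</name>
        <value>200</value>
      </property>
      <property>
        <name>mapred.child.java.opts</name>
        <value>-Xmx512m</value>
      </property>
    </configuration>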

job.xml

This file is never created explicitly by the user. The map/reduce application creates a JobConf, which is serialized when the job is submitted.
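
The serialized job.xml uses the same property format as the other files; a fragment of a generated one might look roughly like this (the job name and values shown are hypothetical):

    <!-- illustrative fragment of an automatically generated job.xml -->
    <property>
      <name>mapred.job.name</name>
      <value>word count</value>
    </property>
    <property>
      <name>mapred.output.compress</name>
      <value>true</value>
    </property>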

hadoop-site.xml

This file overrides any settings in the job.xml and therefore should be very minimal. Usually it just contains the address of the NameNode, the address of the JobTracker, and the port and working directories for the various servers.
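
A minimal hadoop-site.xml along those lines might look like the following; the host names, ports, and directory paths are placeholders to replace with values for your own cluster:

    <configuration>
      <property>
        <name>fs.default.name</name>              <!-- address of the NameNode -->
        <value>namenode.example.com:9000</value>
      </property>
      <property>
        <name>mapred.job.tracker</name>           <!-- address of the JobTracker -->
        <value>jobtracker.example.com:9001</value>
      </property>
      <property>
        <name>dfs.name.dir</name>                 <!-- working directory for the NameNode -->
        <value>/path/to/dfs/name</value>
      </property>
      <property>
        <name>mapred.local.dir</name>             <!-- local working directory for map/reduce -->
        <value>/path/to/mapred/local</value>
      </property>
    </configuration>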

Environment Variables

For the most part, you should only need to define $HADOOP_CONF_DIR. Other environment variables are defined in $HADOOP_CONF_DIR/hadoop-env.sh.

Variables in hadoop-env.sh include:

Name                 Meaning
JAVA_HOME            Root of the Java installation
HADOOP_HEAPSIZE      MB of heap for the servers
HADOOP_IDENT_STRING  User name of the cluster
HADOOP_OPTS          Extra arguments to the JVM
HADOOP_HOME          Hadoop release directory
HADOOP_LOG_DIR       Directory for log files
HADOOP_PID_DIR       Directory to store the PID files for the servers

Log4j Configuration

Hadoop logs messages via Log4j by default. Log4j is configured with a log4j.properties file on the classpath, which defines both what is logged and where. For applications, the default root logger is "INFO,console", which logs all messages at level INFO and above to the console's stderr. Servers log to "INFO,DRFA", which writes to a file that is rolled daily. Log files are named $HADOOP_LOG_DIR/hadoop-$HADOOP_IDENT_STRING-<server>.log.

For Hadoop developers, it is often convenient to get additional logging from particular classes. If you are working on the TaskTracker, for example, you would likely want

  • log4j.logger.org.apache.hadoop.mapred.TaskTracker=DEBUG

in your log4j.properties.