We are going to set up the NameNode, DataNode, ResourceManager and NodeManager all on a single machine.
Step 1: Create user and group:
groupadd hadoop
useradd hadoop -g hadoop
Set up passwordless SSH; refer to http://www.techguru.my/linux-admin/ssh/passwordless-ssh/
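A minimal sketch of the usual passwordless-SSH setup for the hadoop user (key type and the localhost target are assumptions; the linked guide covers the details):

su - hadoop
mkdir -p ~/.ssh && chmod 700 ~/.ssh
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
ssh localhost hostname    # should print the hostname without prompting for a password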
Step 2: Install the Java SDK:
yum install java
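The JAVA_HOME used in later steps points at /etc/alternatives/java_sdk_1.8.0, which on CentOS/RHEL is typically provided by the OpenJDK devel package rather than the bare JRE; a quick check (package name assumes OpenJDK 8):

java -version
ls -d /etc/alternatives/java_sdk_1.8.0 || yum install -y java-1.8.0-openjdk-devel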
Step 3: Download the Hadoop Package.
Command: wget https://archive.apache.org/dist/hadoop/core/hadoop-3.1.0/hadoop-3.1.0.tar.gz
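Step 4 assumes the tarball sits in /usr/local, so if wget was run elsewhere, move the archive there first (a small assumption about your download directory):

mv hadoop-3.1.0.tar.gz /usr/local/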
Step 4: Extract the Hadoop tar file and set up symlinks
Command:
cd /usr/local
tar -xvf hadoop-3.1.0.tar.gz
ln -s hadoop-3.1.0 hadoop
ln -s /usr/local/hadoop/etc/hadoop /etc/hadoop
chown -R hadoop:hadoop hadoop-3.1.0
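A quick sanity check that the symlinks resolve and the tree is owned by the hadoop user:

ls -ld /usr/local/hadoop /etc/hadoop
ls -l /usr/local/hadoop/bin/hadoop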
Step 5: Add the Hadoop and Java paths
Open the /etc/profile file and add the Hadoop and Java paths as shown below:
export JAVA_HOME=/etc/alternatives/java_sdk_1.8.0
export HDFS_NAMENODE_USER="hadoop"
export HDFS_DATANODE_USER="hadoop"
export HDFS_SECONDARYNAMENODE_USER="hadoop"
export YARN_RESOURCEMANAGER_USER="hadoop"
export YARN_NODEMANAGER_USER="hadoop"
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export HADOOP_MAPRED_HOME=/usr/local/hadoop
export HADOOP_COMMON_HOME=/usr/local/hadoop
export HADOOP_HDFS_HOME=/usr/local/hadoop
export YARN_HOME=/usr/local/hadoop
export HADOOP_SSH_OPTS="-p 22 -l hadoop"
export PATH=$PATH:/usr/local/hadoop/bin:/etc/alternatives/java_sdk_1.8.0/bin

Then save and close the file, and run:

source /etc/profile
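After sourcing /etc/profile, the hadoop command should be on the PATH; a quick way to confirm the environment before editing any configuration files:

echo $HADOOP_HOME
hadoop version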
Step 6: Edit the Hadoop Configuration files
cd /etc/hadoop

vi core-site.xml

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

vi mapred-site.xml

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

vi yarn-site.xml

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
  </property>
</configuration>

vi hdfs-site.xml

<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/hadoop/name</value>
    <description>Determines where on the local filesystem the DFS name node should store the name table. If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy.</description>
    <final>true</final>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/hadoop/data</value>
    <description>Determines where on the local filesystem a DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.</description>
    <final>true</final>
  </property>
  <property>
    <name>dfs.heartbeat.interval</name>
    <value>3</value>
    <description>Determines the datanode heartbeat interval in seconds.</description>
  </property>
  <property>
    <name>dfs.safemode.threshold.pct</name>
    <value>1.0f</value>
    <description>Specifies the percentage of blocks that should satisfy the minimal replication requirement defined by dfs.replication.min. Values less than or equal to 0 mean not to start in safe mode. Values greater than 1 will make safe mode permanent.</description>
  </property>
  <property>
    <name>dfs.datanode.address</name>
    <value>0.0.0.0:1004</value>
  </property>
  <property>
    <name>dfs.datanode.http.address</name>
    <value>0.0.0.0:1006</value>
  </property>
  <property>
    <name>dfs.http.address</name>
    <value>0.0.0.0:50070</value>
    <description>The address and port on which the NameNode web UI listens.</description>
    <final>true</final>
  </property>
  <property>
    <name>dfs.datanode.ipc.address</name>
    <value>0.0.0.0:8025</value>
    <description>The datanode ipc server address and port. If the port is 0 then the server will start on a free port.</description>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.permissions.enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.umaskmode</name>
    <value>077</value>
    <description>The octal umask used when creating files and directories.</description>
  </property>
  <property>
    <name>dfs.datanode.data.dir.perm</name>
    <value>700</value>
    <description>The permissions that should be set on dfs.datanode.data.dir directories. The datanode will not come up if the permissions are different on existing dfs.datanode.data.dir directories. If the directories don't exist, they will be created with this permission.</description>
  </property>
</configuration>
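Since hdfs-site.xml above points dfs.namenode.name.dir and dfs.datanode.data.dir at directories under /home/hadoop, it does no harm to create them up front with the expected ownership and permissions (a short sketch, assuming the hadoop user's home is /home/hadoop):

mkdir -p /home/hadoop/name /home/hadoop/data
chown -R hadoop:hadoop /home/hadoop/name /home/hadoop/data
chmod 700 /home/hadoop/data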
Step 7: Edit hadoop-env.sh and add the Java path as shown below:
hadoop-env.sh contains the environment variables used by the Hadoop scripts, such as the Java home path.
Command: vi hadoop-env.sh
export JAVA_HOME=/etc/alternatives/java_sdk_1.8.0
Step 8: Go to the Hadoop home directory and format the NameNode.
Command: cd $HADOOP_HOME
Command: bin/hdfs namenode -format
This formats HDFS via the NameNode. This command should only be executed the first time. Formatting the file system means initializing the directory specified by dfs.namenode.name.dir. Never format an up-and-running Hadoop file system; you will lose all the data stored in HDFS.
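If the format succeeded, the name directory configured above should now contain a current/ subdirectory with a VERSION file and an initial fsimage; a quick check:

ls /home/hadoop/name/current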
Step 9: Once the NameNode is formatted, go to the Hadoop sbin directory and start/stop all the daemons.
Command: cd $HADOOP_HOME/sbin
You can either start all daemons with a single command or start them individually.
Command: ./start-all.sh
Command: ./stop-all.sh
To start or stop services individually:

Start/Stop NameNode:
The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files stored in HDFS and tracks where file data is kept across the cluster.
Command: ./hadoop-daemon.sh start/stop namenode

Start/Stop DataNode:
On startup, a DataNode connects to the NameNode and responds to requests from the NameNode for different operations.
Command: ./hadoop-daemon.sh start/stop datanode

Start/Stop ResourceManager:
The ResourceManager is the master that arbitrates all the available cluster resources and thus helps manage the distributed applications running on the YARN system. It manages each NodeManager and each application's ApplicationMaster.
Command: ./yarn-daemon.sh start/stop resourcemanager

Start/Stop NodeManager:
The NodeManager is the per-machine agent responsible for managing containers, monitoring their resource usage and reporting the same to the ResourceManager.
Command: ./yarn-daemon.sh start/stop nodemanager

Start/Stop JobHistoryServer:
The JobHistoryServer is responsible for servicing all job-history-related requests from clients.
Command: ./mr-jobhistory-daemon.sh start/stop historyserver
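The per-daemon scripts above still work in Hadoop 3.x but print deprecation warnings; the equivalent Hadoop 3 commands, if you prefer them, are:

hdfs --daemon start namenode
hdfs --daemon start datanode
yarn --daemon start resourcemanager
yarn --daemon start nodemanager
mapred --daemon start historyserver

(use stop in place of start to shut each daemon down.)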
Step 10: Check if all services are running
prompt> jps
53088 SecondaryNameNode
53940 Jps
53620 NodeManager
52663 NameNode
53211 DataNode

Important: start all services as the root user. When you then run the jps command as root, you will see the following:

62272 NodeManager
61680 SecondaryNameNode
62657 Jps
61333 DataNode
61125 NameNode
34533 ResourceManager

If you log in as the hadoop user and run jps, you will only see:

62272 NodeManager
61680 SecondaryNameNode
61333 DataNode
61125 NameNode
62808 Jps

The JobTracker and TaskTracker of older Hadoop versions have been replaced by the ResourceManager and NodeManager.
Step 11: Check HDFS and the ResourceManager

Run the following at the command prompt to check HDFS health and DataNode status:

hdfs fsck /hbase
hdfs dfsadmin -report

Visit the URL below to see the ResourceManager status: http://localhost:8088
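As a quick end-to-end smoke test, you can create an HDFS home directory, upload a file, and run the bundled example job (paths assume the standard 3.1.0 tarball layout; adjust the jar name if your version differs):

hdfs dfs -mkdir -p /user/hadoop
hdfs dfs -put /etc/hosts /user/hadoop/
hdfs dfs -ls /user/hadoop
yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.0.jar pi 2 10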
NOTE:
If the DataNode is not running, run the command below to see the error:

/usr/local/hadoop/bin/hadoop datanode

Error reported:

java.net.SocketException: Call From 0.0.0.0 to null:0 failed on socket exception: java.net.SocketException: Permission denied; For more details see: http://wiki.apache.org/hadoop/SocketException

This happens because the DataNode is configured to listen on ports 1004 and 1006, which are privileged ports that only root can bind to. It needs to run on higher ports, e.g. 51004 and 51006. Editing dfs.datanode.address and dfs.datanode.http.address in hdfs-site.xml to use high ports solves the problem.
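A minimal sketch of the corresponding hdfs-site.xml change (51004 and 51006 are just example values; any free port above 1024 that the hadoop user can bind to will do):

<property>
  <name>dfs.datanode.address</name>
  <value>0.0.0.0:51004</value>
</property>
<property>
  <name>dfs.datanode.http.address</name>
  <value>0.0.0.0:51006</value>
</property>

Restart the DataNode after saving the change.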