Install Spark 2.3.x on YARN with Hadoop 3.x

This is the preferred method of installing Spark: YARN's ResourceManager, already running as part of Hadoop, schedules the Spark executors, so there is no need to set up and operate a separate standalone Spark cluster. Below are the steps for this method. Before starting, complete the Hadoop 3.x installation.

1. Download the Spark binary built for Hadoop 2.7 and later:

wget http://ftp.jaist.ac.jp/pub/apache/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz
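
Dated releases are eventually dropped from mirrors such as ftp.jaist.ac.jp; if the link above returns 404, the same artifact remains available on the Apache archive:

wget https://archive.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz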

2. Extract the binaries:

tar -zxvf spark-2.3.1-bin-hadoop2.7.tgz
mv spark-2.3.1-bin-hadoop2.7 /usr/local
cd /usr/local
ln -s spark-2.3.1-bin-hadoop2.7 spark
chown -R <user>:<group> spark-2.3.1-bin-hadoop2.7   # the account that will run Spark

3. Set up environment variables:

vi ~/.bashrc
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:/usr/local/hadoop/bin:$SPARK_HOME/bin:$HBASE_HOME/bin:/etc/alternatives/java_sdk_1.8.0/bin
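
Reload the shell configuration and sanity-check that the binary is on the PATH (the Hadoop, HBase, and JDK locations above are assumed from the earlier installation):

source ~/.bashrc
spark-submit --version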

4. Ensure the ResourceManager and NodeManager are running:

prompt> jps
62272 NodeManager
61680 SecondaryNameNode
61333 DataNode
61125 NameNode
56965 Jps
34533 ResourceManager
60726 Elasticsearch
51976 HRegionServer
55531 Master
51660 HQuorumPeer
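
If ResourceManager or NodeManager is missing from the jps output, start YARN first (this assumes Hadoop is installed in /usr/local/hadoop, as in the PATH above):

/usr/local/hadoop/sbin/start-yarn.sh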

5. Create the Spark configuration files and configure Spark to run on YARN in cluster mode:

mv $SPARK_HOME/conf/spark-defaults.conf.template $SPARK_HOME/conf/spark-defaults.conf
mv $SPARK_HOME/conf/spark-env.sh.template $SPARK_HOME/conf/spark-env.sh

vi $SPARK_HOME/conf/spark-defaults.conf
spark.master                            yarn
spark.driver.memory                     5g
spark.executor.memory                   1g
spark.eventLog.enabled                  true
spark.eventLog.dir                      hdfs://localhost:9000/spark-logs

spark.history.provider                  org.apache.spark.deploy.history.FsHistoryProvider
spark.history.fs.logDirectory           hdfs://localhost:9000/spark-logs
spark.history.fs.update.interval        10s
spark.history.ui.port                   18080
spark.submit.deployMode                 cluster
spark.yarn.am.memory                    1g
spark.yarn.submit.file.replication      1

vi $SPARK_HOME/conf/spark-env.sh
HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
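
The two hdfs://localhost:9000 URIs above assume that fs.defaultFS in core-site.xml points at localhost:9000; check what your Hadoop installation actually uses and adjust the event log settings if needed:

hdfs getconf -confKey fs.defaultFS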

6. Create the log directory in HDFS:

hdfs dfs -mkdir /spark-logs
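
Confirm the directory was created, and, if applications will be submitted by other OS users, consider loosening its permissions (the chmod below is optional and deliberately permissive):

hdfs dfs -ls /
hdfs dfs -chmod 1777 /spark-logs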

7. Start the history server:

$SPARK_HOME/sbin/start-history-server.sh
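
A quick way to confirm the history server is up, via jps and Spark's REST monitoring API:

jps | grep HistoryServer
curl http://localhost:18080/api/v1/applications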

8. Submit a Spark Application to the YARN Cluster:

spark-submit --class org.apache.spark.examples.SparkPi \
               $SPARK_HOME/examples/jars/spark-examples_2.11-2.3.1.jar 10
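
In cluster mode the driver runs inside a YARN container, so SparkPi's "Pi is roughly ..." output lands in the container logs rather than the submitting terminal. With log aggregation enabled (see step 11), it can be retrieved like this, substituting the real application ID:

yarn application -list
yarn logs -applicationId <application_id> | grep "Pi is roughly"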

9. Access the History Server:

   Navigate to http://localhost:18080 in a web browser.

10. Spark Shell

Spark is configured above to run in cluster deploy mode, but spark-shell cannot run in cluster mode: the driver has to live in the local JVM that hosts the shell. To run spark-shell, comment out this line in spark-defaults.conf: spark.submit.deployMode cluster
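
Alternatively, since command-line options take precedence over spark-defaults.conf, client mode can be forced per invocation without editing the file:

spark-shell --master yarn --deploy-mode client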

11. Spark Logs

Logs are kept per application ID in the directories:
$HADOOP_HOME/logs/userlogs/application_1474886780074_XXXX/

To find the application ID, run:
yarn application -list

If log aggregation is turned on (yarn.log-aggregation-enable set to true in yarn-site.xml), fetch a given application's logs with:
yarn logs -applicationId <application_id>
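
For reference, log aggregation is enabled with the following property in yarn-site.xml (NodeManagers must be restarted afterwards for it to take effect):

<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>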

Note:
Because Spark applications run under the YARN ResourceManager, there is no need to run sbin/start-all.sh, which is intended for standalone cluster mode. Applications can be fully monitored through the ResourceManager web UI: http://localhost:8088

If the error below occurs, it is because YARN launched the container with /bin/java and no java binary exists at that path (typically because JAVA_HOME is not visible to the container):

Last 4096 bytes of stderr :
/bin/bash: /bin/java: No such file or directory

prompt> which java
/usr/bin/java

ln -s /usr/bin/java /bin/java
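
Then confirm the link resolves:

/bin/java -version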
