This is the preferred method for installing Spark: because Hadoop already runs YARN as the ResourceManager, Spark can submit jobs to it directly, which avoids the extra setup and overhead of running a separate standalone Spark cluster. Below are the steps for this method. Before starting, complete the Hadoop 3.x installation.
1. Download the Spark binary built for Hadoop 2.7 and later:
wget http://ftp.jaist.ac.jp/pub/apache/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz
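Before extracting, it is worth sanity-checking the download. A minimal sketch; the assumption is that the release checksum is published alongside the tarball on the Apache archive:

```shell
# Compute the SHA-512 of the tarball and compare it by eye with the
# checksum published for this release on the Apache archive
# (archive.apache.org/dist/spark/spark-2.3.1/ -- location is an assumption).
sha512sum spark-2.3.1-bin-hadoop2.7.tgz
```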
2. Extract the binaries and set the ownership (substitute the user and group that will run Spark):
tar -zxvf spark-2.3.1-bin-hadoop2.7.tgz
mv spark-2.3.1-bin-hadoop2.7 /usr/local
cd /usr/local
ln -s spark-2.3.1-bin-hadoop2.7 spark
chown -R <user>:<group> spark-2.3.1-bin-hadoop2.7
3. Set up the environment variables:
vi ~/.bashrc
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:/usr/local/hadoop/bin:$SPARK_HOME/bin:$HBASE_HOME/bin:/etc/alternatives/java_sdk_1.8.0/bin
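After editing ~/.bashrc, reload it so the current shell picks up the new variables, and confirm the Spark binaries resolve:

```shell
# Pick up the new environment variables in the current shell
source ~/.bashrc
# Should print the Spark 2.3.1 version banner if PATH is correct
spark-submit --version 2>&1 | head -n 5
```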
4. Ensure the ResourceManager is properly set up:
prompt> jps
62272 NodeManager
61680 SecondaryNameNode
61333 DataNode
61125 NameNode
56965 Jps
34533 ResourceManager
60726 Elasticsearch
51976 HRegionServer
55531 Master
51660 HQuorumPeer
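The jps check above can also be scripted so a missing daemon is caught immediately; a small sketch:

```shell
# Fail loudly if the YARN daemons are not in the jps output
jps | grep -E 'ResourceManager|NodeManager' \
  || echo "YARN daemons missing - start them with \$HADOOP_HOME/sbin/start-yarn.sh"
```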
5. Create the Spark configuration files and configure Spark to use YARN in cluster deploy mode:
mv $SPARK_HOME/conf/spark-defaults.conf.template $SPARK_HOME/conf/spark-defaults.conf
mv $SPARK_HOME/conf/spark-env.sh.template $SPARK_HOME/conf/spark-env.sh

vi $SPARK_HOME/conf/spark-defaults.conf
spark.master yarn
spark.driver.memory 5g
spark.executor.memory 1g
spark.eventLog.enabled true
spark.eventLog.dir hdfs://localhost:9000/spark-logs
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.fs.logDirectory hdfs://localhost:9000/spark-logs
spark.history.fs.update.interval 10s
spark.history.ui.port 18080
spark.submit.deployMode cluster
spark.yarn.am.memory 1g
spark.yarn.submit.file.replication 1

vi $SPARK_HOME/conf/spark-env.sh
HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
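The two hdfs://localhost:9000 URIs above assume the NameNode address of a default single-node Hadoop setup; the actual address can be confirmed before moving on:

```shell
# Print the configured default filesystem; spark.eventLog.dir and
# spark.history.fs.logDirectory must use the same scheme, host and port.
hdfs getconf -confKey fs.defaultFS
```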
6. Create the log directory in HDFS:
hdfs dfs -mkdir /spark-logs
7. Start the history server:
$SPARK_HOME/sbin/start-history-server.sh
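Whether the history server actually came up can be checked from jps and its log file (the log file name pattern is an assumption; it normally embeds the user and hostname):

```shell
# The history server shows up as "HistoryServer" in jps
jps | grep HistoryServer
# Inspect its startup log for errors (file name glob is an assumption)
tail -n 20 $SPARK_HOME/logs/spark-*HistoryServer*.out
```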
8. Submit a Spark Application to the YARN Cluster:
spark-submit --class org.apache.spark.examples.SparkPi \
    $SPARK_HOME/examples/jars/spark-examples_2.11-2.3.1.jar 10
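In cluster mode the Pi result is written to the driver container's log rather than to the submitting terminal. A sketch of retrieving it, assuming log aggregation is enabled (the awk pipeline simply takes the first column of lines beginning with an application ID):

```shell
# Grab the most recent finished application's ID from YARN
APP_ID=$(yarn application -list -appStates FINISHED 2>/dev/null \
  | awk '/^application_/ {print $1}' | tail -n 1)
# Pull its aggregated logs and find the SparkPi result line
yarn logs -applicationId "$APP_ID" | grep "Pi is roughly"
```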
9. Access the History Server:
Navigate to http://localhost:18080 in a web browser.
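The same information is exposed over the history server's REST API, which is convenient for scripting:

```shell
# List completed applications as JSON via the monitoring REST API
curl -s http://localhost:18080/api/v1/applications
```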
10. Spark Shell
Spark is configured above for cluster deploy mode, but the Spark shell cannot run in cluster mode. To use the Spark shell, comment out this line in spark-defaults.conf:
spark.submit.deployMode cluster
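Alternatively, instead of editing spark-defaults.conf back and forth, the deploy mode can be overridden per invocation, since command-line options take precedence over the defaults file:

```shell
# Run the shell on YARN in client mode without touching the config file
spark-shell --master yarn --deploy-mode client
```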
11. Spark Logs
Logs are kept per application ID in the directories:
$HADOOP_HOME/logs/userlogs/application_1474886780074_XXXX/
To find the application ID, run:
yarn application -list
If log aggregation is turned on (yarn.log-aggregation-enable set to true in yarn-site.xml), fetch the logs with:
yarn logs -applicationId <application ID>
Note:
Because the Spark applications run under the YARN ResourceManager, there is no need to run sbin/start-all.sh, which is intended for standalone cluster mode. Applications can be fully monitored from the YARN ResourceManager Web UI: http://localhost:8088
If the error below occurs, it is because the ResourceManager cannot find java in the /bin folder:
Last 4096 bytes of stderr :
/bin/bash: /bin/java: No such file or directory
Locate the real java binary and symlink it into /bin:
prompt> which java
/usr/bin/java
ln -s /usr/bin/java /bin/java