A Spark job running on YARN fails (finishes in FAILED status), and the following errors are present in the Application Master (AM) and executor container logs:
YARN application logs show a message similar to the following in the container running the Application Master:
17/07/05 07:23:14 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 11, (reason: Max number of executor failures (3) reached)
Errors relating to container exits are also present in the Application Master YARN logs, for example:
17/07/05 07:23:38 WARN yarn.YarnAllocator: Container marked as failed: container_1499143983924_0021_02_000011 on host: <fully_qualified_hostname>. Exit status: 1. Diagnostics: Exception from container-launch.
For YARN containers running executors, messages such as the following are present in the YARN application logs:
17/07/05 07:23:39 ERROR Remoting: Remoting error: [Startup failed] [
akka.remote.RemoteTransportException: Startup failed
Caused by: org.jboss.netty.channel.ChannelException: Failed to bind to: <fully_qualified_hostname>/<ip_address>:51235
Caused by: java.net.BindException: Address already in use
Service 'sparkExecutorActorSystem' failed
This can occur when spark.executor.port is set explicitly in the Spark job configuration and spark.port.maxRetries is set too low (or left at a default that is too low) for the number of executors and other port-consuming processes running on a node.
When a Spark executor starts running on a NodeManager, it tries to bind to a port to enable communication with the Spark driver. To find an unused port number, an executor will first try to bind to the port defined by spark.executor.port. If this fails (because the port is already in use), it will try spark.executor.port + 1, then spark.executor.port + 2, and so on. The executor will keep trying ports up to spark.executor.port + spark.port.maxRetries. If all of these attempts fail, the executor will exit.
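The port search described above can be sketched as a simple bind-with-retries loop. This is an illustrative Python sketch only, not Spark's actual implementation (Spark performs this logic internally in Scala); the function name bind_with_retries is hypothetical:

```python
import socket

def bind_with_retries(base_port, max_retries):
    """Illustrative sketch of Spark-style port selection: try base_port,
    base_port + 1, ... up to base_port + max_retries, returning the first
    socket that binds successfully. Raises OSError if every port in the
    range is already in use (analogous to the executor exiting)."""
    for offset in range(max_retries + 1):
        port = base_port + offset
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind(("", port))          # succeeded: this port was free
            return sock, port
        except OSError:
            sock.close()                   # port in use: try the next one
    raise OSError(
        "no free port in range %d-%d" % (base_port, base_port + max_retries)
    )
```

With max_retries=0 the loop tries only base_port itself, which mirrors why a low spark.port.maxRetries causes executors on a busy node to fail immediately.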
If too many executors (more than spark.yarn.max.executor.failures) fail to start for this reason, the application attempt is terminated.
To resolve this, set or increase spark.port.maxRetries during job submission, for example:
spark-submit --class org.apache.spark.examples.SparkPi --conf spark.port.maxRetries=1000 --conf spark.executor.port=51000 --deploy-mode cluster --master yarn /opt/cloudera/parcels/CDH-5.10.1-1.cdh5.10.1.p0.10/jars/spark-examples-1.6.0-cdh5.10.1-hadoop2.6.0-cdh5.10.1.jar 1000000
spark.port.maxRetries should be set large enough that every executor can find an unused port to bind to. The value should take into account the maximum number of executors likely to run on a node, plus any other applications or processes that could be using ports in the range spark.executor.port through spark.executor.port + spark.port.maxRetries (e.g. ports 51000-52000 in the example above).
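When sizing that range, it can help to check how many ports in it are currently free on a node. The snippet below is a hypothetical helper, not part of Spark; note that a port that is free when the check runs may still be taken by the time an executor starts, so treat the result as a rough guide:

```python
import socket

def free_ports_in_range(start, end):
    """Count the ports in [start, end] that can currently be bound on this
    host. Hypothetical helper for sanity-checking a spark.executor.port /
    spark.port.maxRetries range; results are a point-in-time snapshot."""
    free = 0
    for port in range(start, end + 1):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            s.bind(("", port))   # bind succeeds only if the port is free now
            free += 1
        except OSError:
            pass                 # port already in use
        finally:
            s.close()
    return free

# Example: check the range from the spark-submit command above.
# print(free_ports_in_range(51000, 52000))
```

If the count of free ports is close to (or below) the number of executors expected per node, increase spark.port.maxRetries or choose a quieter base port.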