The error in question is a port-binding error for a Spark job running in client mode.
When a Spark job starts in client mode, the driver tries to bind the Spark UI to a dedicated port (4040 by default); if the port is already in use, it retries on the next port, and fails after 16 attempts (the default value of spark.port.maxRetries).
Under normal conditions the binding succeeds, but when a large number of Spark jobs run concurrently on the same node, they contend for the same range of ports, and a job that cannot find a free port within its 16 retries fails.
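The retry behavior described above can be illustrated with a minimal sketch: bind a socket, and on failure increment the port and try again, up to a retry limit. This mimics what Spark does with spark.port.maxRetries; the function name and the starting port (43210, chosen here to avoid clashing with the real Spark UI default of 4040) are illustrative, not Spark's actual code.

```python
import socket

def bind_with_retries(start_port, max_retries=16):
    """Try successive ports, mimicking Spark's UI port binding.

    Spark's default spark.port.maxRetries is 16: on a bind failure the
    port number is incremented and binding is retried, and the job fails
    once the retries are exhausted. This is an illustrative sketch only.
    """
    for offset in range(max_retries + 1):
        port = start_port + offset
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind(("127.0.0.1", port))
            return sock, port  # bound successfully
        except OSError:
            sock.close()  # port busy; try the next one
    raise OSError(f"could not bind a port after {max_retries} retries")

# First caller grabs the first free port at or above 43210;
# a second caller contending for the same range gets a different port.
blocker, p = bind_with_retries(43210)
sock2, p2 = bind_with_retries(43210)
```

With many jobs contending at once, every port in the retry window can be occupied, which is exactly the failure mode described above.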
Suggestion: use spark-submit in client mode only for debugging in DEV; for all scheduled/batch executions, run spark-submit in cluster mode.
Do NOT run the Spark driver on the edge node. This can be achieved with the --deploy-mode cluster parameter of spark-submit.
Why? It avoids the race condition that occurs when jobs triggered at the same time compete to bind the HTTP port for the Spark UI, and it also distributes the driver workload from the edge node to other nodes in the cluster.
Configuration to disable spark UI: --conf spark.ui.enabled=false
[ This is at the discretion of the application team, based on its need for the Spark UI. When an application runs as a scheduled job, there is usually no active monitoring or inspection, so the Spark UI is not really needed. With the UI disabled, it will not be available for the duration of the job execution, but it remains available on the history server after completion. All logs can still be extracted using yarn logs -applicationId <application ID>. ]
Example: spark-submit --conf spark.ui.enabled=false --class com.ms.bi.hadoop.samples.SparkApp --master yarn --deploy-mode cluster --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 <pathname/filename>