Long Running Spark Executors For Hive Sessions

asked Aug 28, 2017 in Hadoop by admin (4,410 points)
Summary

Hive sessions which use Hive on Spark (HOS) maintain a Spark Application Master and one or more Spark Executors for as long as the session is open.

Symptoms

YARN resources are claimed by a Spark Application Master (YARN Spark Driver) and one or more Spark Executors which are not returned to YARN until their corresponding Hive session terminates.


Note: The number of Spark Executors that remain idle between queries is controlled by the Spark Dynamic Allocation parameter spark.dynamicAllocation.minExecutors.
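As a partial mitigation, the floor of idle executors can be lowered for a session before the first Spark query runs. A hedged sketch, assuming dynamic allocation is enabled: executors above the minimum are returned to YARN after spark.dynamicAllocation.executorIdleTimeout, though the Application Master itself still remains until the session ends. The timeout value below is illustrative, not a recommendation.

```sql
-- Sketch: reduce the number of executors held between queries.
-- These must be set before the first query that launches the Spark engine.
SET hive.execution.engine=spark;
SET spark.dynamicAllocation.enabled=true;
SET spark.dynamicAllocation.minExecutors=0;
-- Executors idle longer than this (seconds) are released back to YARN:
SET spark.dynamicAllocation.executorIdleTimeout=60;
```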

Applies To
  • CDH 5.7.0 and later

Please note that work is being performed as part of HIVE-14162 to address this issue.

Cause
beeline> !connect jdbc:hive2://hiveserver2.cloudera.com:10000/default
scan complete in 18ms
Connecting to jdbc:hive2://hiveserver2.cloudera.com:10000/default
Connected to: Apache Hive (version 1.1.0-cdh5.9.3-SNAPSHOT)
Driver: Hive JDBC (version 1.1.0-cdh5.9.3-SNAPSHOT)
Transaction isolation: TRANSACTION_REPEATABLE_READ

-- Setting Spark engine; Does not yet launch Spark application
0: jdbc:hive2://hiveserver2.clou> SET hive.execution.engine=spark;
No rows affected (0.049 seconds)

-- Query triggers creation of Spark Application Master and Spark Executors
-- Notice the time it takes to complete the query
0: jdbc:hive2://hiveserver2.clou> SELECT * FROM sample_07 WHERE salary > 100000;
28 rows selected (21.391 seconds)

-- Spark Application Master and Spark Executors are already running on the cluster
-- Notice the time it takes to complete is much less
0: jdbc:hive2://hiveserver2.clou> SELECT * FROM sample_07 WHERE salary > 100000;
28 rows selected (1.235 seconds)

-- Returning the Hive session to the MapReduce engine terminates the Spark Driver and Executors
0: jdbc:hive2://hiveserver2.clou> SET hive.execution.engine=mr;
No rows affected (0.017 seconds)

Instructions

The initial setup of the Spark processing engine is time-intensive. To avoid paying that startup cost for every query, once a Hive session submits a query to the Spark engine, Hive maintains a Spark Application Master (YARN Spark Driver) and one or more Spark Executors for that session until the session is terminated. The trade-off is that these Spark components consume YARN resources even while idle between queries, potentially for long periods.

Until HIVE-14162 is addressed, users must either terminate their Hive sessions to return the Spark resources to the YARN cluster or wait for the Hive session to time out.

The available options are:

  1. Wait for Hive session to be terminated by HiveServer2 session timeout
  2. Explicitly terminate beeline shell once work has been completed
  3. Explicitly set Hive session processing engine back to MapReduce
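For option #1, the HiveServer2 idle-session timeout determines how long an abandoned session (and its Spark resources) can linger. A minimal hive-site.xml sketch; the property takes milliseconds, and the 2-hour value below is purely illustrative:

```xml
<!-- Sketch: terminate HiveServer2 sessions idle longer than 2 hours.
     Value is in milliseconds; illustrative only, tune to your workload. -->
<property>
  <name>hive.server2.idle.session.timeout</name>
  <value>7200000</value>
</property>
```

Shortening this timeout frees Spark resources sooner but also disconnects legitimately idle clients, so balance it against how your users work.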


Note: Since a user cannot explicitly terminate the shell from the Hue Hive Editor, only options #1 and #3 are effective for Hue users.
