Hive Temporary Scratch Files Fill Disk Space

asked Aug 25, 2017 in Hadoop by admin (4,410 points)
Summary

Scratch files left behind by completed Hive jobs can fill up the Hive temporary scratch directory, causing high HDFS disk space utilization. This article provides workarounds for this situation.

 

Applies To

Apache Hive

Symptoms

High HDFS disk space utilization in the Hive temporary scratch directory, even after the associated Hive jobs have completed.

Cause
Hive uses temporary folders on the default HDFS instance. These folders are used to store per-query temporary/intermediate data sets and are normally cleaned up by the Hive client when the query is finished. However, in cases of abnormal Hive client termination, some data may be left behind.
This scratch directory path is defined by the configuration property hive.exec.scratchdir:
  • Default Value: /tmp/${user.name} in Hive 0.2.0 through 0.8.0
  • Default Value: /tmp/hive-${user.name} in Hive 0.8.1 through 0.14.0
  • Default Value: /tmp/hive in Hive 0.14.0 and later
In Hive 0.14.0 and later, a per-user HDFS scratch directory ${hive.exec.scratchdir}/<username> is created for each connecting user.
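
To confirm which scratch directory a cluster is actually using and how much space it occupies, the property can be printed from the Hive CLI and the directory checked with hdfs dfs. This is a minimal sketch; the path in the second command assumes the Hive 0.14.0+ default and should be adjusted for older versions.

# Print the effective scratch directory setting
hive -e 'SET hive.exec.scratchdir;'

# Check how much HDFS space the scratch directory currently occupies
# (path assumes the Hive 0.14.0+ default of /tmp/hive)
hdfs dfs -du -s -h /tmp/hive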

For example, Hive session directories are created in the HDFS scratch directory when a client connects to HiveServer2 and initiates a session. If the client does not close the session properly, its session directory will linger. Sessions can be cleaned up automatically by configuring HiveServer2 session timeouts.
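
As a quick way to spot lingering session directories, the scratch directory can be listed per user. The paths below are examples assuming the Hive 0.14.0+ default layout, and <username> is a placeholder.

# List per-user scratch directories and their last modification times
hdfs dfs -ls /tmp/hive

# Drill into one user's directory to see individual session directories
hdfs dfs -ls /tmp/hive/<username>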

Instructions
Remove unneeded directories and files left behind by completed jobs within the Hive temporary scratch directory.
Warning: Actively running Hive jobs may be using this directory, and deleting their scratch files will cause them to fail. Only delete files and directories that can safely be assumed to no longer be associated with a currently running Hive job.
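
A minimal sketch of a manual cleanup, assuming the Hive 0.14.0+ default path; <username> and <stale-session-dir> are placeholders and should be replaced only with a directory confirmed to be stale.

# Permanently remove a single stale session directory (bypasses the HDFS trash)
hdfs dfs -rm -r -f -skipTrash /tmp/hive/<username>/<stale-session-dir>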

Monitoring Unneeded Directories and Files

 

Option 1:

Use Hadoop HDFS FUSE (Filesystem in Userspace) to monitor the unneeded directories. FUSE mounts an HDFS cluster so it can be browsed as if it were a traditional Linux filesystem. In this way, the temporary scratch directory can be monitored and cleaned up using standard Linux tools.
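
A hedged sketch of this approach using the CDH hadoop-fuse-dfs package; the NameNode host, port, mount point, and age threshold below are example values.

# Mount HDFS locally through FUSE (host, port, and mount point are examples)
mkdir -p /mnt/hdfs
hadoop-fuse-dfs dfs://<namenode-host>:8020 /mnt/hdfs

# Use standard Linux tools to find scratch directories older than 7 days
# (path assumes the Hive 0.14.0+ default of /tmp/hive)
find /mnt/hdfs/tmp/hive -mindepth 2 -maxdepth 2 -type d -mtime +7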

Option 2:

Create a cron job which uses the HdfsFindTool to search the filesystem for files older than a certain age. These directories can then be scheduled for deletion. The following example uses the HdfsFindTool to locate old scratch directories and forwards the output to the hdfs dfs command to remove them permanently.
 
# Hive 0.8.1 through 0.14.0 (scratch directory /tmp/hive-<username>)
yarn jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar org.apache.solr.hadoop.HdfsFindTool \
-find '/tmp/hive-*' -type d -name 'hive_*' -mtime +7 | xargs hdfs dfs -rm -r -f -skipTrash

# Hive 0.14.0 and later (scratch directory /tmp/hive)
yarn jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar org.apache.solr.hadoop.HdfsFindTool \
-find '/tmp/hive/' -type d -name 'hive_*' -mtime +7 | xargs hdfs dfs -rm -r -f -skipTrash

