Impala Reported Corrupt Parquet File After Failing With OutOfMemory Error

asked Aug 28, 2017 in Hadoop by admin (4,410 points)
Summary
Applies To
  • Impala
  • Parquet
Symptoms

Impala reported a corrupt Parquet file after a query failed with an OutOfMemory error; if the query does not fail, no corruption is reported:

Memory limit exceeded
HdfsParquetScanner::ReadDataPage() failed to allocate 65535 bytes for decompressed data.
Corrupt Parquet file 'hdfs://nameservice1/path/to/file/914164e7120e6076-cdae1be60000001f_169433548_data.0.parq': column 'client_ord_id' had 1024 remaining values but expected 0 _
[Executed: 4/29/2017 5:28:58 AM] [Execution: 588ms]

Cause

This is reported in the upstream JIRA IMPALA-5197; it can happen in the following scenarios:

  • The query failed with an OOM error
  • The query has a LIMIT clause (see the example after this list)
  • The query was manually cancelled by the user
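
As an illustration of the LIMIT scenario, a query similar to the one below (the table name is a placeholder, not from the original report) can stop scanning a Parquet file before all of its values are read; Impala then prints the benign "Corrupt Parquet file" warning even though the file is intact:

SELECT client_ord_id FROM orders LIMIT 10;
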
Instructions

These corruption messages do not mean the file is actually corrupted; they are caused by the Impala bug mentioned earlier, IMPALA-5197.

If the message was caused by an OutOfMemory error, simply increase the memory limit for the query and try again:

SET MEM_LIMIT=10g;
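
A minimal sketch of the workaround in impala-shell, assuming a placeholder table and query (the 10g value is only an example; choose a limit that fits your cluster):

-- raise the per-query memory limit for this session (example value)
SET MEM_LIMIT=10g;
-- re-run the query that previously failed with the OOM error (placeholder query)
SELECT count(*) FROM orders WHERE client_ord_id IS NOT NULL;
-- optionally remove the explicit query-level limit again (0 means no query-level limit)
SET MEM_LIMIT=0;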

For the other two causes, we will need to wait for IMPALA-5197 to be fixed.
