Impala Returns NULL Values For Some Columns in Parquet Table

0 votes
3 views
asked Aug 30, 2017 in Hadoop by admin (4,410 points)
SummaryOne possible cause for Impala to return NULL values for some of the columns stored as Parquet format.

Symptoms

When running SELECT query against a Parquet table from Impala, some or all of the columns returns NULl values, like below:

Query: select * from impalaissue
+----------+----------+----------+
| myfield1 | myfield2 | myfield3 |
+----------+----------+----------+
| NULL     | NULL     | NULL     |
| NULL     | NULL     | NULL     |
| NULL     | NULL     | NULL     |
+----------+----------+----------+

The same query returned correct results in Hive/Beeline:

select * from impalaissues;
+-----------------------+-----------------------+-----------------------+--+
| impalaissue.myfield1  | impalaissue.myfield2  | impalaissue.myfield3  |
+-----------------------+-----------------------+-----------------------+--+
| Hello                 | 2                     | B                     |
| Hola                  | 1                     | A                     |
| Ola                   | 3                     | C                     |
+-----------------------+-----------------------+-----------------------+--+
Applies To

Impala
CDH5.8.x +
Parquet

Cause

From CDH5.8.x and above, Impala contains a new setting to control how mapping of schema information from HiveMetaStore to schema stored in the Parquet format. The setting is called "PARQUET_FALLBACK_SCHEMA_RESOLUTION". For more information, please refer to: PARQUET_FALLBACK_SCHEMA_RESOLUTION Query Option (CDH 5.8 or higher only)

When "PARQUET_FALLBACK_SCHEMA_RESOLUTION=name", Impala will search for columns in Parquet file using column names, instead of positions/indexes. And because Impala does this in case-sensitive way, if the column names in Parquet are in the format of "ColumnName", but HiveMetaStore stores column name as "columnname", Impala will not be able to find a match in Parquet schema, hence NULL values will be returned.

Impala Schema:

+--------------------------------------------------------------------------------+
| result                                                                         |
+--------------------------------------------------------------------------------+
| CREATE EXTERNAL TABLE default.impalaissue (                                    |
|   myfield1 STRING COMMENT 'Inferred from Parquet file.',                       |
|   myfield2 BIGINT COMMENT 'Inferred from Parquet file.',                       |
|   myfield3 STRING COMMENT 'Inferred from Parquet file.'                        |
| )                                                                              |
| STORED AS PARQUET                                                              |
| LOCATION 'hdfs://<namenode>:8020/tmp/data'                                     |
+--------------------------------------------------------------------------------+

Parquet schema:

message example {
  optional binary MyField1 (UTF8);
  optional int64 MyField2;
  optional binary MyField3 (UTF8);
}

This has been reported upstream: IMPALA-4675 - Mixed or uppercase columns are not resolved in parquet when using PARQUET_FALLBACK_SCHEMA_RESOLUTION=NAME

Instructions

Set PARQUET_FALLBACK_SCHEMA_RESOLUTION to use "position" instead of "name" to fix the issue for the time being:

SET PARQUET_FALLBACK_SCHEMA_RESOLUTION=position;

If you can't use "position" due to the order of columns in HiveMetaStore and Parquet schema are not matching, you will have to wait for IMPALA-4675 to be fixed.

Please log in or register to answer this question.

...